Posted to solr-user@lucene.apache.org by Bastien Latard - MDPI AG <la...@mdpi.com.INVALID> on 2016/04/14 16:12:11 UTC

Solr best practices for many to many relations...

Hi Guys,

I am upgrading from solr 4.2 to 6.0.
I successfully (after some time) migrated the config files and other
parameters...

Now I'm just wondering if my indexes are following best
practices... (and they are probably not :-) )

What would be best if we have this kind of SQL data to write into Solr:


I have several different services which need (more or less) different
data based on these JOINs...

e.g.:
Service A needs lots of data (but not all),
Service B needs only a little data (some fields already included in A),
Service C needs a bit more data than B (some fields already included in
A/B)...

1. Would it be better to create one single index?
-> i.e.: this will duplicate journal info for every single article

2. Would it be better to create several specific indexes for each
similar service?
-> i.e.: this will use more space on the disks (and there are
~70 million documents to join)

3. Would it be better to create an index per table and make a join?
-> if yes, how?

Kind regards,
Bastien


Re: Solr best practices for many to many relations...

Posted by Joel Bernstein <jo...@gmail.com>.
You may also want to keep an eye on SOLR-8925 which supports distributed,
cross collection graph traversals. This may be useful in traversing the
relationships.
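
To give a rough idea, a gatherNodes() traversal might look something like
the sketch below; the collection and field names are just placeholders
and the final syntax could change before SOLR-8925 lands:

    gatherNodes(articles,
                search(journals,
                       q="publisher_id:123",
                       fl="journal_id",
                       sort="journal_id asc",
                       qt="/export"),
                walk="journal_id->journal_id",
                gather="id")

Here the inner search() streams journal_id values out of a journals
collection, and gatherNodes() walks from those values to the matching
documents in an articles collection, gathering the article ids - one way
to follow the publisher -> journal -> article relationships without
flattening everything into one document.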

Joel Bernstein
http://joelsolr.blogspot.com/

On Fri, Apr 15, 2016 at 9:56 AM, Joel Bernstein <jo...@gmail.com> wrote:

> Solr now has full distributed join capabilities as part of the Streaming
> Expression library. Keep in mind that these are distributed joins so they
> shuffle records to worker nodes to perform the joins. These are comparable
> to joins done by SQL over MapReduce systems, but they are very responsive
> and can respond with sub-second response time for fairly large joins in
> parallel mode. But these joins do lend themselves to large distributed
> architectures (lot's of shards an replicas). Target QPS also needs to be
> taken into account and tested in deciding whether these joins will meet the
> specific use case.
>
> Joel Bernstein
> http://joelsolr.blogspot.com/
>
> On Fri, Apr 15, 2016 at 9:17 AM, Dennis Gove <dp...@gmail.com> wrote:
>
>> The Streaming API with Streaming Expressions (or Parallel SQL if you want
>> to use SQL) can give you the functionality you're looking for. See
>> https://cwiki.apache.org/confluence/display/solr/Streaming+Expressions
>> and
>> https://cwiki.apache.org/confluence/display/solr/Parallel+SQL+Interface.
>> SQL queries coming in through the Parallel SQL Interface are translated
>> down into Streaming Expressions - if you need to do something that SQL
>> doesn't yet support you should check out the Streaming Expressions to see
>> if it can support it.
>>
>> With these you could store your data in separate collections (or the same
>> collection with different docType field values) and then during search
>> perform a join (inner, outer, hash) across the collections. You could, if
>> you wanted, even join with data NOT in solr using the jdbc streaming
>> function.
>>
>> - Dennis Gove
>>
>>
>> On Fri, Apr 15, 2016 at 3:21 AM, Bastien Latard - MDPI AG <
>> latard@mdpi.com.invalid> wrote:
>>
>>> '*would I then be able to query a specific field of articles or other
>>> "table" (with the same OR BETTER performances)?*'
>>> -> And especially, would I be able to get only 1 article in the result...
>>>
>>> On 15/04/2016 09:06, Bastien Latard - MDPI AG wrote:
>>>
>>> Thanks Jack.
>>>
>>> I know that Solr is a search engine, but this replace a search in my
>>> mysql DB with this model:
>>>
>>>
>>> *My goal is to improve my environment (and my performances at the same
>>> time).*
>>>
>>> *Yes, I have a Solr data model... but atm I created 4 different indexes
>>> for "similar service usage".*
>>> *So atm, for 70 millions of documents, I am duplicating journal data and
>>> publisher data all the time in 1 index (for all articles from the same
>>> journal/pub) in order to be able to retrieve all data in 1 query...*
>>>
>>> *I found yesterday that there is the possibility to create like an array
>>> of <entity> in the data-conf.xml.*
>>> e.g. (pseudo code - incomplete):
>>> <entity  name="solr_publisher" query="select name from publishers">
>>> <entity name="solr_journal" query="select name as j_name from journals
>>> WHERE publisher_id='${solr_publisher.id}'">
>>> <entity name="solr_articles" query="select title, abstract from articles
>>> WHERE journal_id='${solr_journal.id}'">
>>> <entity name="solr_authors" query="select given_name, last_name from
>>> authors WHERE article_id='${solr_article.id}'">
>>>
>>>
>>> * Would this be a good option? Is this the denormalization you were
>>> proposing? *
>>>
>>> *If yes, would I then be able to query a specific field of articles or
>>> other "table" (with the same OR BETTER performances)? If yes, I might
>>> probably merge all the different indexes together. *
>>> *I'm currently joining everything in mysql, so duplicating the fields in
>>> the solr (pseudo code):*
>>> <entity  name="all" query="select * from articles INNER JOIN journal on
>>> [...]">
>>> *So I have an index for authors query, a general one for articles (only
>>> needed info of other tables) ...*
>>>
>>> Thanks in advance for the tips. :)
>>>
>>> Kind regards,
>>> Bastien
>>>
>>> On 14/04/2016 16:23, Jack Krupansky wrote:
>>>
>>> Solr is a search engine, not a database.
>>>
>>> JOINs? Although Solr does have some limited JOIN capabilities, they are
>>> more for special situations, not the front-line go-to technique for data
>>> modeling for search.
>>>
>>> Rather, denormalization is the front-line go-to technique for data
>>> modeling in Solr.
>>>
>>> In any case, the first step in data modeling is always to focus on your
>>> queries - what information will be coming into your apps and what
>>> information will the apps want to access based on those inputs.
>>>
>>> But wait... you say you are upgrading, which suggests that you have an
>>> existing Solr data model, and probably queries as well. So...
>>>
>>> 1. Share at least a summary of your existing Solr data model as well as
>>> at least a summary of the kinds of queries you perform today.
>>> 2. Tell us what exacting is driving your inquiry - are queries too slow,
>>> too cumbersome, not sufficiently powerful, or... what exactly is the
>>> problem you need to solve.
>>>
>>>
>>> -- Jack Krupansky
>>>
>>> On Thu, Apr 14, 2016 at 10:12 AM, Bastien Latard - MDPI AG <
>>> <la...@mdpi.com.invalid> wrote:
>>>
>>>> Hi Guys,
>>>>
>>>> *I am upgrading from solr 4.2 to 6.0.*
>>>> *I successfully (after some time) migrated the config files and other
>>>> parameters...*
>>>>
>>>> Now I'm just wondering if my indexes are following the best
>>>> practices...(and they are probably not :-) )
>>>>
>>>> What would be the best if we have this kind of sql data to write in
>>>> Solr:
>>>>
>>>>
>>>> I have several different services which need (more or less), different
>>>> data based on these JOINs...
>>>>
>>>> e.g.:
>>>> Service A needs lots of data (but bot all),
>>>> Service B needs a few data (some fields already included in A),
>>>> Service C needs a bit more data than B(some fields already included in
>>>> A/B)...
>>>>
>>>> *1. Would it be better to create one single index?*
>>>> *-> i.e.: this will duplicate journal info for every single article*
>>>>
>>>> *2. Would it be better to create several specific indexes for each
>>>> similar services?*
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> *-> i.e.: this will use more space on the disks (and there are
>>>> ~70millions of documents to join) 3. Would it be better to create an index
>>>> per table and make a join? -> if yes, how?? *
>>>>
>>>> Kind regards,
>>>> Bastien
>>>>
>>>>
>>>
>>> Kind regards,
>>> Bastien Latard
>>> Web engineer
>>> --
>>> MDPI AG
>>> Postfach, CH-4005 Basel, Switzerland
>>> Office: Klybeckstrasse 64, CH-4057
>>> Tel. +41 61 683 77 35
>>> Fax: +41 61 302 89 18
>>> E-mail: latard@mdpi.com
>>> http://www.mdpi.com/
>>>
>>>
>>> Kind regards,
>>> Bastien Latard
>>> Web engineer
>>> --
>>> MDPI AG
>>> Postfach, CH-4005 Basel, Switzerland
>>> Office: Klybeckstrasse 64, CH-4057
>>> Tel. +41 61 683 77 35
>>> Fax: +41 61 302 89 18
>>> E-mail: latard@mdpi.com
>>> http://www.mdpi.com/
>>>
>>>
>>
>

Re: Solr best practices for many to many relations...

Posted by Bastien Latard - MDPI AG <la...@mdpi.com.INVALID>.
Thanks everybody.
Your answers are very interesting, however I'm not sure I understand them
properly (sorry, I'm not an expert... it might be obvious to you)...

When you're speaking about denormalization, does it mean:

1. something like this?

    <entity name="solr_publisher" query="select name from publishers">
    <entity name="solr_journal" query="select name as j_name from
    journals WHERE publisher_id='${solr_publisher.id}'">
    <entity name="solr_articles" query="select title, abstract from
    articles WHERE journal_id='${solr_journal.id}'">
    <entity name="solr_authors" query="select given_name, last_name from
    authors WHERE article_id='${solr_articles.id}'">
    -> I think that the answer is "no"... (see also the fuller
    data-config sketch after question 2 below)


2. one different index for each SQL table?
    -> if yes, how can I then retrieve all the needed data (i.e. the
    intersection)? ...a JOIN / Streaming expressions?
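
Or is the denormalization rather something like the fuller (hypothetical)
data-config sketch below, where the article is the root entity, so that
each article becomes one Solr document and the journal/publisher/author
fields are simply repeated on it? (the table and column names here are
only assumptions on my side)

    <dataConfig>
      <dataSource driver="com.mysql.jdbc.Driver"
                  url="jdbc:mysql://localhost/mydb" user="user" password="pass"/>
      <document>
        <!-- one Solr document per article; journal, publisher and author
             data are duplicated (denormalized) onto every article -->
        <entity name="article"
                query="SELECT id, title, abstract, journal_id FROM articles">
          <entity name="journal"
                  query="SELECT name AS j_name, publisher_id FROM journals
                         WHERE id = '${article.journal_id}'">
            <entity name="publisher"
                    query="SELECT name AS p_name FROM publishers
                           WHERE id = '${journal.publisher_id}'"/>
          </entity>
          <entity name="author"
                  query="SELECT given_name, last_name FROM authors
                         WHERE article_id = '${article.id}'"/>
        </entity>
      </document>
    </dataConfig>

I guess that with ~70 million articles such per-row sub-queries would be
slow without caching (e.g. CachedSqlEntityProcessor) or a single
flattened SQL JOIN as the root query.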

Otherwise, when you're speaking about JOIN, is it a join between 2 
different indexes, or between several fields of the same index?

Reminder: there are around 68 million articles, which are all
linked to 1 journal and 1 publisher... And I have 8 different services
requesting the data (so I cannot really provide a single specific use case).

Would it be better/faster to query a single denormalized index (all the
data in the same place - but a larger index because of duplicated
data), or to query several indexes (smaller indexes, but we would need
to make a solr "join")?

Thanks.

Kind regards,
Bastien


On 15/04/2016 17:20, Jack Krupansky wrote:
> And it may also be that there are whole classes of user for whom
> denormalization is just too heavy a cross to bear and for who a little
> extra money spent on more hardware is a great tradeoff.
>
> And... Lucene's indexing may be superior to your average SQL database, so
> that a Solr JOIN could be so much better than your average RDBMS SQL JOIN.
> That would be an interesting benchmark.
>
> -- Jack Krupansky
>
> On Fri, Apr 15, 2016 at 11:06 AM, Joel Bernstein <jo...@gmail.com> wrote:
>
>> I think people are going to be surprised though by the speed of the joins.
>> The joins also get faster as the number of shards, replicas and worker
>> nodes grow in the cluster. So we may see people building out large clusters
>> and and using the joins in OLTP scenarios.
>>
>> Joel Bernstein
>> http://joelsolr.blogspot.com/
>>
>> On Fri, Apr 15, 2016 at 10:58 AM, Jack Krupansky <jack.krupansky@gmail.com
>> wrote:
>>
>>> And of course it depends on the specific queries, both in terms of what
>>> fields will be searched and which fields need to be returned.
>>>
>>> Yes, OLAP is the clear sweet spot, where taking 500 ms to 2 or even 20
>>> seconds for a complex query may be just fine vs. OLTP/search where under
>>> 150 ms is the target. But, again, it will depend on the nature of the
>>> query, the cardinality of each search field, the cross product of
>>> cardinality of search fields, etc.
>>>
>>> -- Jack Krupansky
>>>
>>> On Fri, Apr 15, 2016 at 10:44 AM, Joel Bernstein <jo...@gmail.com>
>>> wrote:
>>>
>>>> In general the Streaming Expression joins are designed for interactive
>>> OLAP
>>>> type work loads. So BI and data warehousing scenarios are the sweet
>> spot.
>>>> There may be scenarios where high QPS search applications will work
>> with
>>>> the distributed joins, particularly if the joins themselves are not
>> huge.
>>>> But the specific use cases need to be tested.
>>>>
>>>> Joel Bernstein
>>>> http://joelsolr.blogspot.com/
>>>>
>>>> On Fri, Apr 15, 2016 at 10:24 AM, Jack Krupansky <
>>> jack.krupansky@gmail.com
>>>> wrote:
>>>>
>>>>> It will be interesting to see which use cases work best with the new
>>>>> streaming JOIN vs. which will remain best with full denormalization,
>> or
>>>>> whether you simply have to try both and benchmark them.
>>>>>
>>>>> My impression had been that streaming JOIN would be ideal for bulk
>>>>> operations rather than traditional-style search queries. Maybe there
>>> are
>>>>> three use cases: bulk read based on broad criteria, top-n relevance
>>>> search
>>>>> query, and specific document (or small number of documents) based on
>>>>> multiple fields.
>>>>>
>>>>> My suspicion is that doing JOIN on five tables will likely be slower
>>> than
>>>>> accessing a single document of a denormalized table/index.
>>>>>
>>>>> -- Jack Krupansky
>>>>>
>>>>> On Fri, Apr 15, 2016 at 9:56 AM, Joel Bernstein <jo...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Solr now has full distributed join capabilities as part of the
>>>> Streaming
>>>>>> Expression library. Keep in mind that these are distributed joins
>> so
>>>> they
>>>>>> shuffle records to worker nodes to perform the joins. These are
>>>>> comparable
>>>>>> to joins done by SQL over MapReduce systems, but they are very
>>>> responsive
>>>>>> and can respond with sub-second response time for fairly large
>> joins
>>> in
>>>>>> parallel mode. But these joins do lend themselves to large
>>> distributed
>>>>>> architectures (lot's of shards an replicas). Target QPS also needs
>> to
>>>> be
>>>>>> taken into account and tested in deciding whether these joins will
>>> meet
>>>>> the
>>>>>> specific use case.
>>>>>>
>>>>>> Joel Bernstein
>>>>>> http://joelsolr.blogspot.com/
>>>>>>
>>>>>> On Fri, Apr 15, 2016 at 9:17 AM, Dennis Gove <dp...@gmail.com>
>>> wrote:
>>>>>>> The Streaming API with Streaming Expressions (or Parallel SQL if
>>> you
>>>>> want
>>>>>>> to use SQL) can give you the functionality you're looking for.
>> See
>>>> https://cwiki.apache.org/confluence/display/solr/Streaming+Expressions
>>>>>>> and
>>>>>>>
>>> https://cwiki.apache.org/confluence/display/solr/Parallel+SQL+Interface.
>>>>>>> SQL queries coming in through the Parallel SQL Interface are
>>>> translated
>>>>>>> down into Streaming Expressions - if you need to do something
>> that
>>>> SQL
>>>>>>> doesn't yet support you should check out the Streaming
>> Expressions
>>> to
>>>>> see
>>>>>>> if it can support it.
>>>>>>>
>>>>>>> With these you could store your data in separate collections (or
>>> the
>>>>> same
>>>>>>> collection with different docType field values) and then during
>>>> search
>>>>>>> perform a join (inner, outer, hash) across the collections. You
>>>> could,
>>>>> if
>>>>>>> you wanted, even join with data NOT in solr using the jdbc
>>> streaming
>>>>>>> function.
>>>>>>>
>>>>>>> - Dennis Gove
>>>>>>>
>>>>>>>
>>>>>>> On Fri, Apr 15, 2016 at 3:21 AM, Bastien Latard - MDPI AG <
>>>>>>> latard@mdpi.com.invalid> wrote:
>>>>>>>
>>>>>>>> '*would I then be able to query a specific field of articles or
>>>> other
>>>>>>>> "table" (with the same OR BETTER performances)?*'
>>>>>>>> -> And especially, would I be able to get only 1 article in the
>>>>>> result...
>>>>>>>> On 15/04/2016 09:06, Bastien Latard - MDPI AG wrote:
>>>>>>>>
>>>>>>>> Thanks Jack.
>>>>>>>>
>>>>>>>> I know that Solr is a search engine, but this replace a search
>> in
>>> my
>>>>>>>> mysql DB with this model:
>>>>>>>>
>>>>>>>>
>>>>>>>> *My goal is to improve my environment (and my performances at
>> the
>>>> same
>>>>>>>> time).*
>>>>>>>>
>>>>>>>> *Yes, I have a Solr data model... but atm I created 4 different
>>>>> indexes
>>>>>>>> for "similar service usage".*
>>>>>>>> *So atm, for 70 millions of documents, I am duplicating journal
>>> data
>>>>> and
>>>>>>>> publisher data all the time in 1 index (for all articles from
>> the
>>>> same
>>>>>>>> journal/pub) in order to be able to retrieve all data in 1
>>> query...*
>>>>>>>> *I found yesterday that there is the possibility to create like
>> an
>>>>> array
>>>>>>>> of <entity> in the data-conf.xml.*
>>>>>>>> e.g. (pseudo code - incomplete):
>>>>>>>> <entity  name="solr_publisher" query="select name from
>>> publishers">
>>>>>>>> <entity name="solr_journal" query="select name as j_name from
>>>> journals
>>>>>>>> WHERE publisher_id='${solr_publisher.id}'">
>>>>>>>> <entity name="solr_articles" query="select title, abstract from
>>>>> articles
>>>>>>>> WHERE journal_id='${solr_journal.id}'">
>>>>>>>> <entity name="solr_authors" query="select given_name, last_name
>>> from
>>>>>>>> authors WHERE article_id='${solr_article.id}'">
>>>>>>>>
>>>>>>>>
>>>>>>>> * Would this be a good option? Is this the denormalization you
>>> were
>>>>>>>> proposing? *
>>>>>>>>
>>>>>>>> *If yes, would I then be able to query a specific field of
>>> articles
>>>> or
>>>>>>>> other "table" (with the same OR BETTER performances)? If yes, I
>>>> might
>>>>>>>> probably merge all the different indexes together. *
>>>>>>>> *I'm currently joining everything in mysql, so duplicating the
>>>> fields
>>>>> in
>>>>>>>> the solr (pseudo code):*
>>>>>>>> <entity  name="all" query="select * from articles INNER JOIN
>>> journal
>>>>> on
>>>>>>>> [...]">
>>>>>>>> *So I have an index for authors query, a general one for
>> articles
>>>>> (only
>>>>>>>> needed info of other tables) ...*
>>>>>>>>
>>>>>>>> Thanks in advance for the tips. :)
>>>>>>>>
>>>>>>>> Kind regards,
>>>>>>>> Bastien
>>>>>>>>
>>>>>>>> On 14/04/2016 16:23, Jack Krupansky wrote:
>>>>>>>>
>>>>>>>> Solr is a search engine, not a database.
>>>>>>>>
>>>>>>>> JOINs? Although Solr does have some limited JOIN capabilities,
>>> they
>>>>> are
>>>>>>>> more for special situations, not the front-line go-to technique
>>> for
>>>>> data
>>>>>>>> modeling for search.
>>>>>>>>
>>>>>>>> Rather, denormalization is the front-line go-to technique for
>> data
>>>>>>>> modeling in Solr.
>>>>>>>>
>>>>>>>> In any case, the first step in data modeling is always to focus
>> on
>>>>> your
>>>>>>>> queries - what information will be coming into your apps and
>> what
>>>>>>>> information will the apps want to access based on those inputs.
>>>>>>>>
>>>>>>>> But wait... you say you are upgrading, which suggests that you
>>> have
>>>> an
>>>>>>>> existing Solr data model, and probably queries as well. So...
>>>>>>>>
>>>>>>>> 1. Share at least a summary of your existing Solr data model as
>>> well
>>>>> as
>>>>>>>> at least a summary of the kinds of queries you perform today.
>>>>>>>> 2. Tell us what exacting is driving your inquiry - are queries
>> too
>>>>> slow,
>>>>>>>> too cumbersome, not sufficiently powerful, or... what exactly is
>>> the
>>>>>>>> problem you need to solve.
>>>>>>>>
>>>>>>>>
>>>>>>>> -- Jack Krupansky
>>>>>>>>
>>>>>>>> On Thu, Apr 14, 2016 at 10:12 AM, Bastien Latard - MDPI AG <
>>>>>>>> <la...@mdpi.com.invalid> wrote:
>>>>>>>>
>>>>>>>>> Hi Guys,
>>>>>>>>>
>>>>>>>>> *I am upgrading from solr 4.2 to 6.0.*
>>>>>>>>> *I successfully (after some time) migrated the config files and
>>>> other
>>>>>>>>> parameters...*
>>>>>>>>>
>>>>>>>>> Now I'm just wondering if my indexes are following the best
>>>>>>>>> practices...(and they are probably not :-) )
>>>>>>>>>
>>>>>>>>> What would be the best if we have this kind of sql data to
>> write
>>> in
>>>>>> Solr:
>>>>>>>>>
>>>>>>>>> I have several different services which need (more or less),
>>>>> different
>>>>>>>>> data based on these JOINs...
>>>>>>>>>
>>>>>>>>> e.g.:
>>>>>>>>> Service A needs lots of data (but bot all),
>>>>>>>>> Service B needs a few data (some fields already included in A),
>>>>>>>>> Service C needs a bit more data than B(some fields already
>>> included
>>>>> in
>>>>>>>>> A/B)...
>>>>>>>>>
>>>>>>>>> *1. Would it be better to create one single index?*
>>>>>>>>> *-> i.e.: this will duplicate journal info for every single
>>>> article*
>>>>>>>>> *2. Would it be better to create several specific indexes for
>>> each
>>>>>>>>> similar services?*
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> *-> i.e.: this will use more space on the disks (and there are
>>>>>>>>> ~70millions of documents to join) 3. Would it be better to
>> create
>>>> an
>>>>>> index
>>>>>>>>> per table and make a join? -> if yes, how?? *
>>>>>>>>>
>>>>>>>>> Kind regards,
>>>>>>>>> Bastien
>>>>>>>>>
>>>>>>>>>
>>>>>>>> Kind regards,
>>>>>>>> Bastien Latard
>>>>>>>> Web engineer
>>>>>>>> --
>>>>>>>> MDPI AG
>>>>>>>> Postfach, CH-4005 Basel, Switzerland
>>>>>>>> Office: Klybeckstrasse 64, CH-4057
>>>>>>>> Tel. +41 61 683 77 35
>>>>>>>> Fax: +41 61 302 89 18
>>>>>>>> E-mail: latard@mdpi.com
>>>>>>>> http://www.mdpi.com/
>>>>>>>>
>>>>>>>>
>>>>>>>> Kind regards,
>>>>>>>> Bastien Latard
>>>>>>>> Web engineer
>>>>>>>> --
>>>>>>>> MDPI AG
>>>>>>>> Postfach, CH-4005 Basel, Switzerland
>>>>>>>> Office: Klybeckstrasse 64, CH-4057
>>>>>>>> Tel. +41 61 683 77 35
>>>>>>>> Fax: +41 61 302 89 18
>>>>>>>> E-mail: latard@mdpi.com
>>>>>>>> http://www.mdpi.com/
>>>>>>>>
>>>>>>>>

Kind regards,
Bastien Latard
Web engineer
-- 
MDPI AG
Postfach, CH-4005 Basel, Switzerland
Office: Klybeckstrasse 64, CH-4057
Tel. +41 61 683 77 35
Fax: +41 61 302 89 18
E-mail:
latard@mdpi.com
http://www.mdpi.com/


Re: Solr best practices for many to many relations...

Posted by Jack Krupansky <ja...@gmail.com>.
And it may also be that there are whole classes of users for whom
denormalization is just too heavy a cross to bear and for whom a little
extra money spent on more hardware is a great tradeoff.

And... Lucene's indexing may be superior to your average SQL database, so
that a Solr JOIN could be so much better than your average RDBMS SQL JOIN.
That would be an interesting benchmark.

-- Jack Krupansky

On Fri, Apr 15, 2016 at 11:06 AM, Joel Bernstein <jo...@gmail.com> wrote:

> I think people are going to be surprised though by the speed of the joins.
> The joins also get faster as the number of shards, replicas and worker
> nodes grow in the cluster. So we may see people building out large clusters
> and and using the joins in OLTP scenarios.
>
> Joel Bernstein
> http://joelsolr.blogspot.com/
>
> On Fri, Apr 15, 2016 at 10:58 AM, Jack Krupansky <jack.krupansky@gmail.com
> >
> wrote:
>
> > And of course it depends on the specific queries, both in terms of what
> > fields will be searched and which fields need to be returned.
> >
> > Yes, OLAP is the clear sweet spot, where taking 500 ms to 2 or even 20
> > seconds for a complex query may be just fine vs. OLTP/search where under
> > 150 ms is the target. But, again, it will depend on the nature of the
> > query, the cardinality of each search field, the cross product of
> > cardinality of search fields, etc.
> >
> > -- Jack Krupansky
> >
> > On Fri, Apr 15, 2016 at 10:44 AM, Joel Bernstein <jo...@gmail.com>
> > wrote:
> >
> > > In general the Streaming Expression joins are designed for interactive
> > OLAP
> > > type work loads. So BI and data warehousing scenarios are the sweet
> spot.
> > > There may be scenarios where high QPS search applications will work
> with
> > > the distributed joins, particularly if the joins themselves are not
> huge.
> > > But the specific use cases need to be tested.
> > >
> > > Joel Bernstein
> > > http://joelsolr.blogspot.com/
> > >
> > > On Fri, Apr 15, 2016 at 10:24 AM, Jack Krupansky <
> > jack.krupansky@gmail.com
> > > >
> > > wrote:
> > >
> > > > It will be interesting to see which use cases work best with the new
> > > > streaming JOIN vs. which will remain best with full denormalization,
> or
> > > > whether you simply have to try both and benchmark them.
> > > >
> > > > My impression had been that streaming JOIN would be ideal for bulk
> > > > operations rather than traditional-style search queries. Maybe there
> > are
> > > > three use cases: bulk read based on broad criteria, top-n relevance
> > > search
> > > > query, and specific document (or small number of documents) based on
> > > > multiple fields.
> > > >
> > > > My suspicion is that doing JOIN on five tables will likely be slower
> > than
> > > > accessing a single document of a denormalized table/index.
> > > >
> > > > -- Jack Krupansky
> > > >
> > > > On Fri, Apr 15, 2016 at 9:56 AM, Joel Bernstein <jo...@gmail.com>
> > > > wrote:
> > > >
> > > > > Solr now has full distributed join capabilities as part of the
> > > Streaming
> > > > > Expression library. Keep in mind that these are distributed joins
> so
> > > they
> > > > > shuffle records to worker nodes to perform the joins. These are
> > > > comparable
> > > > > to joins done by SQL over MapReduce systems, but they are very
> > > responsive
> > > > > and can respond with sub-second response time for fairly large
> joins
> > in
> > > > > parallel mode. But these joins do lend themselves to large
> > distributed
> > > > > architectures (lot's of shards an replicas). Target QPS also needs
> to
> > > be
> > > > > taken into account and tested in deciding whether these joins will
> > meet
> > > > the
> > > > > specific use case.
> > > > >
> > > > > Joel Bernstein
> > > > > http://joelsolr.blogspot.com/
> > > > >
> > > > > On Fri, Apr 15, 2016 at 9:17 AM, Dennis Gove <dp...@gmail.com>
> > wrote:
> > > > >
> > > > > > The Streaming API with Streaming Expressions (or Parallel SQL if
> > you
> > > > want
> > > > > > to use SQL) can give you the functionality you're looking for.
> See
> > > > > >
> > > https://cwiki.apache.org/confluence/display/solr/Streaming+Expressions
> > > > > > and
> > > > > >
> > > >
> > https://cwiki.apache.org/confluence/display/solr/Parallel+SQL+Interface.
> > > > > > SQL queries coming in through the Parallel SQL Interface are
> > > translated
> > > > > > down into Streaming Expressions - if you need to do something
> that
> > > SQL
> > > > > > doesn't yet support you should check out the Streaming
> Expressions
> > to
> > > > see
> > > > > > if it can support it.
> > > > > >
> > > > > > With these you could store your data in separate collections (or
> > the
> > > > same
> > > > > > collection with different docType field values) and then during
> > > search
> > > > > > perform a join (inner, outer, hash) across the collections. You
> > > could,
> > > > if
> > > > > > you wanted, even join with data NOT in solr using the jdbc
> > streaming
> > > > > > function.
> > > > > >
> > > > > > - Dennis Gove
> > > > > >
> > > > > >
> > > > > > On Fri, Apr 15, 2016 at 3:21 AM, Bastien Latard - MDPI AG <
> > > > > > latard@mdpi.com.invalid> wrote:
> > > > > >
> > > > > >> '*would I then be able to query a specific field of articles or
> > > other
> > > > > >> "table" (with the same OR BETTER performances)?*'
> > > > > >> -> And especially, would I be able to get only 1 article in the
> > > > > result...
> > > > > >>
> > > > > >> On 15/04/2016 09:06, Bastien Latard - MDPI AG wrote:
> > > > > >>
> > > > > >> Thanks Jack.
> > > > > >>
> > > > > >> I know that Solr is a search engine, but this replace a search
> in
> > my
> > > > > >> mysql DB with this model:
> > > > > >>
> > > > > >>
> > > > > >> *My goal is to improve my environment (and my performances at
> the
> > > same
> > > > > >> time).*
> > > > > >>
> > > > > >> *Yes, I have a Solr data model... but atm I created 4 different
> > > > indexes
> > > > > >> for "similar service usage".*
> > > > > >> *So atm, for 70 millions of documents, I am duplicating journal
> > data
> > > > and
> > > > > >> publisher data all the time in 1 index (for all articles from
> the
> > > same
> > > > > >> journal/pub) in order to be able to retrieve all data in 1
> > query...*
> > > > > >>
> > > > > >> *I found yesterday that there is the possibility to create like
> an
> > > > array
> > > > > >> of <entity> in the data-conf.xml.*
> > > > > >> e.g. (pseudo code - incomplete):
> > > > > >> <entity  name="solr_publisher" query="select name from
> > publishers">
> > > > > >> <entity name="solr_journal" query="select name as j_name from
> > > journals
> > > > > >> WHERE publisher_id='${solr_publisher.id}'">
> > > > > >> <entity name="solr_articles" query="select title, abstract from
> > > > articles
> > > > > >> WHERE journal_id='${solr_journal.id}'">
> > > > > >> <entity name="solr_authors" query="select given_name, last_name
> > from
> > > > > >> authors WHERE article_id='${solr_article.id}'">
> > > > > >>
> > > > > >>
> > > > > >> * Would this be a good option? Is this the denormalization you
> > were
> > > > > >> proposing? *
> > > > > >>
> > > > > >> *If yes, would I then be able to query a specific field of
> > articles
> > > or
> > > > > >> other "table" (with the same OR BETTER performances)? If yes, I
> > > might
> > > > > >> probably merge all the different indexes together. *
> > > > > >> *I'm currently joining everything in mysql, so duplicating the
> > > fields
> > > > in
> > > > > >> the solr (pseudo code):*
> > > > > >> <entity  name="all" query="select * from articles INNER JOIN
> > journal
> > > > on
> > > > > >> [...]">
> > > > > >> *So I have an index for authors query, a general one for
> articles
> > > > (only
> > > > > >> needed info of other tables) ...*
> > > > > >>
> > > > > >> Thanks in advance for the tips. :)
> > > > > >>
> > > > > >> Kind regards,
> > > > > >> Bastien
> > > > > >>
> > > > > >> On 14/04/2016 16:23, Jack Krupansky wrote:
> > > > > >>
> > > > > >> Solr is a search engine, not a database.
> > > > > >>
> > > > > >> JOINs? Although Solr does have some limited JOIN capabilities,
> > they
> > > > are
> > > > > >> more for special situations, not the front-line go-to technique
> > for
> > > > data
> > > > > >> modeling for search.
> > > > > >>
> > > > > >> Rather, denormalization is the front-line go-to technique for
> data
> > > > > >> modeling in Solr.
> > > > > >>
> > > > > >> In any case, the first step in data modeling is always to focus
> on
> > > > your
> > > > > >> queries - what information will be coming into your apps and
> what
> > > > > >> information will the apps want to access based on those inputs.
> > > > > >>
> > > > > >> But wait... you say you are upgrading, which suggests that you
> > have
> > > an
> > > > > >> existing Solr data model, and probably queries as well. So...
> > > > > >>
> > > > > >> 1. Share at least a summary of your existing Solr data model as
> > well
> > > > as
> > > > > >> at least a summary of the kinds of queries you perform today.
> > > > > >> 2. Tell us what exacting is driving your inquiry - are queries
> too
> > > > slow,
> > > > > >> too cumbersome, not sufficiently powerful, or... what exactly is
> > the
> > > > > >> problem you need to solve.
> > > > > >>
> > > > > >>
> > > > > >> -- Jack Krupansky
> > > > > >>
> > > > > >> On Thu, Apr 14, 2016 at 10:12 AM, Bastien Latard - MDPI AG <
> > > > > >> <la...@mdpi.com.invalid> wrote:
> > > > > >>
> > > > > >>> Hi Guys,
> > > > > >>>
> > > > > >>> *I am upgrading from solr 4.2 to 6.0.*
> > > > > >>> *I successfully (after some time) migrated the config files and
> > > other
> > > > > >>> parameters...*
> > > > > >>>
> > > > > >>> Now I'm just wondering if my indexes are following the best
> > > > > >>> practices...(and they are probably not :-) )
> > > > > >>>
> > > > > >>> What would be the best if we have this kind of sql data to
> write
> > in
> > > > > Solr:
> > > > > >>>
> > > > > >>>
> > > > > >>> I have several different services which need (more or less),
> > > > different
> > > > > >>> data based on these JOINs...
> > > > > >>>
> > > > > >>> e.g.:
> > > > > >>> Service A needs lots of data (but bot all),
> > > > > >>> Service B needs a few data (some fields already included in A),
> > > > > >>> Service C needs a bit more data than B(some fields already
> > included
> > > > in
> > > > > >>> A/B)...
> > > > > >>>
> > > > > >>> *1. Would it be better to create one single index?*
> > > > > >>> *-> i.e.: this will duplicate journal info for every single
> > > article*
> > > > > >>>
> > > > > >>> *2. Would it be better to create several specific indexes for
> > each
> > > > > >>> similar services?*
> > > > > >>>
> > > > > >>>
> > > > > >>>
> > > > > >>>
> > > > > >>>
> > > > > >>> *-> i.e.: this will use more space on the disks (and there are
> > > > > >>> ~70millions of documents to join) 3. Would it be better to
> create
> > > an
> > > > > index
> > > > > >>> per table and make a join? -> if yes, how?? *
> > > > > >>>
> > > > > >>> Kind regards,
> > > > > >>> Bastien
> > > > > >>>
> > > > > >>>
> > > > > >>
> > > > > >> Kind regards,
> > > > > >> Bastien Latard
> > > > > >> Web engineer
> > > > > >> --
> > > > > >> MDPI AG
> > > > > >> Postfach, CH-4005 Basel, Switzerland
> > > > > >> Office: Klybeckstrasse 64, CH-4057
> > > > > >> Tel. +41 61 683 77 35
> > > > > >> Fax: +41 61 302 89 18
> > > > > >> E-mail: latard@mdpi.com
> > > > > >> http://www.mdpi.com/
> > > > > >>
> > > > > >>
> > > > > >> Kind regards,
> > > > > >> Bastien Latard
> > > > > >> Web engineer
> > > > > >> --
> > > > > >> MDPI AG
> > > > > >> Postfach, CH-4005 Basel, Switzerland
> > > > > >> Office: Klybeckstrasse 64, CH-4057
> > > > > >> Tel. +41 61 683 77 35
> > > > > >> Fax: +41 61 302 89 18
> > > > > >> E-mail: latard@mdpi.com
> > > > > >> http://www.mdpi.com/
> > > > > >>
> > > > > >>
> > > > > >
> > > > >
> > > >
> > >
> >
>

Re: Solr best practices for many to many relations...

Posted by Joel Bernstein <jo...@gmail.com>.
I think people are going to be surprised though by the speed of the joins.
The joins also get faster as the number of shards, replicas and worker
nodes grows in the cluster. So we may see people building out large clusters
and using the joins in OLTP scenarios.
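
As a rough sketch (the collection and field names below are placeholders),
a join is run in parallel by wrapping it in a parallel() expression and
partitioning the underlying searches on the join key:

    parallel(articles,
             innerJoin(
               search(articles, q="*:*", fl="id,title,journal_id",
                      sort="journal_id asc",
                      partitionKeys="journal_id", qt="/export"),
               search(journals, q="*:*", fl="journal_id,j_name",
                      sort="journal_id asc",
                      partitionKeys="journal_id", qt="/export"),
               on="journal_id"),
             workers="4",
             zkHost="localhost:9983",
             sort="journal_id asc")

Each of the 4 workers pulls one partition of the tuples and does its
share of the join, which is why adding shards, replicas and workers
speeds things up.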

Joel Bernstein
http://joelsolr.blogspot.com/

On Fri, Apr 15, 2016 at 10:58 AM, Jack Krupansky <ja...@gmail.com>
wrote:

> And of course it depends on the specific queries, both in terms of what
> fields will be searched and which fields need to be returned.
>
> Yes, OLAP is the clear sweet spot, where taking 500 ms to 2 or even 20
> seconds for a complex query may be just fine vs. OLTP/search where under
> 150 ms is the target. But, again, it will depend on the nature of the
> query, the cardinality of each search field, the cross product of
> cardinality of search fields, etc.
>
> -- Jack Krupansky
>
> On Fri, Apr 15, 2016 at 10:44 AM, Joel Bernstein <jo...@gmail.com>
> wrote:
>
> > In general the Streaming Expression joins are designed for interactive
> OLAP
> > type work loads. So BI and data warehousing scenarios are the sweet spot.
> > There may be scenarios where high QPS search applications will work with
> > the distributed joins, particularly if the joins themselves are not huge.
> > But the specific use cases need to be tested.
> >
> > Joel Bernstein
> > http://joelsolr.blogspot.com/
> >
> > On Fri, Apr 15, 2016 at 10:24 AM, Jack Krupansky <
> jack.krupansky@gmail.com
> > >
> > wrote:
> >
> > > It will be interesting to see which use cases work best with the new
> > > streaming JOIN vs. which will remain best with full denormalization, or
> > > whether you simply have to try both and benchmark them.
> > >
> > > My impression had been that streaming JOIN would be ideal for bulk
> > > operations rather than traditional-style search queries. Maybe there
> are
> > > three use cases: bulk read based on broad criteria, top-n relevance
> > search
> > > query, and specific document (or small number of documents) based on
> > > multiple fields.
> > >
> > > My suspicion is that doing JOIN on five tables will likely be slower
> than
> > > accessing a single document of a denormalized table/index.
> > >
> > > -- Jack Krupansky
> > >
> > > On Fri, Apr 15, 2016 at 9:56 AM, Joel Bernstein <jo...@gmail.com>
> > > wrote:
> > >
> > > > Solr now has full distributed join capabilities as part of the
> > Streaming
> > > > Expression library. Keep in mind that these are distributed joins so
> > they
> > > > shuffle records to worker nodes to perform the joins. These are
> > > comparable
> > > > to joins done by SQL over MapReduce systems, but they are very
> > responsive
> > > > and can respond with sub-second response time for fairly large joins
> in
> > > > parallel mode. But these joins do lend themselves to large
> distributed
> > > > architectures (lot's of shards an replicas). Target QPS also needs to
> > be
> > > > taken into account and tested in deciding whether these joins will
> meet
> > > the
> > > > specific use case.
> > > >
> > > > Joel Bernstein
> > > > http://joelsolr.blogspot.com/
> > > >
> > > > On Fri, Apr 15, 2016 at 9:17 AM, Dennis Gove <dp...@gmail.com>
> wrote:
> > > >
> > > > > The Streaming API with Streaming Expressions (or Parallel SQL if
> you
> > > want
> > > > > to use SQL) can give you the functionality you're looking for. See
> > > > >
> > https://cwiki.apache.org/confluence/display/solr/Streaming+Expressions
> > > > > and
> > > > >
> > >
> https://cwiki.apache.org/confluence/display/solr/Parallel+SQL+Interface.
> > > > > SQL queries coming in through the Parallel SQL Interface are
> > translated
> > > > > down into Streaming Expressions - if you need to do something that
> > SQL
> > > > > doesn't yet support you should check out the Streaming Expressions
> to
> > > see
> > > > > if it can support it.
> > > > >
> > > > > With these you could store your data in separate collections (or
> the
> > > same
> > > > > collection with different docType field values) and then during
> > search
> > > > > perform a join (inner, outer, hash) across the collections. You
> > could,
> > > if
> > > > > you wanted, even join with data NOT in solr using the jdbc
> streaming
> > > > > function.
> > > > >
> > > > > - Dennis Gove
> > > > >
> > > > >
> > > > > On Fri, Apr 15, 2016 at 3:21 AM, Bastien Latard - MDPI AG <
> > > > > latard@mdpi.com.invalid> wrote:
> > > > >
> > > > >> '*would I then be able to query a specific field of articles or
> > other
> > > > >> "table" (with the same OR BETTER performances)?*'
> > > > >> -> And especially, would I be able to get only 1 article in the
> > > > result...
> > > > >>
> > > > >> On 15/04/2016 09:06, Bastien Latard - MDPI AG wrote:
> > > > >>
> > > > >> Thanks Jack.
> > > > >>
> > > > >> I know that Solr is a search engine, but this replace a search in
> my
> > > > >> mysql DB with this model:
> > > > >>
> > > > >>
> > > > >> *My goal is to improve my environment (and my performances at the
> > same
> > > > >> time).*
> > > > >>
> > > > >> *Yes, I have a Solr data model... but atm I created 4 different
> > > indexes
> > > > >> for "similar service usage".*
> > > > >> *So atm, for 70 millions of documents, I am duplicating journal
> data
> > > and
> > > > >> publisher data all the time in 1 index (for all articles from the
> > same
> > > > >> journal/pub) in order to be able to retrieve all data in 1
> query...*
> > > > >>
> > > > >> *I found yesterday that there is the possibility to create like an
> > > array
> > > > >> of <entity> in the data-conf.xml.*
> > > > >> e.g. (pseudo code - incomplete):
> > > > >> <entity  name="solr_publisher" query="select name from
> publishers">
> > > > >> <entity name="solr_journal" query="select name as j_name from
> > journals
> > > > >> WHERE publisher_id='${solr_publisher.id}'">
> > > > >> <entity name="solr_articles" query="select title, abstract from
> > > articles
> > > > >> WHERE journal_id='${solr_journal.id}'">
> > > > >> <entity name="solr_authors" query="select given_name, last_name
> from
> > > > >> authors WHERE article_id='${solr_article.id}'">
> > > > >>
> > > > >>
> > > > >> * Would this be a good option? Is this the denormalization you
> were
> > > > >> proposing? *
> > > > >>
> > > > >> *If yes, would I then be able to query a specific field of
> articles
> > or
> > > > >> other "table" (with the same OR BETTER performances)? If yes, I
> > might
> > > > >> probably merge all the different indexes together. *
> > > > >> *I'm currently joining everything in mysql, so duplicating the
> > fields
> > > in
> > > > >> the solr (pseudo code):*
> > > > >> <entity  name="all" query="select * from articles INNER JOIN
> journal
> > > on
> > > > >> [...]">
> > > > >> *So I have an index for authors query, a general one for articles
> > > (only
> > > > >> needed info of other tables) ...*
> > > > >>
> > > > >> Thanks in advance for the tips. :)
> > > > >>
> > > > >> Kind regards,
> > > > >> Bastien
> > > > >>
> > > > >> On 14/04/2016 16:23, Jack Krupansky wrote:
> > > > >>
> > > > >> Solr is a search engine, not a database.
> > > > >>
> > > > >> JOINs? Although Solr does have some limited JOIN capabilities,
> they
> > > are
> > > > >> more for special situations, not the front-line go-to technique
> for
> > > data
> > > > >> modeling for search.
> > > > >>
> > > > >> Rather, denormalization is the front-line go-to technique for data
> > > > >> modeling in Solr.
> > > > >>
> > > > >> In any case, the first step in data modeling is always to focus on
> > > your
> > > > >> queries - what information will be coming into your apps and what
> > > > >> information will the apps want to access based on those inputs.
> > > > >>
> > > > >> But wait... you say you are upgrading, which suggests that you
> have
> > an
> > > > >> existing Solr data model, and probably queries as well. So...
> > > > >>
> > > > >> 1. Share at least a summary of your existing Solr data model as
> well
> > > as
> > > > >> at least a summary of the kinds of queries you perform today.
> > > > >> 2. Tell us what exacting is driving your inquiry - are queries too
> > > slow,
> > > > >> too cumbersome, not sufficiently powerful, or... what exactly is
> the
> > > > >> problem you need to solve.
> > > > >>
> > > > >>
> > > > >> -- Jack Krupansky
> > > > >>
> > > > >> On Thu, Apr 14, 2016 at 10:12 AM, Bastien Latard - MDPI AG <
> > > > >> <la...@mdpi.com.invalid> wrote:
> > > > >>
> > > > >>> Hi Guys,
> > > > >>>
> > > > >>> *I am upgrading from solr 4.2 to 6.0.*
> > > > >>> *I successfully (after some time) migrated the config files and
> > other
> > > > >>> parameters...*
> > > > >>>
> > > > >>> Now I'm just wondering if my indexes are following the best
> > > > >>> practices...(and they are probably not :-) )
> > > > >>>
> > > > >>> What would be the best if we have this kind of sql data to write
> in
> > > > Solr:
> > > > >>>
> > > > >>>
> > > > >>> I have several different services which need (more or less),
> > > different
> > > > >>> data based on these JOINs...
> > > > >>>
> > > > >>> e.g.:
> > > > >>> Service A needs lots of data (but bot all),
> > > > >>> Service B needs a few data (some fields already included in A),
> > > > >>> Service C needs a bit more data than B(some fields already
> included
> > > in
> > > > >>> A/B)...
> > > > >>>
> > > > >>> *1. Would it be better to create one single index?*
> > > > >>> *-> i.e.: this will duplicate journal info for every single
> > article*
> > > > >>>
> > > > >>> *2. Would it be better to create several specific indexes for
> each
> > > > >>> similar services?*
> > > > >>>
> > > > >>>
> > > > >>>
> > > > >>>
> > > > >>>
> > > > >>> *-> i.e.: this will use more space on the disks (and there are
> > > > >>> ~70millions of documents to join) 3. Would it be better to create
> > an
> > > > index
> > > > >>> per table and make a join? -> if yes, how?? *
> > > > >>>
> > > > >>> Kind regards,
> > > > >>> Bastien
> > > > >>>
> > > > >>>
> > > > >>
> > > > >> Kind regards,
> > > > >> Bastien Latard
> > > > >> Web engineer
> > > > >> --
> > > > >> MDPI AG
> > > > >> Postfach, CH-4005 Basel, Switzerland
> > > > >> Office: Klybeckstrasse 64, CH-4057
> > > > >> Tel. +41 61 683 77 35
> > > > >> Fax: +41 61 302 89 18
> > > > >> E-mail: latard@mdpi.com
> > > > >> http://www.mdpi.com/
> > > > >>
> > > > >>
> > > > >> Kind regards,
> > > > >> Bastien Latard
> > > > >> Web engineer
> > > > >> --
> > > > >> MDPI AG
> > > > >> Postfach, CH-4005 Basel, Switzerland
> > > > >> Office: Klybeckstrasse 64, CH-4057
> > > > >> Tel. +41 61 683 77 35
> > > > >> Fax: +41 61 302 89 18
> > > > >> E-mail: latard@mdpi.com
> > > > >> http://www.mdpi.com/
> > > > >>
> > > > >>
> > > > >
> > > >
> > >
> >
>

Re: Solr best practices for many to many relations...

Posted by Jack Krupansky <ja...@gmail.com>.
And of course it depends on the specific queries, both in terms of what
fields will be searched and which fields need to be returned.

Yes, OLAP is the clear sweet spot, where taking 500 ms to 2 or even 20
seconds for a complex query may be just fine vs. OLTP/search where under
150 ms is the target. But, again, it will depend on the nature of the
query, the cardinality of each search field, the cross product of
cardinality of search fields, etc.

-- Jack Krupansky

On Fri, Apr 15, 2016 at 10:44 AM, Joel Bernstein <jo...@gmail.com> wrote:

> In general the Streaming Expression joins are designed for interactive OLAP
> type work loads. So BI and data warehousing scenarios are the sweet spot.
> There may be scenarios where high QPS search applications will work with
> the distributed joins, particularly if the joins themselves are not huge.
> But the specific use cases need to be tested.
>
> Joel Bernstein
> http://joelsolr.blogspot.com/
>
> On Fri, Apr 15, 2016 at 10:24 AM, Jack Krupansky <jack.krupansky@gmail.com
> >
> wrote:
>
> > It will be interesting to see which use cases work best with the new
> > streaming JOIN vs. which will remain best with full denormalization, or
> > whether you simply have to try both and benchmark them.
> >
> > My impression had been that streaming JOIN would be ideal for bulk
> > operations rather than traditional-style search queries. Maybe there are
> > three use cases: bulk read based on broad criteria, top-n relevance
> search
> > query, and specific document (or small number of documents) based on
> > multiple fields.
> >
> > My suspicion is that doing JOIN on five tables will likely be slower than
> > accessing a single document of a denormalized table/index.
> >
> > -- Jack Krupansky
> >
> > On Fri, Apr 15, 2016 at 9:56 AM, Joel Bernstein <jo...@gmail.com>
> > wrote:
> >
> > > Solr now has full distributed join capabilities as part of the
> Streaming
> > > Expression library. Keep in mind that these are distributed joins so
> they
> > > shuffle records to worker nodes to perform the joins. These are
> > comparable
> > > to joins done by SQL over MapReduce systems, but they are very
> responsive
> > > and can respond with sub-second response time for fairly large joins in
> > > parallel mode. But these joins do lend themselves to large distributed
> > > architectures (lot's of shards an replicas). Target QPS also needs to
> be
> > > taken into account and tested in deciding whether these joins will meet
> > the
> > > specific use case.
> > >
> > > Joel Bernstein
> > > http://joelsolr.blogspot.com/
> > >
> > > On Fri, Apr 15, 2016 at 9:17 AM, Dennis Gove <dp...@gmail.com> wrote:
> > >
> > > > The Streaming API with Streaming Expressions (or Parallel SQL if you
> > want
> > > > to use SQL) can give you the functionality you're looking for. See
> > > >
> https://cwiki.apache.org/confluence/display/solr/Streaming+Expressions
> > > > and
> > > >
> > https://cwiki.apache.org/confluence/display/solr/Parallel+SQL+Interface.
> > > > SQL queries coming in through the Parallel SQL Interface are
> translated
> > > > down into Streaming Expressions - if you need to do something that
> SQL
> > > > doesn't yet support you should check out the Streaming Expressions to
> > see
> > > > if it can support it.
> > > >
> > > > With these you could store your data in separate collections (or the
> > same
> > > > collection with different docType field values) and then during
> search
> > > > perform a join (inner, outer, hash) across the collections. You
> could,
> > if
> > > > you wanted, even join with data NOT in solr using the jdbc streaming
> > > > function.
> > > >
> > > > - Dennis Gove
> > > >
> > > >
> > > > On Fri, Apr 15, 2016 at 3:21 AM, Bastien Latard - MDPI AG <
> > > > latard@mdpi.com.invalid> wrote:
> > > >
> > > >> '*would I then be able to query a specific field of articles or
> other
> > > >> "table" (with the same OR BETTER performances)?*'
> > > >> -> And especially, would I be able to get only 1 article in the
> > > result...
> > > >>
> > > >> On 15/04/2016 09:06, Bastien Latard - MDPI AG wrote:
> > > >>
> > > >> Thanks Jack.
> > > >>
> > > >> I know that Solr is a search engine, but this replace a search in my
> > > >> mysql DB with this model:
> > > >>
> > > >>
> > > >> *My goal is to improve my environment (and my performances at the
> same
> > > >> time).*
> > > >>
> > > >> *Yes, I have a Solr data model... but atm I created 4 different
> > indexes
> > > >> for "similar service usage".*
> > > >> *So atm, for 70 millions of documents, I am duplicating journal data
> > and
> > > >> publisher data all the time in 1 index (for all articles from the
> same
> > > >> journal/pub) in order to be able to retrieve all data in 1 query...*
> > > >>
> > > >> *I found yesterday that there is the possibility to create like an
> > array
> > > >> of <entity> in the data-conf.xml.*
> > > >> e.g. (pseudo code - incomplete):
> > > >> <entity  name="solr_publisher" query="select name from publishers">
> > > >> <entity name="solr_journal" query="select name as j_name from
> journals
> > > >> WHERE publisher_id='${solr_publisher.id}'">
> > > >> <entity name="solr_articles" query="select title, abstract from
> > articles
> > > >> WHERE journal_id='${solr_journal.id}'">
> > > >> <entity name="solr_authors" query="select given_name, last_name from
> > > >> authors WHERE article_id='${solr_article.id}'">
> > > >>
> > > >>
> > > >> * Would this be a good option? Is this the denormalization you were
> > > >> proposing? *
> > > >>
> > > >> *If yes, would I then be able to query a specific field of articles
> or
> > > >> other "table" (with the same OR BETTER performances)? If yes, I
> might
> > > >> probably merge all the different indexes together. *
> > > >> *I'm currently joining everything in mysql, so duplicating the
> fields
> > in
> > > >> the solr (pseudo code):*
> > > >> <entity  name="all" query="select * from articles INNER JOIN journal
> > on
> > > >> [...]">
> > > >> *So I have an index for authors query, a general one for articles
> > (only
> > > >> needed info of other tables) ...*
> > > >>
> > > >> Thanks in advance for the tips. :)
> > > >>
> > > >> Kind regards,
> > > >> Bastien
> > > >>
> > > >> On 14/04/2016 16:23, Jack Krupansky wrote:
> > > >>
> > > >> Solr is a search engine, not a database.
> > > >>
> > > >> JOINs? Although Solr does have some limited JOIN capabilities, they
> > are
> > > >> more for special situations, not the front-line go-to technique for
> > data
> > > >> modeling for search.
> > > >>
> > > >> Rather, denormalization is the front-line go-to technique for data
> > > >> modeling in Solr.
> > > >>
> > > >> In any case, the first step in data modeling is always to focus on
> > your
> > > >> queries - what information will be coming into your apps and what
> > > >> information will the apps want to access based on those inputs.
> > > >>
> > > >> But wait... you say you are upgrading, which suggests that you have
> an
> > > >> existing Solr data model, and probably queries as well. So...
> > > >>
> > > >> 1. Share at least a summary of your existing Solr data model as well
> > as
> > > >> at least a summary of the kinds of queries you perform today.
> > > >> 2. Tell us what exacting is driving your inquiry - are queries too
> > slow,
> > > >> too cumbersome, not sufficiently powerful, or... what exactly is the
> > > >> problem you need to solve.
> > > >>
> > > >>
> > > >> -- Jack Krupansky
> > > >>
> > > >> On Thu, Apr 14, 2016 at 10:12 AM, Bastien Latard - MDPI AG <
> > > >> <la...@mdpi.com.invalid> wrote:
> > > >>
> > > >>> Hi Guys,
> > > >>>
> > > >>> *I am upgrading from solr 4.2 to 6.0.*
> > > >>> *I successfully (after some time) migrated the config files and
> other
> > > >>> parameters...*
> > > >>>
> > > >>> Now I'm just wondering if my indexes are following the best
> > > >>> practices...(and they are probably not :-) )
> > > >>>
> > > >>> What would be the best if we have this kind of sql data to write in
> > > Solr:
> > > >>>
> > > >>>
> > > >>> I have several different services which need (more or less),
> > different
> > > >>> data based on these JOINs...
> > > >>>
> > > >>> e.g.:
> > > >>> Service A needs lots of data (but bot all),
> > > >>> Service B needs a few data (some fields already included in A),
> > > >>> Service C needs a bit more data than B(some fields already included
> > in
> > > >>> A/B)...
> > > >>>
> > > >>> *1. Would it be better to create one single index?*
> > > >>> *-> i.e.: this will duplicate journal info for every single
> article*
> > > >>>
> > > >>> *2. Would it be better to create several specific indexes for each
> > > >>> similar services?*
> > > >>>
> > > >>>
> > > >>>
> > > >>>
> > > >>>
> > > >>> *-> i.e.: this will use more space on the disks (and there are
> > > >>> ~70millions of documents to join) 3. Would it be better to create
> an
> > > index
> > > >>> per table and make a join? -> if yes, how?? *
> > > >>>
> > > >>> Kind regards,
> > > >>> Bastien
> > > >>>
> > > >>>
> > > >>
> > > >> Kind regards,
> > > >> Bastien Latard
> > > >> Web engineer
> > > >> --
> > > >> MDPI AG
> > > >> Postfach, CH-4005 Basel, Switzerland
> > > >> Office: Klybeckstrasse 64, CH-4057
> > > >> Tel. +41 61 683 77 35
> > > >> Fax: +41 61 302 89 18
> > > >> E-mail: latard@mdpi.com
> > > >> http://www.mdpi.com/
> > > >>
> > > >>
> > > >> Kind regards,
> > > >> Bastien Latard
> > > >> Web engineer
> > > >> --
> > > >> MDPI AG
> > > >> Postfach, CH-4005 Basel, Switzerland
> > > >> Office: Klybeckstrasse 64, CH-4057
> > > >> Tel. +41 61 683 77 35
> > > >> Fax: +41 61 302 89 18
> > > >> E-mail: latard@mdpi.com
> > > >> http://www.mdpi.com/
> > > >>
> > > >>
> > > >
> > >
> >
>

Re: Solr best practices for many to many relations...

Posted by Joel Bernstein <jo...@gmail.com>.
In general the Streaming Expression joins are designed for interactive
OLAP-type workloads. So BI and data warehousing scenarios are the sweet spot.
There may be scenarios where high QPS search applications will work with
the distributed joins, particularly if the joins themselves are not huge.
But the specific use cases need to be tested.
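
For the BI-style cases, the Parallel SQL interface sits on top of the same
streaming framework. A simple aggregation (the collection and field names
here are placeholders) can be sent to the /sql handler like this:

    curl --data-urlencode "stmt=SELECT journal_id, count(*) FROM articles
                                GROUP BY journal_id
                                ORDER BY count(*) desc LIMIT 10" \
         --data-urlencode "aggregationMode=facet" \
         http://localhost:8983/solr/articles/sql

aggregationMode=facet pushes the aggregation down into the JSON facet API,
while aggregationMode=map_reduce shuffles the tuples to worker nodes, as
with the distributed joins discussed earlier in the thread.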

Joel Bernstein
http://joelsolr.blogspot.com/

On Fri, Apr 15, 2016 at 10:24 AM, Jack Krupansky <ja...@gmail.com>
wrote:

> It will be interesting to see which use cases work best with the new
> streaming JOIN vs. which will remain best with full denormalization, or
> whether you simply have to try both and benchmark them.
>
> My impression had been that streaming JOIN would be ideal for bulk
> operations rather than traditional-style search queries. Maybe there are
> three use cases: bulk read based on broad criteria, top-n relevance search
> query, and specific document (or small number of documents) based on
> multiple fields.
>
> My suspicion is that doing JOIN on five tables will likely be slower than
> accessing a single document of a denormalized table/index.
>
> -- Jack Krupansky
>
> On Fri, Apr 15, 2016 at 9:56 AM, Joel Bernstein <jo...@gmail.com>
> wrote:
>
> > Solr now has full distributed join capabilities as part of the Streaming
> > Expression library. Keep in mind that these are distributed joins so they
> > shuffle records to worker nodes to perform the joins. These are
> comparable
> > to joins done by SQL over MapReduce systems, but they are very responsive
> > and can respond with sub-second response time for fairly large joins in
> > parallel mode. But these joins do lend themselves to large distributed
> > architectures (lot's of shards an replicas). Target QPS also needs to be
> > taken into account and tested in deciding whether these joins will meet
> the
> > specific use case.
> >
> > Joel Bernstein
> > http://joelsolr.blogspot.com/
> >
> > On Fri, Apr 15, 2016 at 9:17 AM, Dennis Gove <dp...@gmail.com> wrote:
> >
> > > The Streaming API with Streaming Expressions (or Parallel SQL if you
> want
> > > to use SQL) can give you the functionality you're looking for. See
> > > https://cwiki.apache.org/confluence/display/solr/Streaming+Expressions
> > > and
> > >
> https://cwiki.apache.org/confluence/display/solr/Parallel+SQL+Interface.
> > > SQL queries coming in through the Parallel SQL Interface are translated
> > > down into Streaming Expressions - if you need to do something that SQL
> > > doesn't yet support you should check out the Streaming Expressions to
> see
> > > if it can support it.
> > >
> > > With these you could store your data in separate collections (or the
> same
> > > collection with different docType field values) and then during search
> > > perform a join (inner, outer, hash) across the collections. You could,
> if
> > > you wanted, even join with data NOT in solr using the jdbc streaming
> > > function.
> > >
> > > - Dennis Gove
> > >
> > >
> > > On Fri, Apr 15, 2016 at 3:21 AM, Bastien Latard - MDPI AG <
> > > latard@mdpi.com.invalid> wrote:
> > >
> > >> '*would I then be able to query a specific field of articles or other
> > >> "table" (with the same OR BETTER performances)?*'
> > >> -> And especially, would I be able to get only 1 article in the
> > result...
> > >>
> > >> On 15/04/2016 09:06, Bastien Latard - MDPI AG wrote:
> > >>
> > >> Thanks Jack.
> > >>
> > >> I know that Solr is a search engine, but this replaces a search in my
> > >> MySQL DB with this model:
> > >>
> > >>
> > >> *My goal is to improve my environment (and my performances at the same
> > >> time).*
> > >>
> > >> *Yes, I have a Solr data model... but atm I created 4 different
> indexes
> > >> for "similar service usage".*
> > >> *So atm, for 70 millions of documents, I am duplicating journal data
> and
> > >> publisher data all the time in 1 index (for all articles from the same
> > >> journal/pub) in order to be able to retrieve all data in 1 query...*
> > >>
> > >> *I found yesterday that there is the possibility to create like an
> array
> > >> of <entity> in the data-conf.xml.*
> > >> e.g. (pseudo code - incomplete):
> > >> <entity  name="solr_publisher" query="select name from publishers">
> > >> <entity name="solr_journal" query="select name as j_name from journals
> > >> WHERE publisher_id='${solr_publisher.id}'">
> > >> <entity name="solr_articles" query="select title, abstract from
> articles
> > >> WHERE journal_id='${solr_journal.id}'">
> > >> <entity name="solr_authors" query="select given_name, last_name from
> > >> authors WHERE article_id='${solr_articles.id}'">
> > >>
> > >>
> > >> * Would this be a good option? Is this the denormalization you were
> > >> proposing? *
> > >>
> > >> *If yes, would I then be able to query a specific field of articles or
> > >> other "table" (with the same OR BETTER performances)? If yes, I might
> > >> probably merge all the different indexes together. *
> > >> *I'm currently joining everything in mysql, so duplicating the fields
> in
> > >> the solr (pseudo code):*
> > >> <entity  name="all" query="select * from articles INNER JOIN journal
> on
> > >> [...]">
> > >> *So I have an index for authors query, a general one for articles
> (only
> > >> needed info of other tables) ...*
> > >>
> > >> Thanks in advance for the tips. :)
> > >>
> > >> Kind regards,
> > >> Bastien
> > >>
> > >> On 14/04/2016 16:23, Jack Krupansky wrote:
> > >>
> > >> Solr is a search engine, not a database.
> > >>
> > >> JOINs? Although Solr does have some limited JOIN capabilities, they
> are
> > >> more for special situations, not the front-line go-to technique for
> data
> > >> modeling for search.
> > >>
> > >> Rather, denormalization is the front-line go-to technique for data
> > >> modeling in Solr.
> > >>
> > >> In any case, the first step in data modeling is always to focus on
> your
> > >> queries - what information will be coming into your apps and what
> > >> information will the apps want to access based on those inputs.
> > >>
> > >> But wait... you say you are upgrading, which suggests that you have an
> > >> existing Solr data model, and probably queries as well. So...
> > >>
> > >> 1. Share at least a summary of your existing Solr data model as well
> as
> > >> at least a summary of the kinds of queries you perform today.
> > >> 2. Tell us what exactly is driving your inquiry - are queries too
> slow,
> > >> too cumbersome, not sufficiently powerful, or... what exactly is the
> > >> problem you need to solve.
> > >>
> > >>
> > >> -- Jack Krupansky
> > >>
> > >> On Thu, Apr 14, 2016 at 10:12 AM, Bastien Latard - MDPI AG <
> > >> <la...@mdpi.com.invalid> wrote:
> > >>
> > >>> Hi Guys,
> > >>>
> > >>> *I am upgrading from solr 4.2 to 6.0.*
> > >>> *I successfully (after some time) migrated the config files and other
> > >>> parameters...*
> > >>>
> > >>> Now I'm just wondering if my indexes are following the best
> > >>> practices...(and they are probably not :-) )
> > >>>
> > >>> What would be the best if we have this kind of sql data to write in
> > Solr:
> > >>>
> > >>>
> > >>> I have several different services which need (more or less),
> different
> > >>> data based on these JOINs...
> > >>>
> > >>> e.g.:
> > >>> Service A needs lots of data (but not all),
> > >>> Service B needs less data (some fields already included in A),
> > >>> Service C needs a bit more data than B (some fields already included
> > >>> in A/B)...
> > >>>
> > >>> *1. Would it be better to create one single index?*
> > >>> *-> i.e.: this will duplicate journal info for every single article*
> > >>>
> > >>> *2. Would it be better to create several specific indexes for each
> > >>> similar service?*
> > >>> *-> i.e.: this will use more space on the disks (and there are
> > >>> ~70 million documents to join)*
> > >>>
> > >>> *3. Would it be better to create an index per table and make a join?*
> > >>> *-> if yes, how??*
> > >>>
> > >>> Kind regards,
> > >>> Bastien
> > >>>
> > >>>
> > >>
> > >> Kind regards,
> > >> Bastien Latard
> > >> Web engineer
> > >> --
> > >> MDPI AG
> > >> Postfach, CH-4005 Basel, Switzerland
> > >> Office: Klybeckstrasse 64, CH-4057
> > >> Tel. +41 61 683 77 35
> > >> Fax: +41 61 302 89 18
> > >> E-mail: latard@mdpi.com
> > >> http://www.mdpi.com/
> > >>
> > >>
> > >> Kind regards,
> > >> Bastien Latard
> > >> Web engineer
> > >> --
> > >> MDPI AG
> > >> Postfach, CH-4005 Basel, Switzerland
> > >> Office: Klybeckstrasse 64, CH-4057
> > >> Tel. +41 61 683 77 35
> > >> Fax: +41 61 302 89 18
> > >> E-mail: latard@mdpi.com
> > >> http://www.mdpi.com/
> > >>
> > >>
> > >
> >
>

Re: Solr best practices for many to many relations...

Posted by Jack Krupansky <ja...@gmail.com>.
It will be interesting to see which use cases work best with the new
streaming JOIN vs. which will remain best with full denormalization, or
whether you simply have to try both and benchmark them.

My impression had been that streaming JOIN would be ideal for bulk
operations rather than traditional-style search queries. Maybe there are
three use cases: bulk read based on broad criteria, top-n relevance search
query, and specific document (or small number of documents) based on
multiple fields.

My suspicion is that doing JOIN on five tables will likely be slower than
accessing a single document of a denormalized table/index.

-- Jack Krupansky

On Fri, Apr 15, 2016 at 9:56 AM, Joel Bernstein <jo...@gmail.com> wrote:

> Solr now has full distributed join capabilities as part of the Streaming
> Expression library. Keep in mind that these are distributed joins so they
> shuffle records to worker nodes to perform the joins. These are comparable
> to joins done by SQL over MapReduce systems, but they are very responsive
> and can respond with sub-second response time for fairly large joins in
> parallel mode. But these joins do lend themselves to large distributed
> architectures (lots of shards and replicas). Target QPS also needs to be
> taken into account and tested in deciding whether these joins will meet the
> specific use case.
>
> Joel Bernstein
> http://joelsolr.blogspot.com/
>
> On Fri, Apr 15, 2016 at 9:17 AM, Dennis Gove <dp...@gmail.com> wrote:
>
> > The Streaming API with Streaming Expressions (or Parallel SQL if you want
> > to use SQL) can give you the functionality you're looking for. See
> > https://cwiki.apache.org/confluence/display/solr/Streaming+Expressions
> > and
> > https://cwiki.apache.org/confluence/display/solr/Parallel+SQL+Interface.
> > SQL queries coming in through the Parallel SQL Interface are translated
> > down into Streaming Expressions - if you need to do something that SQL
> > doesn't yet support you should check out the Streaming Expressions to see
> > if it can support it.
> >
> > With these you could store your data in separate collections (or the same
> > collection with different docType field values) and then during search
> > perform a join (inner, outer, hash) across the collections. You could, if
> > you wanted, even join with data NOT in solr using the jdbc streaming
> > function.
> >
> > - Dennis Gove
> >
> >
> > On Fri, Apr 15, 2016 at 3:21 AM, Bastien Latard - MDPI AG <
> > latard@mdpi.com.invalid> wrote:
> >
> >> '*would I then be able to query a specific field of articles or other
> >> "table" (with the same OR BETTER performances)?*'
> >> -> And especially, would I be able to get only 1 article in the
> result...
> >>
> >> On 15/04/2016 09:06, Bastien Latard - MDPI AG wrote:
> >>
> >> Thanks Jack.
> >>
> >> I know that Solr is a search engine, but this replaces a search in my
> >> MySQL DB with this model:
> >>
> >>
> >> *My goal is to improve my environment (and my performances at the same
> >> time).*
> >>
> >> *Yes, I have a Solr data model... but atm I created 4 different indexes
> >> for "similar service usage".*
> >> *So atm, for 70 millions of documents, I am duplicating journal data and
> >> publisher data all the time in 1 index (for all articles from the same
> >> journal/pub) in order to be able to retrieve all data in 1 query...*
> >>
> >> *I found yesterday that there is the possibility to create like an array
> >> of <entity> in the data-conf.xml.*
> >> e.g. (pseudo code - incomplete):
> >> <entity  name="solr_publisher" query="select name from publishers">
> >> <entity name="solr_journal" query="select name as j_name from journals
> >> WHERE publisher_id='${solr_publisher.id}'">
> >> <entity name="solr_articles" query="select title, abstract from articles
> >> WHERE journal_id='${solr_journal.id}'">
> >> <entity name="solr_authors" query="select given_name, last_name from
> >> authors WHERE article_id='${solr_articles.id}'">
> >>
> >>
> >> * Would this be a good option? Is this the denormalization you were
> >> proposing? *
> >>
> >> *If yes, would I then be able to query a specific field of articles or
> >> other "table" (with the same OR BETTER performances)? If yes, I might
> >> probably merge all the different indexes together. *
> >> *I'm currently joining everything in mysql, so duplicating the fields in
> >> the solr (pseudo code):*
> >> <entity  name="all" query="select * from articles INNER JOIN journal on
> >> [...]">
> >> *So I have an index for authors query, a general one for articles (only
> >> needed info of other tables) ...*
> >>
> >> Thanks in advance for the tips. :)
> >>
> >> Kind regards,
> >> Bastien
> >>
> >> On 14/04/2016 16:23, Jack Krupansky wrote:
> >>
> >> Solr is a search engine, not a database.
> >>
> >> JOINs? Although Solr does have some limited JOIN capabilities, they are
> >> more for special situations, not the front-line go-to technique for data
> >> modeling for search.
> >>
> >> Rather, denormalization is the front-line go-to technique for data
> >> modeling in Solr.
> >>
> >> In any case, the first step in data modeling is always to focus on your
> >> queries - what information will be coming into your apps and what
> >> information will the apps want to access based on those inputs.
> >>
> >> But wait... you say you are upgrading, which suggests that you have an
> >> existing Solr data model, and probably queries as well. So...
> >>
> >> 1. Share at least a summary of your existing Solr data model as well as
> >> at least a summary of the kinds of queries you perform today.
> >> 2. Tell us what exactly is driving your inquiry - are queries too slow,
> >> too cumbersome, not sufficiently powerful, or... what exactly is the
> >> problem you need to solve.
> >>
> >>
> >> -- Jack Krupansky
> >>
> >> On Thu, Apr 14, 2016 at 10:12 AM, Bastien Latard - MDPI AG <
> >> <la...@mdpi.com.invalid> wrote:
> >>
> >>> Hi Guys,
> >>>
> >>> *I am upgrading from solr 4.2 to 6.0.*
> >>> *I successfully (after some time) migrated the config files and other
> >>> parameters...*
> >>>
> >>> Now I'm just wondering if my indexes are following the best
> >>> practices...(and they are probably not :-) )
> >>>
> >>> What would be the best if we have this kind of sql data to write in
> Solr:
> >>>
> >>>
> >>> I have several different services which need (more or less), different
> >>> data based on these JOINs...
> >>>
> >>> e.g.:
> >>> Service A needs lots of data (but not all),
> >>> Service B needs less data (some fields already included in A),
> >>> Service C needs a bit more data than B (some fields already included in
> >>> A/B)...
> >>>
> >>> *1. Would it be better to create one single index?*
> >>> *-> i.e.: this will duplicate journal info for every single article*
> >>>
> >>> *2. Would it be better to create several specific indexes for each
> >>> similar service?*
> >>> *-> i.e.: this will use more space on the disks (and there are
> >>> ~70 million documents to join)*
> >>>
> >>> *3. Would it be better to create an index per table and make a join?*
> >>> *-> if yes, how??*
> >>>
> >>> Kind regards,
> >>> Bastien
> >>>
> >>>
> >>
> >> Kind regards,
> >> Bastien Latard
> >> Web engineer
> >> --
> >> MDPI AG
> >> Postfach, CH-4005 Basel, Switzerland
> >> Office: Klybeckstrasse 64, CH-4057
> >> Tel. +41 61 683 77 35
> >> Fax: +41 61 302 89 18
> >> E-mail: latard@mdpi.com
> >> http://www.mdpi.com/
> >>
> >>
> >> Kind regards,
> >> Bastien Latard
> >> Web engineer
> >> --
> >> MDPI AG
> >> Postfach, CH-4005 Basel, Switzerland
> >> Office: Klybeckstrasse 64, CH-4057
> >> Tel. +41 61 683 77 35
> >> Fax: +41 61 302 89 18
> >> E-mail: latard@mdpi.com
> >> http://www.mdpi.com/
> >>
> >>
> >
>

Re: Solr best practices for many to many relations...

Posted by Joel Bernstein <jo...@gmail.com>.
Solr now has full distributed join capabilities as part of the Streaming
Expression library. Keep in mind that these are distributed joins so they
shuffle records to worker nodes to perform the joins. These are comparable
to joins done by SQL over MapReduce systems, but they are very responsive
and can respond with sub-second response time for fairly large joins in
parallel mode. But these joins do lend themselves to large distributed
architectures (lots of shards and replicas). Target QPS also needs to be
taken into account and tested in deciding whether these joins will meet the
specific use case.
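
As a sketch only (the "workers" collection, the articles/journals collections,
the join key, and the worker count below are all invented for illustration,
not a tested configuration), a parallel join of two collections looks roughly
like this:

parallel(workers,
         innerJoin(
           search(articles, q="*:*", fl="id,title,journal_id",
                  sort="journal_id asc", partitionKeys="journal_id",
                  qt="/export"),
           search(journals, q="*:*", fl="journal_id,journal_name",
                  sort="journal_id asc", partitionKeys="journal_id",
                  qt="/export"),
           on="journal_id"),
         workers="4",
         sort="journal_id asc")

The partitionKeys parameter is what routes records with the same join key to
the same worker node, so each worker can join its partition independently.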

Joel Bernstein
http://joelsolr.blogspot.com/

On Fri, Apr 15, 2016 at 9:17 AM, Dennis Gove <dp...@gmail.com> wrote:

> The Streaming API with Streaming Expressions (or Parallel SQL if you want
> to use SQL) can give you the functionality you're looking for. See
> https://cwiki.apache.org/confluence/display/solr/Streaming+Expressions
> and
> https://cwiki.apache.org/confluence/display/solr/Parallel+SQL+Interface.
> SQL queries coming in through the Parallel SQL Interface are translated
> down into Streaming Expressions - if you need to do something that SQL
> doesn't yet support you should check out the Streaming Expressions to see
> if it can support it.
>
> With these you could store your data in separate collections (or the same
> collection with different docType field values) and then during search
> perform a join (inner, outer, hash) across the collections. You could, if
> you wanted, even join with data NOT in solr using the jdbc streaming
> function.
>
> - Dennis Gove
>
>
> On Fri, Apr 15, 2016 at 3:21 AM, Bastien Latard - MDPI AG <
> latard@mdpi.com.invalid> wrote:
>
>> '*would I then be able to query a specific field of articles or other
>> "table" (with the same OR BETTER performances)?*'
>> -> And especially, would I be able to get only 1 article in the result...
>>
>> On 15/04/2016 09:06, Bastien Latard - MDPI AG wrote:
>>
>> Thanks Jack.
>>
>> I know that Solr is a search engine, but this replaces a search in my
>> MySQL DB with this model:
>>
>>
>> *My goal is to improve my environment (and my performances at the same
>> time).*
>>
>> *Yes, I have a Solr data model... but atm I created 4 different indexes
>> for "similar service usage".*
>> *So atm, for 70 millions of documents, I am duplicating journal data and
>> publisher data all the time in 1 index (for all articles from the same
>> journal/pub) in order to be able to retrieve all data in 1 query...*
>>
>> *I found yesterday that there is the possibility to create like an array
>> of <entity> in the data-conf.xml.*
>> e.g. (pseudo code - incomplete):
>> <entity  name="solr_publisher" query="select name from publishers">
>> <entity name="solr_journal" query="select name as j_name from journals
>> WHERE publisher_id='${solr_publisher.id}'">
>> <entity name="solr_articles" query="select title, abstract from articles
>> WHERE journal_id='${solr_journal.id}'">
>> <entity name="solr_authors" query="select given_name, last_name from
>> authors WHERE article_id='${solr_articles.id}'">
>>
>>
>> * Would this be a good option? Is this the denormalization you were
>> proposing? *
>>
>> *If yes, would I then be able to query a specific field of articles or
>> other "table" (with the same OR BETTER performances)? If yes, I might
>> probably merge all the different indexes together. *
>> *I'm currently joining everything in mysql, so duplicating the fields in
>> the solr (pseudo code):*
>> <entity  name="all" query="select * from articles INNER JOIN journal on
>> [...]">
>> *So I have an index for authors query, a general one for articles (only
>> needed info of other tables) ...*
>>
>> Thanks in advance for the tips. :)
>>
>> Kind regards,
>> Bastien
>>
>> On 14/04/2016 16:23, Jack Krupansky wrote:
>>
>> Solr is a search engine, not a database.
>>
>> JOINs? Although Solr does have some limited JOIN capabilities, they are
>> more for special situations, not the front-line go-to technique for data
>> modeling for search.
>>
>> Rather, denormalization is the front-line go-to technique for data
>> modeling in Solr.
>>
>> In any case, the first step in data modeling is always to focus on your
>> queries - what information will be coming into your apps and what
>> information will the apps want to access based on those inputs.
>>
>> But wait... you say you are upgrading, which suggests that you have an
>> existing Solr data model, and probably queries as well. So...
>>
>> 1. Share at least a summary of your existing Solr data model as well as
>> at least a summary of the kinds of queries you perform today.
>> 2. Tell us what exactly is driving your inquiry - are queries too slow,
>> too cumbersome, not sufficiently powerful, or... what exactly is the
>> problem you need to solve.
>>
>>
>> -- Jack Krupansky
>>
>> On Thu, Apr 14, 2016 at 10:12 AM, Bastien Latard - MDPI AG <
>> <la...@mdpi.com.invalid> wrote:
>>
>>> Hi Guys,
>>>
>>> *I am upgrading from solr 4.2 to 6.0.*
>>> *I successfully (after some time) migrated the config files and other
>>> parameters...*
>>>
>>> Now I'm just wondering if my indexes are following the best
>>> practices...(and they are probably not :-) )
>>>
>>> What would be the best if we have this kind of sql data to write in Solr:
>>>
>>>
>>> I have several different services which need (more or less), different
>>> data based on these JOINs...
>>>
>>> e.g.:
>>> Service A needs lots of data (but not all),
>>> Service B needs less data (some fields already included in A),
>>> Service C needs a bit more data than B (some fields already included in
>>> A/B)...
>>>
>>> *1. Would it be better to create one single index?*
>>> *-> i.e.: this will duplicate journal info for every single article*
>>>
>>> *2. Would it be better to create several specific indexes for each
>>> similar service?*
>>> *-> i.e.: this will use more space on the disks (and there are
>>> ~70 million documents to join)*
>>>
>>> *3. Would it be better to create an index per table and make a join?*
>>> *-> if yes, how??*
>>>
>>> Kind regards,
>>> Bastien
>>>
>>>
>>
>> Kind regards,
>> Bastien Latard
>> Web engineer
>> --
>> MDPI AG
>> Postfach, CH-4005 Basel, Switzerland
>> Office: Klybeckstrasse 64, CH-4057
>> Tel. +41 61 683 77 35
>> Fax: +41 61 302 89 18
>> E-mail: latard@mdpi.com
>> http://www.mdpi.com/
>>
>>
>> Kind regards,
>> Bastien Latard
>> Web engineer
>> --
>> MDPI AG
>> Postfach, CH-4005 Basel, Switzerland
>> Office: Klybeckstrasse 64, CH-4057
>> Tel. +41 61 683 77 35
>> Fax: +41 61 302 89 18
>> E-mail: latard@mdpi.com
>> http://www.mdpi.com/
>>
>>
>

Re: Solr best practices for many to many relations...

Posted by Dennis Gove <dp...@gmail.com>.
The Streaming API with Streaming Expressions (or Parallel SQL if you want
to use SQL) can give you the functionality you're looking for. See
https://cwiki.apache.org/confluence/display/solr/Streaming+Expressions and
https://cwiki.apache.org/confluence/display/solr/Parallel+SQL+Interface.
SQL queries coming in through the Parallel SQL Interface are translated
down into Streaming Expressions - if you need to do something that SQL
doesn't yet support you should check out the Streaming Expressions to see
if it can support it.

With these you could store your data in separate collections (or the same
collection with different docType field values) and then during search
perform a join (inner, outer, hash) across the collections. You could, if
you wanted, even join with data NOT in solr using the jdbc streaming
function.
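
As a small sketch (collection and field names here are made up), an inner
join of two collections on a shared key could be written as:

innerJoin(
  search(articles, q="*:*", fl="id,title,journal_id",
         sort="journal_id asc", qt="/export"),
  search(journals, q="*:*", fl="journal_id,journal_name",
         sort="journal_id asc", qt="/export"),
  on="journal_id")

innerJoin expects both streams to be sorted on the join key; hashJoin is the
alternative when you would rather read one side fully into memory than sort
it.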

- Dennis Gove


On Fri, Apr 15, 2016 at 3:21 AM, Bastien Latard - MDPI AG <
latard@mdpi.com.invalid> wrote:

> '*would I then be able to query a specific field of articles or other
> "table" (with the same OR BETTER performances)?*'
> -> And especially, would I be able to get only 1 article in the result...
>
> On 15/04/2016 09:06, Bastien Latard - MDPI AG wrote:
>
> Thanks Jack.
>
> I know that Solr is a search engine, but this replaces a search in my MySQL
> DB with this model:
>
>
> *My goal is to improve my environment (and my performances at the same
> time).*
>
> *Yes, I have a Solr data model... but atm I created 4 different indexes
> for "similar service usage".*
> *So atm, for 70 millions of documents, I am duplicating journal data and
> publisher data all the time in 1 index (for all articles from the same
> journal/pub) in order to be able to retrieve all data in 1 query...*
>
> *I found yesterday that there is the possibility to create like an array
> of <entity> in the data-conf.xml.*
> e.g. (pseudo code - incomplete):
> <entity  name="solr_publisher" query="select name from publishers">
> <entity name="solr_journal" query="select name as j_name from journals
> WHERE publisher_id='${solr_publisher.id}'">
> <entity name="solr_articles" query="select title, abstract from articles
> WHERE journal_id='${solr_journal.id}'">
> <entity name="solr_authors" query="select given_name, last_name from
> authors WHERE article_id='${solr_articles.id}'">
>
>
> * Would this be a good option? Is this the denormalization you were
> proposing? *
>
> *If yes, would I then be able to query a specific field of articles or
> other "table" (with the same OR BETTER performances)? If yes, I might
> probably merge all the different indexes together. *
> *I'm currently joining everything in mysql, so duplicating the fields in
> the solr (pseudo code):*
> <entity  name="all" query="select * from articles INNER JOIN journal on
> [...]">
> *So I have an index for authors query, a general one for articles (only
> needed info of other tables) ...*
>
> Thanks in advance for the tips. :)
>
> Kind regards,
> Bastien
>
> On 14/04/2016 16:23, Jack Krupansky wrote:
>
> Solr is a search engine, not a database.
>
> JOINs? Although Solr does have some limited JOIN capabilities, they are
> more for special situations, not the front-line go-to technique for data
> modeling for search.
>
> Rather, denormalization is the front-line go-to technique for data
> modeling in Solr.
>
> In any case, the first step in data modeling is always to focus on your
> queries - what information will be coming into your apps and what
> information will the apps want to access based on those inputs.
>
> But wait... you say you are upgrading, which suggests that you have an
> existing Solr data model, and probably queries as well. So...
>
> 1. Share at least a summary of your existing Solr data model as well as at
> least a summary of the kinds of queries you perform today.
> 2. Tell us what exactly is driving your inquiry - are queries too slow,
> too cumbersome, not sufficiently powerful, or... what exactly is the
> problem you need to solve.
>
>
> -- Jack Krupansky
>
> On Thu, Apr 14, 2016 at 10:12 AM, Bastien Latard - MDPI AG <
> <la...@mdpi.com.invalid> wrote:
>
>> Hi Guys,
>>
>> *I am upgrading from solr 4.2 to 6.0.*
>> *I successfully (after some time) migrated the config files and other
>> parameters...*
>>
>> Now I'm just wondering if my indexes are following the best
>> practices...(and they are probably not :-) )
>>
>> What would be the best if we have this kind of sql data to write in Solr:
>>
>>
>> I have several different services which need (more or less), different
>> data based on these JOINs...
>>
>> e.g.:
>> Service A needs lots of data (but not all),
>> Service B needs less data (some fields already included in A),
>> Service C needs a bit more data than B (some fields already included in
>> A/B)...
>>
>> *1. Would it be better to create one single index?*
>> *-> i.e.: this will duplicate journal info for every single article*
>>
>> *2. Would it be better to create several specific indexes for each
>> similar service?*
>> *-> i.e.: this will use more space on the disks (and there are
>> ~70 million documents to join)*
>>
>> *3. Would it be better to create an index per table and make a join?*
>> *-> if yes, how??*
>>
>> Kind regards,
>> Bastien
>>
>>
>
> Kind regards,
> Bastien Latard
> Web engineer
> --
> MDPI AG
> Postfach, CH-4005 Basel, Switzerland
> Office: Klybeckstrasse 64, CH-4057
> Tel. +41 61 683 77 35
> Fax: +41 61 302 89 18
> E-mail: latard@mdpi.com
> http://www.mdpi.com/
>
>
> Kind regards,
> Bastien Latard
> Web engineer
> --
> MDPI AG
> Postfach, CH-4005 Basel, Switzerland
> Office: Klybeckstrasse 64, CH-4057
> Tel. +41 61 683 77 35
> Fax: +41 61 302 89 18
> E-mail: latard@mdpi.com
> http://www.mdpi.com/
>
>

Re: Solr best practices for many to many relations...

Posted by Bastien Latard - MDPI AG <la...@mdpi.com.INVALID>.
'/would I then be able to query a specific field of articles or other 
"table" (with the same OR BETTER performances)?/'
-> And especially, would I be able to get only 1 article in the result...

On 15/04/2016 09:06, Bastien Latard - MDPI AG wrote:
> Thanks Jack.
>
> I know that Solr is a search engine, but this replaces a search in my
> MySQL DB with this model:
>
>
> *My goal is to improve my environment (and my performances at the same 
> time).*
> /
> //Yes, I have a Solr data model... but atm I created 4 different 
> indexes for "similar service usage".//
> //So atm, for 70 millions of documents, I am duplicating journal data 
> and publisher data all the time in 1 index (for all articles from the 
> same journal/pub) in order to be able to retrieve all data in 1 query.../
>
> *I found yesterday that there is the possibility to create like an 
> array of <entity> in the data-conf.xml.*
> e.g. (pseudo code - incomplete):
> <entity  name="solr_publisher" query="select name from publishers">
> <entity name="solr_journal" query="select name as j_name from journals 
> WHERE publisher_id='${solr_publisher.id}'">
> <entity name="solr_articles" query="select title, abstract from 
> articles WHERE journal_id='${solr_journal.id}'">
> <entity name="solr_authors" query="select given_name, last_name from 
> authors WHERE article_id='${solr_articles.id}'">
> *
> Would this be a good option? Is this the denormalization you were 
> proposing?
> */If yes, would I then be able to query a specific field of articles 
> or other "table" (with the same OR BETTER performances)?
> If yes, I might probably merge all the different indexes together.
> /*
> */I'm currently joining everything in mysql, so duplicating the fields 
> in the solr (pseudo code):/
> <entity  name="all" query="select * from articles INNER JOIN journal 
> on [...]">*
> */So I have an index for authors query, a general one for articles 
> (only needed info of other tables) .../*
>
> *Thanks in advance for the tips. :)
> *
> *Kind regards,
> Bastien*
> *
>
> On 14/04/2016 16:23, Jack Krupansky wrote:
>> Solr is a search engine, not a database.
>>
>> JOINs? Although Solr does have some limited JOIN capabilities, they 
>> are more for special situations, not the front-line go-to technique 
>> for data modeling for search.
>>
>> Rather, denormalization is the front-line go-to technique for data 
>> modeling in Solr.
>>
>> In any case, the first step in data modeling is always to focus on 
>> your queries - what information will be coming into your apps and 
>> what information will the apps want to access based on those inputs.
>>
>> But wait... you say you are upgrading, which suggests that you have 
>> an existing Solr data model, and probably queries as well. So...
>>
>> 1. Share at least a summary of your existing Solr data model as well 
>> as at least a summary of the kinds of queries you perform today.
>> 2. Tell us what exactly is driving your inquiry - are queries too
>> slow, too cumbersome, not sufficiently powerful, or... what exactly 
>> is the problem you need to solve.
>>
>>
>> -- Jack Krupansky
>>
>> On Thu, Apr 14, 2016 at 10:12 AM, Bastien Latard - MDPI AG 
>> <la...@mdpi.com.invalid> wrote:
>>
>>     Hi Guys,
>>
>>     /I am upgrading from solr 4.2 to 6.0.//
>>     //I successfully (after some time) migrated the config files and
>>     other parameters.../
>>
>>     Now I'm just wondering if my indexes are following the best
>>     practices...(and they are probably not :-) )
>>
>>     What would be the best if we have this kind of sql data to write
>>     in Solr:
>>
>>
>>     I have several different services which need (more or less),
>>     different data based on these JOINs...
>>
>>     e.g.:
>>     Service A needs lots of data (but not all),
>>     Service B needs less data (some fields already included in A),
>>     Service C needs a bit more data than B (some fields already
>>     included in A/B)...
>>
>>     *1. Would it be better to create one single index?**
>>     **-> i.e.: this will duplicate journal info for every single
>>     article**
>>     **
>>     **2. Would it be better to create several specific indexes for
>>     each similar service?**
>>     **-> i.e.: this will use more space on the disks (and there are
>>     ~70 million documents to join)
>>
>>     3. Would it be better to create an index per table and make a join?
>>     -> if yes, how??
>>
>>     *
>>
>>     Kind regards,
>>     Bastien
>>
>>
>
> Kind regards,
> Bastien Latard
> Web engineer
> -- 
> MDPI AG
> Postfach, CH-4005 Basel, Switzerland
> Office: Klybeckstrasse 64, CH-4057
> Tel. +41 61 683 77 35
> Fax: +41 61 302 89 18
> E-mail:
> latard@mdpi.com
> http://www.mdpi.com/

Kind regards,
Bastien Latard
Web engineer
-- 
MDPI AG
Postfach, CH-4005 Basel, Switzerland
Office: Klybeckstrasse 64, CH-4057
Tel. +41 61 683 77 35
Fax: +41 61 302 89 18
E-mail:
latard@mdpi.com
http://www.mdpi.com/


Re: Solr best practices for many to many relations...

Posted by Bastien Latard - MDPI AG <la...@mdpi.com.INVALID>.
Thanks Jack.

I know that Solr is a search engine, but this replaces a search in my
MySQL DB with this model:


*My goal is to improve my environment (and my performances at the same 
time).*
/
//Yes, I have a Solr data model... but atm I created 4 different indexes 
for "similar service usage".//
//So atm, for 70 millions of documents, I am duplicating journal data 
and publisher data all the time in 1 index (for all articles from the 
same journal/pub) in order to be able to retrieve all data in 1 query.../

*I found yesterday that there is the possibility to create like an array 
of <entity> in the data-conf.xml.*
e.g. (pseudo code - incomplete):
<entity  name="solr_publisher" query="select name from publishers">
<entity name="solr_journal" query="select name as j_name from journals 
WHERE publisher_id='${solr_publisher.id}'">
<entity name="solr_articles" query="select title, abstract from articles 
WHERE journal_id='${solr_journal.id}'">
<entity name="solr_authors" query="select given_name, last_name from 
authors WHERE article_id='${solr_articles.id}'">
*
Would this be a good option? Is this the denormalization you were proposing?
*/If yes, would I then be able to query a specific field of articles or 
other "table" (with the same OR BETTER performances)?
If yes, I might probably merge all the different indexes together.
/*
*/I'm currently joining everything in mysql, so duplicating the fields 
in the solr (pseudo code):/
<entity  name="all" query="select * from articles INNER JOIN journal on 
[...]">*
*/So I have an index for authors query, a general one for articles (only 
needed info of other tables) .../*
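
A fuller (still untested) sketch of that nested idea, with made-up table and
column names, would put the article as the root entity, so that each Solr
document is one article and the journal, publisher and author fields get
flattened onto it:

<dataConfig>
  <dataSource driver="com.mysql.jdbc.Driver"
              url="jdbc:mysql://localhost/mydb" user="user" password="pass"/>
  <document>
    <!-- root entity: one Solr document per article -->
    <entity name="article"
            query="SELECT id, title, abstract, journal_id FROM articles">
      <entity name="journal"
              query="SELECT name AS journal_name, publisher_id
                     FROM journals WHERE id = '${article.journal_id}'">
        <entity name="publisher"
                query="SELECT name AS publisher_name
                       FROM publishers WHERE id = '${journal.publisher_id}'"/>
      </entity>
      <entity name="author"
              query="SELECT given_name, last_name
                     FROM authors WHERE article_id = '${article.id}'"/>
    </entity>
  </document>
</dataConfig>

(The author fields would need to be multiValued in the schema, since one
article has several authors.)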

*Thanks in advance for the tips. :)
*
*Kind regards,
Bastien*
*

On 14/04/2016 16:23, Jack Krupansky wrote:
> Solr is a search engine, not a database.
>
> JOINs? Although Solr does have some limited JOIN capabilities, they 
> are more for special situations, not the front-line go-to technique 
> for data modeling for search.
>
> Rather, denormalization is the front-line go-to technique for data 
> modeling in Solr.
>
> In any case, the first step in data modeling is always to focus on 
> your queries - what information will be coming into your apps and what 
> information will the apps want to access based on those inputs.
>
> But wait... you say you are upgrading, which suggests that you have an 
> existing Solr data model, and probably queries as well. So...
>
> 1. Share at least a summary of your existing Solr data model as well 
> as at least a summary of the kinds of queries you perform today.
> 2. Tell us what exactly is driving your inquiry - are queries too
> slow, too cumbersome, not sufficiently powerful, or... what exactly is 
> the problem you need to solve.
>
>
> -- Jack Krupansky
>
> On Thu, Apr 14, 2016 at 10:12 AM, Bastien Latard - MDPI AG 
> <latard@mdpi.com.invalid <ma...@mdpi.com.invalid>> wrote:
>
>     Hi Guys,
>
>     /I am upgrading from solr 4.2 to 6.0.//
>     //I successfully (after some time) migrated the config files and
>     other parameters.../
>
>     Now I'm just wondering if my indexes are following the best
>     practices...(and they are probably not :-) )
>
>     What would be the best if we have this kind of sql data to write
>     in Solr:
>
>
>     I have several different services which need (more or less),
>     different data based on these JOINs...
>
>     e.g.:
>     Service A needs lots of data (but not all),
>     Service B needs less data (some fields already included in A),
>     Service C needs a bit more data than B (some fields already
>     included in A/B)...
>
>     *1. Would it be better to create one single index?**
>     **-> i.e.: this will duplicate journal info for every single article**
>     **
>     **2. Would it be better to create several specific indexes for
>     each similar service?**
>     **-> i.e.: this will use more space on the disks (and there are
>     ~70 million documents to join)
>
>     3. Would it be better to create an index per table and make a join?
>     -> if yes, how??
>
>     *
>
>     Kind regards,
>     Bastien
>
>

Kind regards,
Bastien Latard
Web engineer
-- 
MDPI AG
Postfach, CH-4005 Basel, Switzerland
Office: Klybeckstrasse 64, CH-4057
Tel. +41 61 683 77 35
Fax: +41 61 302 89 18
E-mail:
latard@mdpi.com
http://www.mdpi.com/


Re: Solr best practices for many to many relations...

Posted by Jack Krupansky <ja...@gmail.com>.
Solr is a search engine, not a database.

JOINs? Although Solr does have some limited JOIN capabilities, they are
more for special situations, not the front-line go-to technique for data
modeling for search.

Rather, denormalization is the front-line go-to technique for data modeling
in Solr.
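
(As a toy illustration with invented field names: in a denormalized model each
article document simply repeats the journal and publisher values, e.g.

{
  "id": "article-42",
  "title": "Some article title",
  "authors": ["A. Author", "B. Author"],
  "journal_name": "Some Journal",
  "publisher_name": "Some Publisher"
}

so a query never has to reach outside the one document it matched.)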

In any case, the first step in data modeling is always to focus on your
queries - what information will be coming into your apps and what
information will the apps want to access based on those inputs.

But wait... you say you are upgrading, which suggests that you have an
existing Solr data model, and probably queries as well. So...

1. Share at least a summary of your existing Solr data model as well as at
least a summary of the kinds of queries you perform today.
2. Tell us what exactly is driving your inquiry - are queries too slow,
too cumbersome, not sufficiently powerful, or... what exactly is the
problem you need to solve.


-- Jack Krupansky

On Thu, Apr 14, 2016 at 10:12 AM, Bastien Latard - MDPI AG <
latard@mdpi.com.invalid> wrote:

> Hi Guys,
>
> *I am upgrading from solr 4.2 to 6.0.*
> *I successfully (after some time) migrated the config files and other
> parameters...*
>
> Now I'm just wondering if my indexes are following the best
> practices...(and they are probably not :-) )
>
> What would be the best if we have this kind of sql data to write in Solr:
>
>
> I have several different services which need (more or less), different
> data based on these JOINs...
>
> e.g.:
> Service A needs lots of data (but not all),
> Service B needs less data (some fields already included in A),
> Service C needs a bit more data than B (some fields already included in
> A/B)...
>
> *1. Would it be better to create one single index?*
> *-> i.e.: this will duplicate journal info for every single article*
>
> *2. Would it be better to create several specific indexes for each similar
> service?*
> *-> i.e.: this will use more space on the disks (and there are ~70 million
> documents to join)*
>
> *3. Would it be better to create an index per table and make a join?*
> *-> if yes, how??*
>
> Kind regards,
> Bastien
>
>