Posted to solr-user@lucene.apache.org by Troy Edwards <te...@gmail.com> on 2015/12/24 04:43:28 UTC

Solr 6 - Relational Index querying

In Solr 5.1.0 we had to flatten two collections into one:

Item - about 1.5 million items, primary key ItemId (this mainly
contains the item description)

FacilityItem - about 10,000 facilities, primary key FacilityItemId
(pricing information for each facility); ItemId points to Item

We are currently using this index for only about 200 facilities. We are
using the edismax parser to query and boost results.

I am hoping that in Solr 6, with Parallel SQL or a streaming innerJoin, we
can keep two separate collections, which would make updates easier.

But so far I have not seen anything that exactly fits what we need.

Any thoughts/suggestions on what documentation to read or any samples on
how to approach what we are trying to achieve?

Thanks

Re: Solr 6 - Relational Index querying

Posted by Yonik Seeley <ys...@gmail.com>.
On Mon, Dec 28, 2015 at 9:11 AM, Joel Bernstein <jo...@gmail.com> wrote:
> In
> order to join result sets you would typically need to be working with the
> entire result sets from both sides of the join, which may be too slow
> without the /export handler.

We now have https://issues.apache.org/jira/browse/SOLR-8220
"Read field from docValues for non stored fields"
And there is an issue open for how to optimize for cases when fields
have both docValues and are stored.

Then the other missing optimization is to use a more efficient sort
(and possibly defer sorting until streaming) when dealing with an
entire result set.

-Yonik

Re: Solr 6 - Relational Index querying

Posted by Joel Bernstein <jo...@gmail.com>.
Yes, that would work. Each search(...) has its own specific params and can
point to any handler that conforms to the output format of the /select
handler.


Joel Bernstein
http://joelsolr.blogspot.com/

On Mon, Dec 28, 2015 at 11:12 AM, Dennis Gove <dp...@gmail.com> wrote:

> Correct me if I'm wrong, but I believe one can mix the /export and /select
> handlers within a single streaming expression. This would let you use the
> /select handler in the search(...) clause where a score is necessary and
> the /export handler in the search(...) clauses where it is not. Assuming
> the query in the scored clause limits the result set to a reasonable size,
> this might get you around the performance problems of using the /select
> handler on the other, potentially large, streams being joined.

Re: Solr 6 - Relational Index querying

Posted by Dennis Gove <dp...@gmail.com>.
Correct me if I'm wrong, but I believe one can mix the /export and /select
handlers within a single streaming expression. This would let you use the
/select handler in the search(...) clause where a score is necessary and
the /export handler in the search(...) clauses where it is not. Assuming
the query in the scored clause limits the result set to a reasonable size,
this might get you around the performance problems of using the /select
handler on the other, potentially large, streams being joined.
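
A rough sketch of what that might look like, reusing the hashJoin from
earlier in the thread. The qt parameter for choosing the handler, the
rows limit, and asking for score in fl are assumptions here rather than
something confirmed in this thread. Also note that the /export handler
needs docValues on the fields named in fl and sort.

hashJoin(
  search(items, qt="/select", fl="itemId,itemDescription,score",
         q="itemDescription:bear", sort="score desc", rows="100"),
  hashed=search(facilityItems, qt="/export",
                fl="itemId,facilityName,cost", q="*:*",
                sort="itemId asc"),
  on="itemId"
)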

Re: Solr 6 - Relational Index querying

Posted by Joel Bernstein <jo...@gmail.com>.
I'll add one important caveat:

At this time the /export handler does not support returning scores. In
order to join result sets you would typically need to be working with the
entire result sets from both sides of the join, which may be too slow
without the /export handler. But if you're working with smaller result sets
it will be possible to use the default /select handler which will return
scores.

Adding scores to the /export handler does need to get on the roadmap. The
initial release of the Streaming API was really designed for OLAP-type
queries, which typically don't involve scoring.

Joel Bernstein
http://joelsolr.blogspot.com/

Re: Solr 6 - Relational Index querying

Posted by Dennis Gove <dp...@gmail.com>.
There have been a lot of new features added to the Streaming API and the
documentation hasn't kept pace, but it is something I'd like to have filled
in by the release of Solr 6.

With the Streaming API you can take two (or more) totally disconnected
collections and get a result set with documents from one, both, or all of
them. To be clear, when I say they can be totally disconnected I mean
exactly that - the collections do not need to share any infrastructure or
even know about each other in any way. They can exist across any number of
data centers, use completely different ZooKeeper clusters, etc. No shared
infrastructure is necessary. Updates/inserts/deletes to one of the
collections have zero impact on the other collections.

In your example, with Items and FacilityItems, I'd most likely construct a
join like this (note, I'm using Streaming Expressions but the same would
be possible in SQL).

innerJoin(
  search(items, fl="itemId,itemDescription", q="*:*", sort="itemId asc"),
  search(facilityItems, fl="itemId,facilityName,cost", q="*:*",
         sort="itemId asc"),
  on="itemId"
)

This will return documents with the fields itemId, itemDescription,
facilityName, and cost. Because it's an innerJoin, only documents whose
itemId is found in both collections will be returned, but if you want you
can do a leftOuterJoin instead to get items which may not have
facilityItems documents.
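
As a sketch of that leftOuterJoin variant (same collections and field
names as the innerJoin above, and assuming it takes the same arguments;
the first stream is the one whose documents are always kept):

leftOuterJoin(
  search(items, fl="itemId,itemDescription", q="*:*", sort="itemId asc"),
  search(facilityItems, fl="itemId,facilityName,cost", q="*:*",
         sort="itemId asc"),
  on="itemId"
)

Items with no matching facilityItems document would then come back
without the facilityName and cost fields.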

Regarding the use of boosting - I'll assume that's because you're returning
results in score order. I can't remember the syntax to use in the
search(...) clause to tell it to sort by score, but for the sake of
discussion let's assume that sort="score desc" would do that (i.e., highest
score first). This poses a problem for the innerJoin: it is a merge-based
join, so it expects the two incoming streams to be sorted by the same
fields, and with a score sort that isn't possible. However, we can
instead use a hash-based join to get around this limitation.

hashJoin(
  search(items, fl="itemId,itemDescription", q="itemDescription:bear",
         sort="score desc"),
  hashed=search(facilityItems, fl="itemId,facilityName,cost", q="*:*",
                sort="itemId asc"),
  on="itemId"
)

Note that here I've changed the first search clause by adding a q clause
to find all items whose description includes "bear" and to sort by score.
I've also marked the second search clause as the one that should be hashed.
The stream that is marked to be hashed will be read in full and all of its
documents stored in memory. For this reason you'll almost always want to
hash the one with the fewest documents in it, but be aware that the order
of the results will depend on the order of the non-hashed stream. That is
why I've hashed the one whose order we don't necessarily care about and
am preserving the ordering by score.

This will return the exact same documents but the order will now be by the
score of the match found in the search over the items collection.

- Dennis
