Posted to dev@lucene.apache.org by Erick Erickson <er...@gmail.com> on 2016/11/17 23:40:51 UTC

wildcards for /export

I looked through the JIRAs and didn't see anything relevant, but
before raising a JIRA I thought I'd see if there was interest.

In the case where there are dynamic fields, one may not know which
fields are DV fields. I think there's a use case for /export
being able to accept wildcards, i.e. fl=* or fl=*_s. The idea would be
that the handler would get the field list and pull in all the fields
defined with docValues=true.
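
To make the proposal concrete, here's a rough sketch (Python rather than
Solr's Java, with a made-up schema snapshot; the real handler would consult
the IndexSchema) of expanding a wildcard fl to just the docValues fields:

```python
from fnmatch import fnmatch

# Hypothetical schema snapshot: field name -> docValues=true?
SCHEMA = {
    "id": True,
    "title_s": True,
    "body_t": False,   # text field, no docValues
    "price_f": True,
    "tag_s": True,
}

def expand_fl(fl_patterns, schema=SCHEMA):
    """Expand wildcard fl patterns (e.g. '*' or '*_s') to the concrete
    list of fields defined with docValues=true."""
    out = []
    for name, has_dv in sorted(schema.items()):
        if has_dv and any(fnmatch(name, p) for p in fl_patterns):
            out.append(name)
    return out
```

So fl=*_s would expand to just the docValues fields ending in _s, and a
non-DV field like body_t would never be exported even under fl=*.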

I suppose that this would be most useful if I got busy and actually
worked on SOLR-3191: "field exclusion from fl".

It's not clear to me, though, how expensive the introspection would be.

Thoughts?

Erick

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


Re: wildcards for /export

Posted by Yonik Seeley <ys...@gmail.com>.
On Thu, Nov 17, 2016 at 8:12 PM, Erick Erickson <er...@gmail.com> wrote:
> Yonik:
>
> Hmmm, we may be closer to that than it might appear. I happened to
> need to do some verification yesterday to determine whether I could
> limit the number of rows returned with TupleStream variants. /export
> of course doesn't do that, the close on a TupleStream waits until the
> entire stream is exhausted and throws the bits on the floor.
>
> Anyway, I was playing around with returning 10M rows with the /query
> and /export handlers and found out that I could indeed use /query and
> limit the rows. Fine so far.
>
> Then just for yucks I decided to try to use the /query handler with
> rows=100M and... the total processing time was virtually identical to
> /export. These weren't very sophisticated tests mind you; they did
> lend evidence that your idea is probably the way to go though.

When I did some ad-hoc tests a long time ago, /select was inexplicably
much slower (even when retrieving all docvalues and discounting
sorting time).
Some of the issue was probably a bug, fixed recently in
SolrIndexSearcher.decorateSomethingOrOther, that was creating a
top-level DV view.

Some other changes off the top of my head:
- if the number of docs being retrieved is very large (or all via
rows=-1), and if no other components (like highlighting) need the
top-N docs (needDocList), then defer sorting of the matches until
later.
- keep track of the DocSet on the ResponseBuilder (this is already
done when we facet via needDocSet?)
- if sorting was deferred, then sort in the most efficient way we know
how (i.e. don't always use a priority queue), or we can just do it
like export writer currently does.
- Invert the logic that writes DV fields so that we figure out the
fields once, look up the docvalues once, and then efficiently write
them out per document (Noble's addition of PushWriter is the right
direction here).
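
A toy sketch of that last bullet (Python, not Solr code; dv_readers and
its per-field lookup functions are stand-ins, not a real API):

```python
def export_docs(docids, fields, dv_readers):
    # Resolve each field's docValues reader once, up front...
    readers = [(f, dv_readers[f]) for f in fields]
    for docid in docids:
        # ...then write each document using the pre-resolved readers,
        # instead of re-resolving fields per document.
        yield {f: read(docid) for f, read in readers}

# Stand-in readers: docid -> value
dv = {"id": lambda d: d, "title_s": lambda d: "doc-%d" % d}
rows = list(export_docs([0, 1], ["id", "title_s"], dv))
```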

In the long run, this should be simpler to deal with from both a "fl"
point of view, as well as augmenters, pseudo-fields, and security.

But again, feel free to add whatever to /export in the meantime... I'm
just laying out a bigger picture in case anyone also wants to work
toward that as well.

-Yonik



Re: wildcards for /export

Posted by Erick Erickson <er...@gmail.com>.
Yonik:

Hmmm, we may be closer to that than it might appear. I happened to
need to do some verification yesterday to determine whether I could
limit the number of rows returned with TupleStream variants. /export
of course doesn't do that, the close on a TupleStream waits until the
entire stream is exhausted and throws the bits on the floor.

Anyway, I was playing around with returning 10M rows with the /query
and /export handlers and found out that I could indeed use /query and
limit the rows. Fine so far.

Then just for yucks I decided to try to use the /query handler with
rows=100M and... the total processing time was virtually identical to
/export. These weren't very sophisticated tests mind you; they did
lend evidence that your idea is probably the way to go though.

I guess we'd need some kind of param to only return DV fields when
wildcard fields were specified, but that's an implementation detail.
We could even re-use the /export handler definition in solrconfig.xml
and have it define another invariant property indicating this for
back-compat and then remove some code.

FWIW, using the /query handler _also_ exhausts the result set when you
call close(); it's just a smaller set ;).

I guess no JIRA for this in particular then; I'll let somebody _else_
raise the "remove ExportWriter" JIRA..... I will also point out that
it was a milestone for me when I could get as excited about _removing_
code as about writing it....

Erick


On Thu, Nov 17, 2016 at 4:40 PM, Yonik Seeley <ys...@gmail.com> wrote:
> On Thu, Nov 17, 2016 at 6:40 PM, Erick Erickson <er...@gmail.com> wrote:
>> I looked through the JIRAs and didn't see anything relevant, but
>> before raising a JIRA I thought I'd see if there was interest.
>>
>> In the case where there are dynamic fields, one may not know which
>> fields are DV fields. I think there's a use case for /export
>> being able to accept wildcards, i.e. fl=* or fl=*_s. The idea would be
>> that the handler would get the field list and pull in all the fields
>> defined with docValues=true.
>>
>> I suppose that this would be most useful if I got busy and actually
>> worked on SOLR-3191:"field exclusion from fl".
>>
>> It's not clear to me, though, how expensive the introspection would be.
>>
>> Thoughts?
>
> At the risk of sounding redundant, I'll point out again that if we
> simply fixed document retrieval to be faster for the normal
> QueryComponent, then this would be moot (and everyone would get faster
> doc retrieval).
>
> I'm sympathetic to the fact that it is more work of course.  Hopefully
> no one disagrees that "/select" document retrieval should be as fast
> as "/export" document retrieval though?
>
> -Yonik
>



Re: wildcards for /export

Posted by Yonik Seeley <ys...@gmail.com>.
On Thu, Nov 17, 2016 at 6:40 PM, Erick Erickson <er...@gmail.com> wrote:
> I looked through the JIRAs and didn't see anything relevant, but
> before raising a JIRA I thought I'd see if there was interest.
>
> In the case where there are dynamic fields, one may not know which
> fields are DV fields. I think there's a use case for /export
> being able to accept wildcards, i.e. fl=* or fl=*_s. The idea would be
> that the handler would get the field list and pull in all the fields
> defined with docValues=true.
>
> I suppose that this would be most useful if I got busy and actually
> worked on SOLR-3191:"field exclusion from fl".
>
> It's not clear to me, though, how expensive the introspection would be.
>
> Thoughts?

At the risk of sounding redundant, I'll point out again that if we
simply fixed document retrieval to be faster for the normal
QueryComponent, then this would be moot (and everyone would get faster
doc retrieval).

I'm sympathetic to the fact that it is more work of course.  Hopefully
no one disagrees that "/select" document retrieval should be as fast
as "/export" document retrieval though?

-Yonik



Re: wildcards for /export

Posted by Yonik Seeley <ys...@gmail.com>.
On Thu, Nov 17, 2016 at 9:51 PM, Joel Bernstein <jo...@gmail.com> wrote:
> It's possible that we could find a design where /select could behave like
> /export. I think Noble's design of treating a Stream as an iterator is
> promising.

> We could change all document result sets to iterators and hide
> the implementation of how the docs are materialized.

This is already done to a certain degree.  Documents (the stored
fields) have always been streamed (materialized one-by-one), starting
from a sorted int[].  This would just be a further optional step in
that direction starting with an unsorted set instead (in addition to
other needed changes to make things more efficient).
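
The streaming shape being described looks roughly like this (toy Python
sketch, not Solr code; the store dict stands in for stored-field lookup):

```python
def stream_docs(docids, fetch):
    # Materialize documents one at a time; nothing ever holds the full set.
    for docid in docids:
        yield fetch(docid)

store = {3: "c", 1: "a", 2: "b"}   # stand-in for per-doc field retrieval
sorted_ids = sorted(store)         # today's input: an already-sorted int[]
unsorted_ids = [3, 1, 2]           # proposed input: raw match order
in_order = list(stream_docs(sorted_ids, store.get))
# Deferred sorting: sort the unsorted match set only when it's time to stream.
deferred = list(stream_docs(sorted(unsorted_ids), store.get))
```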

> This would also impact
> how output from other search components would be handled. Since result sets
> aren't limited to top N, all summarized data, such as facets would need to
> come before the documents.

Faceting is currently pretty isolated... just starting with the base
DocSet.  Seems like it could come before or after streaming documents?

> Then Solrj would need to be able to read the
> summarized data into memory, and stream the documents. It's a nice design,
> but quite a bit of work.

Oh, I see where you're going with that... yeah, that's only if you
want to utilize facets + streaming documents in the same request (and
use the facet info to process the docs somehow).
Even w/o that, there are benefits to /select being more performant for
large sets.

-Yonik


> Joel Bernstein
> http://joelsolr.blogspot.com/
>
> On Thu, Nov 17, 2016 at 9:26 PM, Yonik Seeley <ys...@gmail.com> wrote:
>>
>> On Thu, Nov 17, 2016 at 9:16 PM, Joel Bernstein <jo...@gmail.com>
>> wrote:
>> > There were two issues that make the regular /select handler problematic
>> > for
>> > large result sets:
>> >
>> > 1) Use of stored fields, which require lots of disk access. I believe
>> > this
>> > has been resolved now that the field list can be pulled from the
>> > docValues.
>> >
>> > 2) The /select handler sorts by loading the top N docs into a priority
>> > queue.
>>
>> That feels like it could be optional though.  PQ makes sense for small
>> top-N that will go in the cache, but makes less sense when you want
>> all documents back.
>>
>> Look at it from the other perspective: if one is retrieving all
>> documents that match a query (and let's assume that the number of
>> matches is large), is /export ever less efficient in that case?  If
>> /export is always better in that scenario, that sounds like an
>> optimization, not a tradeoff or different design goal, and /select
>> should always be using the superior algorithm/mechanism for that case.
>>
>> -Yonik
>>
>>
>> > This approach becomes untenable at a certain point. The export
>> > handler iterates over a bitset of collected docs in multiple passes.
>> > This
>> > keeps constant performance as the result set grows. This is harder to
>> > make
>> > work without avoiding the current select logic.
>> >
>> > I'm not in full agreement that /select and /export need to come
>> > together.
>> > They really do have different design goals. /select tries to be very
>> > efficient and fast to support high QPS. /export tries to maintain
>> > constant
>> > memory use and performance as the result set size increases. Trying to
>> > find
>> > a way to accomplish both may just end up compromising the design
>> > so it doesn't serve either use case.
>> >
>> >
>> >
>> > Joel Bernstein
>> > http://joelsolr.blogspot.com/
>> >
>> > On Thu, Nov 17, 2016 at 9:05 PM, Yonik Seeley <ys...@gmail.com> wrote:
>> >>
>> >> On Thu, Nov 17, 2016 at 6:54 PM, Kevin Risden
>> >> <co...@gmail.com>
>> >> wrote:
>> >> > For reference, the SQL/JDBC piece needed ability to specify wildcard
>> >> > and
>> >> > figure out the "schema" of the collection including defined dynamic
>> >> > fields.
>> >>
>> >> Out of curiosity, how is this used (and in what contexts)?
>> >> I'm wondering the implications of new fields appearing when new
>> >> documents are added.  Will this mess up the JDBC driver?
>> >>
>> >> > When testing lately with supporting "select *" type semantics, it
>> >> > would
>> >> > be
>> >> > nice to be able to limit to only DocValues fields.
>> >>
>> >> I'm not sure we should be segregating stored fields this way (by
>> >> whether they are column/docValues or not).
>> >> By default, all of our non-text fields already have docvalues enabled.
>> >> If someone wants to retrieve or operate on row-stored text fields, it
>> >> seems like they should be able to do so via the streaming API (or
>> >> SQL).
>> >>
>> >> I guess we could also go the other direction and *only* support
>> >> docValues (i.e. scrap row-stored fields).  But that seems a little
>> >> more extreme, and I'm also not sure if binary docValues would work as
>> >> well or could hold text fields of the same size as row-stored fields
>> >> can.
>> >>
>> >> -Yonik
>> >>
>> >>
>> >
>>
>>
>



Re: wildcards for /export

Posted by Joel Bernstein <jo...@gmail.com>.
One way to adapt Solrj would be to keep its current memory structure for
facets etc.. and then have it return a TupleStream for documents.

Joel Bernstein
http://joelsolr.blogspot.com/

On Thu, Nov 17, 2016 at 9:51 PM, Joel Bernstein <jo...@gmail.com> wrote:

> It's possible that we could find a design where /select could behave like
> /export. I think Noble's design of treating a Stream as an iterator is
> promising. We could change all document result sets to iterators and hide
> the implementation of how the docs are materialized. This would also impact
> how output from other search components would be handled. Since result sets
> aren't limited to top N, all summarized data, such as facets would need to
> come before the documents. Then Solrj would need to be able to read the
> summarized data into memory, and stream the documents. It's a nice design,
> but quite a bit of work.
>
>
>
> Joel Bernstein
> http://joelsolr.blogspot.com/
>
> On Thu, Nov 17, 2016 at 9:26 PM, Yonik Seeley <ys...@gmail.com> wrote:
>
>> On Thu, Nov 17, 2016 at 9:16 PM, Joel Bernstein <jo...@gmail.com>
>> wrote:
>> > There were two issues that make the regular /select handler problematic
>> for
>> > large result sets:
>> >
>> > 1) Use of stored fields, which require lots of disk access. I believe
>> this
>> > has been resolved now that the field list can be pulled from the
>> docValues.
>> >
>> > 2) The /select handler sorts by loading the top N docs into a priority
>> > queue.
>>
>> That feels like it could be optional though.  PQ makes sense for small
>> top-N that will go in the cache, but makes less sense when you want
>> all documents back.
>>
>> Look at it from the other perspective: if one is retrieving all
>> documents that match a query (and let's assume that the number of
>> matches is large), is /export ever less efficient in that case?  If
>> /export is always better in that scenario, that sounds like an
>> optimization, not a tradeoff or different design goal, and /select
>> should always be using the superior algorithm/mechanism for that case.
>>
>> -Yonik
>>
>>
>> > This approach becomes untenable at a certain point. The export
>> > handler iterates over a bitset of collected docs in multiple passes.
>> This
>> > keeps constant performance as the result set grows. This is harder to
>> make
>> > work without avoiding the current select logic.
>> >
>> > I'm not in full agreement that /select and /export need to come
>> together.
>> > They really do have different design goals. /select tries to be very
>> > efficient and fast to support high QPS. /export tries to maintain
>> constant
>> > memory use and performance as the result set size increases. Trying to
>> find
>> > a way to accomplish both may just end up compromising the design
>> > so it doesn't serve either use case.
>> >
>> >
>> >
>> > Joel Bernstein
>> > http://joelsolr.blogspot.com/
>> >
>> > On Thu, Nov 17, 2016 at 9:05 PM, Yonik Seeley <ys...@gmail.com>
>> wrote:
>> >>
>> >> On Thu, Nov 17, 2016 at 6:54 PM, Kevin Risden <
>> compuwizard123@gmail.com>
>> >> wrote:
>> >> > For reference, the SQL/JDBC piece needed ability to specify wildcard
>> and
>> >> > figure out the "schema" of the collection including defined dynamic
>> >> > fields.
>> >>
>> >> Out of curiosity, how is this used (and in what contexts)?
>> >> I'm wondering the implications of new fields appearing when new
>> >> documents are added.  Will this mess up the JDBC driver?
>> >>
>> >> > When testing lately with supporting "select *" type semantics, it
>> would
>> >> > be
>> >> > nice to be able to limit to only DocValues fields.
>> >>
>> >> I'm not sure we should be segregating stored fields this way (by
>> >> whether they are column/docValues or not).
>> >> By default, all of our non-text fields already have docvalues enabled.
>> >> If someone wants to retrieve or operate on row-stored text fields, it
>> >> seems like they should be able to do so via the streaming API (or
>> >> SQL).
>> >>
>> >> I guess we could also go the other direction and *only* support
>> >> docValues (i.e. scrap row-stored fields).  But that seems a little
>> >> more extreme, and I'm also not sure if binary docValues would work as
>> >> well or could hold text fields of the same size as row-stored fields
>> >> can.
>> >>
>> >> -Yonik
>> >>
>> >>
>> >
>>
>>
>>
>

Re: wildcards for /export

Posted by Joel Bernstein <jo...@gmail.com>.
It's possible that we could find a design where /select could behave like
/export. I think Noble's design of treating a Stream as an iterator is
promising. We could change all document result sets to iterators and hide
the implementation of how the docs are materialized. This would also impact
how output from other search components would be handled. Since result sets
aren't limited to top N, all summarized data, such as facets would need to
come before the documents. Then Solrj would need to be able to read the
summarized data into memory, and stream the documents. It's a nice design,
but quite a bit of work.



Joel Bernstein
http://joelsolr.blogspot.com/

On Thu, Nov 17, 2016 at 9:26 PM, Yonik Seeley <ys...@gmail.com> wrote:

> On Thu, Nov 17, 2016 at 9:16 PM, Joel Bernstein <jo...@gmail.com>
> wrote:
> > There were two issues that make the regular /select handler problematic
> for
> > large result sets:
> >
> > 1) Use of stored fields, which require lots of disk access. I believe
> this
> > has been resolved now that the field list can be pulled from the
> docValues.
> >
> > 2) The /select handler sorts by loading the top N docs into a priority
> > queue.
>
> That feels like it could be optional though.  PQ makes sense for small
> top-N that will go in the cache, but makes less sense when you want
> all documents back.
>
> Look at it from the other perspective: if one is retrieving all
> documents that match a query (and let's assume that the number of
> matches is large), is /export ever less efficient in that case?  If
> /export is always better in that scenario, that sounds like an
> optimization, not a tradeoff or different design goal, and /select
> should always be using the superior algorithm/mechanism for that case.
>
> -Yonik
>
>
> > This approach becomes untenable at a certain point. The export
> > handler iterates over a bitset of collected docs in multiple passes.
> This
> > keeps constant performance as the result set grows. This is harder to
> make
> > work without avoiding the current select logic.
> >
> > I'm not in full agreement that /select and /export need to come together.
> > They really do have different design goals. /select tries to be very
> > efficient and fast to support high QPS. /export tries to maintain
> constant
> > memory use and performance as the result set size increases. Trying to
> find
> > a way to accomplish both may just end up compromising the design
> > so it doesn't serve either use case.
> >
> >
> >
> > Joel Bernstein
> > http://joelsolr.blogspot.com/
> >
> > On Thu, Nov 17, 2016 at 9:05 PM, Yonik Seeley <ys...@gmail.com> wrote:
> >>
> >> On Thu, Nov 17, 2016 at 6:54 PM, Kevin Risden <compuwizard123@gmail.com
> >
> >> wrote:
> >> > For reference, the SQL/JDBC piece needed ability to specify wildcard
> and
> >> > figure out the "schema" of the collection including defined dynamic
> >> > fields.
> >>
> >> Out of curiosity, how is this used (and in what contexts)?
> >> I'm wondering the implications of new fields appearing when new
> >> documents are added.  Will this mess up the JDBC driver?
> >>
> >> > When testing lately with supporting "select *" type semantics, it
> would
> >> > be
> >> > nice to be able to limit to only DocValues fields.
> >>
> >> I'm not sure we should be segregating stored fields this way (by
> >> whether they are column/docValues or not).
> >> By default, all of our non-text fields already have docvalues enabled.
> >> If someone wants to retrieve or operate on row-stored text fields, it
> >> seems like they should be able to do so via the streaming API (or
> >> SQL).
> >>
> >> I guess we could also go the other direction and *only* support
> >> docValues (i.e. scrap row-stored fields).  But that seems a little
> >> more extreme, and I'm also not sure if binary docValues would work as
> >> well or could hold text fields of the same size as row-stored fields
> >> can.
> >>
> >> -Yonik
> >>
> >>
> >
>
>
>

Re: wildcards for /export

Posted by Yonik Seeley <ys...@gmail.com>.
On Thu, Nov 17, 2016 at 9:16 PM, Joel Bernstein <jo...@gmail.com> wrote:
> There were two issues that make the regular /select handler problematic for
> large result sets:
>
> 1) Use of stored fields, which require lots of disk access. I believe this
> has been resolved now that the field list can be pulled from the docValues.
>
> 2) The /select handler sorts by loading the top N docs into a priority
> queue.

That feels like it could be optional though.  PQ makes sense for small
top-N that will go in the cache, but makes less sense when you want
all documents back.

Look at it from the other perspective: if one is retrieving all
> documents that match a query (and let's assume that the number of
matches is large), is /export ever less efficient in that case?  If
/export is always better in that scenario, that sounds like an
optimization, not a tradeoff or different design goal, and /select
should always be using the superior algorithm/mechanism for that case.

-Yonik


> This approach becomes untenable at a certain point. The export
> handler iterates over a bitset of collected docs in multiple passes. This
> keeps constant performance as the result set grows. This is harder to make
> work without avoiding the current select logic.
>
> I'm not in full agreement that /select and /export need to come together.
> They really do have different design goals. /select tries to be very
> efficient and fast to support high QPS. /export tries to maintain constant
> memory use and performance as the result set size increases. Trying to find
> a way to accomplish both may just end up compromising the design so it
> doesn't serve either use case.
>
>
>
> Joel Bernstein
> http://joelsolr.blogspot.com/
>
> On Thu, Nov 17, 2016 at 9:05 PM, Yonik Seeley <ys...@gmail.com> wrote:
>>
>> On Thu, Nov 17, 2016 at 6:54 PM, Kevin Risden <co...@gmail.com>
>> wrote:
>> > For reference, the SQL/JDBC piece needed ability to specify wildcard and
>> > figure out the "schema" of the collection including defined dynamic
>> > fields.
>>
>> Out of curiosity, how is this used (and in what contexts)?
>> I'm wondering the implications of new fields appearing when new
>> documents are added.  Will this mess up the JDBC driver?
>>
>> > When testing lately with supporting "select *" type semantics, it would
>> > be
>> > nice to be able to limit to only DocValues fields.
>>
>> I'm not sure we should be segregating stored fields this way (by
>> whether they are column/docValues or not).
>> By default, all of our non-text fields already have docvalues enabled.
>> If someone wants to retrieve or operate on row-stored text fields, it
>> seems like they should be able to do so via the streaming API (or
>> SQL).
>>
>> I guess we could also go the other direction and *only* support
>> docValues (i.e. scrap row-stored fields).  But that seems a little
>> more extreme, and I'm also not sure if binary docValues would work as
>> well or could hold text fields of the same size as row-stored fields
>> can.
>>
>> -Yonik
>>
>>
>



Re: wildcards for /export

Posted by Joel Bernstein <jo...@gmail.com>.
There were two issues that make the regular /select handler problematic for
large result sets:

1) Use of stored fields, which require lots of disk access. I believe this
has been resolved now that the field list can be pulled from the docValues.

2) The /select handler sorts by loading the top N docs into a priority
queue. This approach becomes untenable at a certain point. The export
handler iterates over a bitset of collected docs in multiple passes. This
keeps constant performance as the result set grows. This is harder to make
work without avoiding the current select logic.
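
The difference between the two strategies can be sketched in a few lines
(a toy contrast in Python, not Solr code):

```python
import heapq

def select_top_n(scores, n):
    # /select style: a bounded priority queue keeps memory at O(n)
    # no matter how many documents match.
    return heapq.nsmallest(n, scores)

def export_all(scores):
    # /export style: the full match set comes back in sort order; the real
    # handler bounds memory by re-scanning the bitset in multiple passes.
    return sorted(scores)

hits = [5, 1, 4, 2, 3]
top2 = select_top_n(hits, 2)
everything = export_all(hits)
```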

I'm not in full agreement that /select and /export need to come together.
They really do have different design goals. /select tries to be very
efficient and fast to support high QPS. /export tries to maintain constant
memory use and performance as the result set size increases. Trying to find
a way to accomplish both may just end up compromising the design so it
doesn't serve either use case.



Joel Bernstein
http://joelsolr.blogspot.com/

On Thu, Nov 17, 2016 at 9:05 PM, Yonik Seeley <ys...@gmail.com> wrote:

> On Thu, Nov 17, 2016 at 6:54 PM, Kevin Risden <co...@gmail.com>
> wrote:
> > For reference, the SQL/JDBC piece needed ability to specify wildcard and
> > figure out the "schema" of the collection including defined dynamic
> fields.
>
> Out of curiosity, how is this used (and in what contexts)?
> I'm wondering the implications of new fields appearing when new
> documents are added.  Will this mess up the JDBC driver?
>
> > When testing lately with supporting "select *" type semantics, it would
> be
> > nice to be able to limit to only DocValues fields.
>
> I'm not sure we should be segregating stored fields this way (by
> whether they are column/docValues or not).
> By default, all of our non-text fields already have docvalues enabled.
> If someone wants to retrieve or operate on row-stored text fields, it
> seems like they should be able to do so via the streaming API (or
> SQL).
>
> I guess we could also go the other direction and *only* support
> docValues (i.e. scrap row-stored fields).  But that seems a little
> more extreme, and I'm also not sure if binary docValues would work as
> well or could hold text fields of the same size as row-stored fields
> can.
>
> -Yonik
>
>
>

Re: wildcards for /export

Posted by Kevin Risden <co...@gmail.com>.
Right now we basically do what you said for the JDBC driver to return the
column headers. Add a special metadata tuple that has the fields info.
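
For illustration, that header tuple looks roughly like this (Python sketch;
the exact tuple shape here is illustrative, not the real wire format):

```python
def tuples_with_header(docs, fields):
    # Prepend a metadata tuple describing the fields, like JDBC column
    # headers, before streaming the document tuples themselves.
    yield {"isMetadata": True, "fields": fields}
    for doc in docs:
        yield doc

out = list(tuples_with_header([{"a": 1}], ["a"]))
```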

For Calcite (and more SQL support), we need to know information about the
fields ahead of time. Off the top of my head it is used for:

   - Knowing which fields are available
   - Which field data types are available
   - Generate a plan on how to evaluate
      - i.e., do things need to be cast

Right now we aren't doing any of that, so we are lazily determining fields.
The mismatch is between SQL being strongly typed with a schema and Solr not
requiring that.

This is only an issue right now for dynamic fields. For fields that are
defined in the schema, a single call to the schema API makes it easy to
determine fields that are there. Dynamic fields could end up being
anything, which makes it slightly harder.
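
The mismatch looks roughly like this (Python sketch; the patterns and
field names are invented for illustration):

```python
from fnmatch import fnmatch

# Dynamic fields are declared as patterns; a concrete field like "color_s"
# only exists once a document uses it, so the schema API alone can't
# enumerate them ahead of time.
declared_dynamic = ["*_s", "*_i"]
fields_seen_in_docs = ["color_s", "count_i", "color_s", "misc"]

# Lazily resolve: which observed fields are covered by a declared pattern?
resolved = sorted({f for f in fields_seen_in_docs
                   if any(fnmatch(f, p) for p in declared_dynamic)})
```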

Kevin Risden

On Fri, Nov 18, 2016 at 11:06 AM, Yonik Seeley <ys...@gmail.com> wrote:

> On Fri, Nov 18, 2016 at 11:46 AM, Kevin Risden <co...@gmail.com>
> wrote:
> >> Out of curiosity, how is this used (and in what contexts)?
> >> I'm wondering the implications of new fields appearing when new
> >> documents are added.  Will this mess up the JDBC driver?
> >
> >
> > Currently it's not being used. Select * isn't supported yet.
>
> Does "select *" need to know ahead of time what fields there will be?
> Although retrieving the set of stored fields (or the set of used
> fields) should be easy to implement, it doesn't seem like the most
> efficient way to implement things (at least at the streaming API
> level..)
>
> If we need to know the set of fields that match a wildcard before
> streaming all of the documents, we could easily add that info
> on-demand at the head (sort of like column headers).
>
> So currently we have:
>
> {
>   "response": {
>     "docs": [...]
>    }
> }
>
> And we could insert "fields" inside the response object:
>
> {
>   "response": {
>     "fields":["field1","field2","field3","field4"],
>     "docs": [...]
>    }
> }
>
> Or alternately outside of the "response" object:
>
> {
>     "fields":["field1","field2","field3","field4"],
>    "response": {
>     "docs": [...]
>    }
> }
>
> Or is there some reason we need the full set of fields before we even
> send the request?
>
> -Yonik
>
>
>

Re: wildcards for /export

Posted by Yonik Seeley <ys...@gmail.com>.
On Fri, Nov 18, 2016 at 11:46 AM, Kevin Risden <co...@gmail.com> wrote:
>> Out of curiosity, how is this used (and in what contexts)?
>> I'm wondering the implications of new fields appearing when new
>> documents are added.  Will this mess up the JDBC driver?
>
>
> Currently it's not being used. Select * isn't supported yet.

Does "select *" need to know ahead of time what fields there will be?
Although retrieving the set of stored fields (or the set of used
fields) should be easy to implement, it doesn't seem like the most
efficient way to implement things (at least at the streaming API
level..)

If we need to know the set of fields that match a wildcard before
streaming all of the documents, we could easily add that info
on-demand at the head (sort of like column headers).

So currently we have:

{
  "response": {
    "docs": [...]
   }
}

And we could insert "fields" inside the response object:

{
  "response": {
    "fields":["field1","field2","field3","field4"],
    "docs": [...]
   }
}

Or alternately outside of the "response" object:

{
    "fields":["field1","field2","field3","field4"],
   "response": {
    "docs": [...]
   }
}
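To make the proposal concrete, here is a sketch of how a client might consume that last shape, learning the full column set before iterating docs. The payload, field names, and values are invented for illustration; nothing here reflects an actual Solr response format, since the shape is exactly what's being discussed:

```python
import json

# A hypothetical /export payload using the proposed top-level "fields" header.
raw = """
{
  "fields": ["field1", "field2", "field3", "field4"],
  "response": {
    "docs": [
      {"field1": "a", "field2": 1},
      {"field1": "b", "field3": true}
    ]
  }
}
"""

payload = json.loads(raw)
# Like SQL column headers: the client knows the full column set up front,
# even though individual docs may omit sparse fields.
columns = payload["fields"]
rows = [[doc.get(f) for f in columns] for doc in payload["response"]["docs"]]
print(columns)  # ['field1', 'field2', 'field3', 'field4']
print(rows)     # [['a', 1, None, None], ['b', None, True, None]]
```

The point of the header is that a consumer (e.g. a JDBC result set) can size its columns before the first doc arrives, without any up-front schema request.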

Or is there some reason we need the full set of fields before we even
send the request?

-Yonik

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


Re: wildcards for /export

Posted by Kevin Risden <co...@gmail.com>.
>
> Out of curiosity, how is this used (and in what contexts)?
> I'm wondering the implications of new fields appearing when new
> documents are added.  Will this mess up the JDBC driver?


Currently it's not being used. Select * isn't supported yet. The JDBC driver
will just push down to the SQL handler. It will be in the SQL handler that
we need to worry about it. Maybe a commit or something could refresh the
field list. Haven't gotten that far yet.

Kevin Risden

On Thu, Nov 17, 2016 at 8:05 PM, Yonik Seeley <ys...@gmail.com> wrote:

> On Thu, Nov 17, 2016 at 6:54 PM, Kevin Risden <co...@gmail.com>
> wrote:
> > For reference, the SQL/JDBC piece needed the ability to specify wildcards and
> > figure out the "schema" of the collection including defined dynamic
> fields.
>
> Out of curiosity, how is this used (and in what contexts)?
> I'm wondering the implications of new fields appearing when new
> documents are added.  Will this mess up the JDBC driver?
>
> > When testing lately with supporting "select *" type semantics, it would
> be
> > nice to be able to limit to only DocValues fields.
>
> I'm not sure we should be segregating stored fields this way (by
> whether they are column/docValues or not).
> By default, all of our non-text fields already have docvalues enabled.
> If someone wants to retrieve or operate on row-stored text fields, it
> seems like they should be able to do so via the streaming API (or
> SQL).
>
> I guess we could also go the other direction and *only* support
> docValues (i.e. scrap row-stored fields).  But that seems a little
> more extreme, and I'm also not sure if binary docValues would work as
> well or could hold text fields of the same size as row-stored fields
> can.
>
> -Yonik

Re: wildcards for /export

Posted by Yonik Seeley <ys...@gmail.com>.
On Thu, Nov 17, 2016 at 6:54 PM, Kevin Risden <co...@gmail.com> wrote:
> For reference, the SQL/JDBC piece needed the ability to specify wildcards and
> figure out the "schema" of the collection including defined dynamic fields.

Out of curiosity, how is this used (and in what contexts)?
I'm wondering the implications of new fields appearing when new
documents are added.  Will this mess up the JDBC driver?

> When testing lately with supporting "select *" type semantics, it would be
> nice to be able to limit to only DocValues fields.

I'm not sure we should be segregating stored fields this way (by
whether they are column/docValues or not).
By default, all of our non-text fields already have docValues enabled.
If someone wants to retrieve or operate on row-stored text fields, it
seems like they should be able to do so via the streaming API (or
SQL).

I guess we could also go the other direction and *only* support
docValues (i.e. scrap row-stored fields).  But that seems a little
more extreme, and I'm also not sure if binary docValues would work as
well or could hold text fields of the same size as row-stored fields
can.

-Yonik

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


Re: wildcards for /export

Posted by Kevin Risden <co...@gmail.com>.
For reference, the SQL/JDBC piece needed the ability to specify wildcards and
figure out the "schema" of the collection including defined dynamic fields.
When testing lately with supporting "select *" type semantics, it would be
nice to be able to limit to only DocValues fields.
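Once the concrete field list is known, the expansion itself could look roughly like this: match the fl wildcard against field names and keep only the docValues ones. The schema dict, field names, and helper are made up for illustration; this is not how Solr represents its schema internally:

```python
from fnmatch import fnmatchcase

def expand_fl(pattern, schema_fields):
    """Return the docValues fields whose names match an fl wildcard."""
    return sorted(name for name, props in schema_fields.items()
                  if props.get("docValues") and fnmatchcase(name, pattern))

# Invented example schema: one row-stored text field without docValues.
schema = {
    "id":      {"docValues": True},
    "title_t": {"docValues": False},
    "color_s": {"docValues": True},
    "size_s":  {"docValues": True},
}
print(expand_fl("*_s", schema))  # ['color_s', 'size_s']
print(expand_fl("*", schema))    # ['color_s', 'id', 'size_s']
```

Note how fl=* silently drops `title_t` here; whether that filtering should happen at all is the open question in Yonik's reply.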

The introspection for determining fields needed to hit all shards w/ the
luke request handler due to some shards maybe having a dynamic field in
them. The Schema API gives back only the dynamic field patterns, not which
concrete fields are actually present in the collection. SOLR-8823 has an
implementation that does the introspection, if that helps.
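The union-across-shards step could look roughly like this. The per-shard dicts below are stand-ins for (heavily simplified) Luke handler responses with invented field names; a real /admin/luke response encodes field properties quite differently:

```python
# Stand-ins for per-core Luke responses: each shard may have materialized
# different concrete dynamic fields, so no single shard is authoritative.
shard1 = {"fields": {"id": {}, "price_s": {}}}
shard2 = {"fields": {"id": {}, "color_s": {}}}

def union_fields(shard_luke_responses):
    """Union the field names reported by every shard's Luke handler."""
    merged = set()
    for resp in shard_luke_responses:
        merged.update(resp.get("fields", {}))
    return sorted(merged)

print(union_fields([shard1, shard2]))  # ['color_s', 'id', 'price_s']
```

This is also where the cost question comes in: the union requires one Luke request per shard, which is what makes the introspection potentially expensive on large collections.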

Some related JIRAs:

   - implement get columns for JDBC spec - SOLR-8823
   - Calcite - SOLR-8593
      - uses the same logic basically as SOLR-8823


Kevin Risden

On Thu, Nov 17, 2016 at 5:40 PM, Erick Erickson <er...@gmail.com>
wrote:

> I looked through the JIRAs and didn't see anything relevant, but
> before raising a JIRA I thought I'd see if there was interest.
>
> In the case where there are dynamic fields, one may not know all the
> fields that are DV fields. I think there's a use case for /export
> being able to accept wildcards, i.e. fl=* or fl=*_s. The idea would be
> that the handler would get the field list and pull in all the fields
> defined with docValues=true.
>
> I suppose that this would be most useful if I got busy and actually
> worked on SOLR-3191:"field exclusion from fl".
>
> It's not clear to me, though, how expensive the introspection would be.
>
> Thoughts?
>
> Erick