Posted to solr-user@lucene.apache.org by John Smith <lo...@gmail.com> on 2018/02/23 17:17:33 UTC

statistics in hitlist

I'm using solr, and enabling stats as per this page:
https://lucene.apache.org/solr/guide/6_6/the-stats-component.html

I want to get more stat values though. Specifically I'm looking for
r-squared (coefficient of determination). This value is not present in
solr, however some of the pieces used to calculate r^2 are in the stats
element, for example:

<double name="min">0.0</double>
<double name="max">10.0</double>
<long name="count">15</long>
<long name="missing">17</long>
<double name="sum">85.0</double>
<double name="sumOfSquares">603.0</double>
<double name="mean">5.666666666666667</double>
<double name="stddev">2.943920288775949</double>


So I have the sumOfSquares available (SST), and using this calculation, I
can get R^2:

R^2 = 1 - SSE/SST

All I need then is SSE. Is there any way I can get SSE from those other
stats in solr?

Thanks in advance!
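
A note on the arithmetic above, offered as a sketch rather than anything Solr
provides: the "sumOfSquares" value in the stats element is the raw sum of x^2,
not the total sum of squares about the mean (SST). Assuming the usual
definition SST = sum((x - mean)^2), it can be derived from "sum",
"sumOfSquares" and "count". The function name below is hypothetical; the
numbers are the ones from the stats output above.

```python
import math

def total_sum_of_squares(count, total, sum_of_squares):
    """SST = sum(x^2) - (sum(x))^2 / n, i.e. squared deviations from the mean."""
    return sum_of_squares - (total * total) / count

# Values copied from the stats element above.
count, total, sum_of_squares = 15, 85.0, 603.0

sst = total_sum_of_squares(count, total, sum_of_squares)
# Sample standard deviation recovered from SST, for comparison with "stddev".
stddev = math.sqrt(sst / (count - 1))

print(sst, stddev)
```

The recovered standard deviation agrees with the reported "stddev", which
suggests the reported figure is the sample standard deviation. SSE still needs
a model's predictions, so it cannot be derived from these figures alone.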

Re: statistics in hitlist

Posted by Joel Bernstein <jo...@gmail.com>.
With regression you're looking at how the change in one variable affects
the change in another variable. So you need to have values that are
changing. What you described is an average of field X, which is not
changing, regressed against the value of X.

I think one approach to this is to regress the moving average of X against
the actual value of X. We can do this with the math library, but before
exploring the code for this, spend some time thinking about whether that's
the problem you're trying to solve. Take a look at
how moving averages work: https://en.wikipedia.org/wiki/Moving_average
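
A rough sketch of the moving-average idea, in plain Python rather than Solr's
math expressions: compute a trailing moving average of X, align it with the
actual values of X, and take the R^2 of a simple linear regression between the
two. The helper names, the window size and the sample data are assumptions for
illustration, and a trailing simple moving average is only one of the variants
described on the Wikipedia page.

```python
def moving_average(values, window):
    """Trailing simple moving average; returns len(values) - window + 1 points."""
    return [sum(values[i - window + 1:i + 1]) / window
            for i in range(window - 1, len(values))]

def r_squared(x, y):
    """R^2 of a simple linear regression of y on x (squared Pearson correlation)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    if sxx == 0 or syy == 0:
        # A series that never changes has no variance to regress on.
        raise ValueError("a constant series has no variance to regress on")
    return (sxy * sxy) / (sxx * syy)

# Made-up sample data, for illustration only.
x = [3.0, 5.0, 4.0, 8.0, 7.0, 9.0, 6.0, 10.0]
window = 3
ma = moving_average(x, window)
# Align each moving-average point with the actual value at the same index.
actual = x[window - 1:]
r2 = r_squared(ma, actual)
print(r2)
```

The ValueError branch is the "values that are changing" point from above: a
column holding the same average on every row has zero variance, so the
regression is undefined.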





Joel Bernstein
http://joelsolr.blogspot.com/

On Fri, Mar 16, 2018 at 9:26 AM, John Smith <lo...@gmail.com> wrote:

> Thanks for the link to the documentation, that will probably come in
> useful.
>
> I didn't see a way, though, to get my avg function working. Instead of
> doing a linear regression on two fields, X and Y, in a hitlist, we need to
> do a linear regression on field X and the average value of X. Is that
> possible? Can a function be passed to the regress function instead of a field?
>
>
>
>
>
> On Thu, Mar 15, 2018 at 10:41 PM, Joel Bernstein <jo...@gmail.com>
> wrote:
>
> > I've been working on the user guide for the math expressions. Here is the
> > page on regression:
> >
> > https://github.com/joel-bernstein/lucene-solr/blob/math_expressions_documentation/solr/solr-ref-guide/src/regression.adoc
> >
> > This page is part of the larger math expression documentation. The TOC is
> > here:
> >
> > https://github.com/joel-bernstein/lucene-solr/blob/math_expressions_documentation/solr/solr-ref-guide/src/math-expressions.adoc
> >
> > The docs are still very rough but you can get an idea of the coverage.
> >
> >
> >
> > Joel Bernstein
> > http://joelsolr.blogspot.com/
> >
> > On Thu, Mar 15, 2018 at 10:26 PM, Joel Bernstein <jo...@gmail.com>
> > wrote:
> >
> > > If you want to get everything in one query you can do this:
> > >
> > > let(echo="d,e",
> > >      a=search(tx_prod_production,
> > >               q="oil_first_90_days_production:[1 TO *]",
> > >               fq="isParent:true", rows="1500000",
> > >               fl="id,oil_first_90_days_production,oil_last_30_days_production",
> > >               sort="id asc"),
> > >      b=col(a, oil_first_90_days_production),
> > >      c=col(a, oil_last_30_days_production),
> > >      d=regress(b, c),
> > >      e=someExpression())
> > >
> > > The echo parameter tells the let expression which variables to output.
> > >
> > >
> > >
> > > Joel Bernstein
> > > http://joelsolr.blogspot.com/
> > >
> > > On Thu, Mar 15, 2018 at 3:13 PM, Erick Erickson <erickerickson@gmail.com>
> > > wrote:
> > >
> > >> What does the fq clause look like?
> > >>
> > >> On Thu, Mar 15, 2018 at 11:51 AM, John Smith <lo...@gmail.com>
> > >> wrote:
> > >> > Hi Joel, I did some more work on this statistics stuff today. Yes,
> we
> > do
> > >> > have nulls in our data; the document contains many fields, we don't
> > >> always
> > >> > have values for each field, but we can't set the nulls to 0 either
> (or
> > >> any
> > >> > other value, really) as that will mess up other calculations (such
> as
> > >> when
> > >> > calculating average etc); we would normally just ignore fields with
> > null
> > >> > values when calculating stats manually ourselves.
> > >> >
> > >> > Adding a check in the "q" parameter to ensure that the fields used
> in
> > >> the
> > >> > calculations are > 0 does work now. Thanks for the tip (and sorry,
> > >> should
> > >> > have caught that myself). But I am unable to use "fq" for these
> > checks,
> > >> > they have to be added to the q instead. Adding fq's doesn't have any
> > >> effect.
> > >> >
> > >> >
> > >> > Anyway, I'm trying to change this up a little. This is what I'm
> > >> currently
> > >> > using (switched from "random" to "search" since I actually need the
> > full
> > >> > hitlist not just a random subset):
> > >> >
> > >> > let(a=search(tx_prod_production,
> > >> >              q="oil_first_90_days_production:[1 TO *]",
> > >> >              fq="isParent:true", rows="1500000",
> > >> >              fl="id,oil_first_90_days_production,oil_last_30_days_production",
> > >> >              sort="id asc"),
> > >> >      b=col(a, oil_first_90_days_production),
> > >> >      c=col(a, oil_last_30_days_production),
> > >> >      d=regress(b, c))
> > >> >
> > >> > So I have 2 fields there defined, that works great (in terms of a
> test
> > >> and
> > >> > running the query); but I need to replace the second field,
> > >> > "oil_last_30_days_production" with the avg value in
> > >> > oil_first_90_days_production.
> > >> >
> > >> > I can get the avg with this expression:
> > >> > stats(tx_prod_production, q="oil_first_90_days_production:[1 TO *]",
> > >> >       fq="isParent:true", rows="1500000",
> > >> >       avg(oil_first_90_days_production))
> > >> >
> > >> > But I don't know how to push that avg value into the first streaming
> > >> > expression; guessing I have to set "c=...." but that is where I'm
> > >> getting
> > >> > lost, since avg only returns 1 value and the first parameter, "b",
> > >> returns
> > >> > a list of sorts. Somehow I have to get the avg value stuffed inside
> a
> > >> > "col", where it is the same value for every row in the hitlist...?
> > >> >
> > >> > Thanks for your help!
> > >> >
> > >> >
> > >> > On Mon, Mar 5, 2018 at 10:50 PM, Joel Bernstein <joelsolr@gmail.com
> >
> > >> wrote:
> > >> >
> > >> >> I suspect you've got nulls in your data. I just tested with null
> > >> values and
> > >> >> got the same error. For testing purposes try loading the data with
> > >> default
> > >> >> values of zero.
> > >> >>
> > >> >>
> > >> >> Joel Bernstein
> > >> >> http://joelsolr.blogspot.com/
> > >> >>
> > >> >> On Mon, Mar 5, 2018 at 10:12 PM, Joel Bernstein <
> joelsolr@gmail.com>
> > >> >> wrote:
> > >> >>
> > >> >> > Let's break the expression down and build it up slowly. Let's
> start
> > >> with:
> > >> >> >
> > >> >> > let(echo="true",
> > >> >> >      a=random(tx_prod_production, q="*:*", fq="isParent:true",
> > >> >> >               rows="15",
> > >> >> >               fl="oil_first_90_days_production,oil_last_30_days_production"),
> > >> >> >      b=col(a, oil_first_90_days_production))
> > >> >> >
> > >> >> >
> > >> >> > This should return variables a and b. Let's see what the data looks
> > >> >> > like. I changed the rows from 15000 to 15. If it all looks good we can
> > >> >> > expand the rows and continue adding functions.
> > >> >> >
> > >> >> >
> > >> >> >
> > >> >> >
> > >> >> > Joel Bernstein
> > >> >> > http://joelsolr.blogspot.com/
> > >> >> >
> > >> >> > On Mon, Mar 5, 2018 at 4:11 PM, John Smith <localdevjs@gmail.com
> >
> > >> wrote:
> > >> >> >
> > >> >> >> Thanks Joel for your help on this.
> > >> >> >>
> > >> >> >> What I've done so far:
> > >> >> >> - unzip downloaded solr-7.2
> > >> >> >> - modify the _default "managed-schema" to add the random field
> > type
> > >> and
> > >> >> >> the dynamic random field
> > >> >> >> - start solr7 using "solr start -c"
> > >> >> >> - indexed my data using pint/pdouble/boolean field types etc
> > >> >> >>
> > >> >> >> I can now run the random function all by itself, it returns
> random
> > >> >> >> results as expected. So far so good!
> > >> >> >>
> > >> >> >> However... now trying to get the regression stuff working:
> > >> >> >>
> > >> >> >> let(a=random(tx_prod_production, q="*:*", fq="isParent:true",
> > >> >> >>              rows="15000",
> > >> >> >>              fl="oil_first_90_days_production,oil_last_30_days_production"),
> > >> >> >>     b=col(a, oil_first_90_days_production),
> > >> >> >>     c=col(a, oil_last_30_days_production),
> > >> >> >>     d=regress(b, c))
> > >> >> >>
> > >> >> >> I posted this directly into the solr admin UI. When I run the streaming
> > >> >> >> expression, I get this error message:
> > >> >> >> "EXCEPTION": "Failed to evaluate expression regress(b,c) - Numeric value
> > >> >> >> expected but found type java.lang.String for value
> > >> >> >> oil_first_90_days_production"
> > >> >> >>
> > >> >> >> It thinks my numeric field is defined as a string? But when I
> view
> > >> the
> > >> >> >> schema, those 2 fields are defined as ints:
> > >> >> >>
> > >> >> >>
> > >> >> >> When I run a normal query and choose xml as output format, then
> it
> > >> also
> > >> >> >> puts "int" elements into the hitlist, so the schema appears to
> be
> > >> >> correct
> > >> >> >> it's just when using this regress function that something goes
> > >> wrong and
> > >> >> >> solr thinks the field is string.
> > >> >> >>
> > >> >> >> Any suggestions?
> > >> >> >> Thanks!
> > >> >> >>
> > >> >> >>
> > >> >> >>
> > >> >> >> On Thu, Mar 1, 2018 at 9:12 PM, Joel Bernstein <
> > joelsolr@gmail.com>
> > >> >> >> wrote:
> > >> >> >>
> > >> >> >>> The field type will also need to be in the schema:
> > >> >> >>>
> > >> >> >>> <!-- The "RandomSortField" is not used to store or search any
> > >> >> >>>      data.  You can declare fields of this type in your schema
> > >> >> >>>      to generate pseudo-random orderings of your docs for sorting
> > >> >> >>>      or function purposes.  The ordering is generated based on the
> > >> >> >>>      field name and the version of the index. As long as the index
> > >> >> >>>      version remains unchanged, and the same field name is reused,
> > >> >> >>>      the ordering of the docs will be consistent.
> > >> >> >>>      If you want different pseudo-random orderings of documents,
> > >> >> >>>      for the same version of the index, use a dynamicField and
> > >> >> >>>      change the field name in the request.
> > >> >> >>>   -->
> > >> >> >>> <fieldType name="random" class="solr.RandomSortField" indexed="true" />
> > >> >> >>>
> > >> >> >>>
> > >> >> >>> Joel Bernstein
> > >> >> >>> http://joelsolr.blogspot.com/
> > >> >> >>>
> > >> >> >>> On Thu, Mar 1, 2018 at 8:00 PM, Joel Bernstein <
> > joelsolr@gmail.com
> > >> >
> > >> >> >>> wrote:
> > >> >> >>>
> > >> >> >>> > You'll need to have this field in your schema:
> > >> >> >>> >
> > >> >> >>> > <dynamicField name="random_*" type="random" />
> > >> >> >>> >
> > >> >> >>> > I'll check to see if the default schema used with solr start -c has
> > >> >> >>> > this field; if not, I'll add it. Thanks for pointing this out.
> > >> >> >>> >
> > >> >> >>> > I checked, and right now the random expression only accepts one fq,
> > >> >> >>> > but I consider this a bug. It should accept multiple. I'll create a
> > >> >> >>> > ticket for getting this fixed.
> > >> >> >>> >
> > >> >> >>> >
> > >> >> >>> >
> > >> >> >>> > Joel Bernstein
> > >> >> >>> > http://joelsolr.blogspot.com/
> > >> >> >>> >
> > >> >> >>> > On Thu, Mar 1, 2018 at 4:55 PM, John Smith <
> > localdevjs@gmail.com
> > >> >
> > >> >> >>> wrote:
> > >> >> >>> >
> > >> >> >>> >> Joel, thanks for the pointers to the streaming feature. I had no idea
> > >> >> >>> >> solr had that (and also just discovered the very interesting sql
> > >> >> >>> >> feature! I will be sure to investigate that in more detail in the
> > >> >> >>> >> future).
> > >> >> >>> >>
> > >> >> >>> >> However I'm having some trouble getting basic streaming
> > >> functions
> > >> >> >>> working.
> > >> >> >>> >> I've already figured out that I had to move to "solr cloud"
> > >> instead
> > >> >> of
> > >> >> >>> >> "solr standalone" because I was getting errors about "cannot
> > >> find zk
> > >> >> >>> >> instance" or whatever which went away when using "solr start
> > -c"
> > >> >> >>> instead.
> > >> >> >>> >>
> > >> >> >>> >> But now I'm trying to use the random function since that was
> > >> one of
> > >> >> >>> the
> > >> >> >>> >> functions used in your example.
> > >> >> >>> >>
> > >> >> >>> >> random(tx_header, q="*:*", rows="100", fl="countyname")
> > >> >> >>> >>
> > >> >> >>> >> I posted that directly in the "stream" section of the solr
> > >> admin UI.
> > >> >> >>> This
> > >> >> >>> >> is all on linux, with solr 7.1.0 and 7.2.1 (tried several
> > >> versions
> > >> >> in
> > >> >> >>> case
> > >> >> >>> >> it was a bug in one)
> > >> >> >>> >>
> > >> >> >>> >> I get back an error message:
> > >> >> >>> >> *sort param could not be parsed as a query, and is not a
> field
> > >> that
> > >> >> >>> exists
> > >> >> >>> >> in the index: random_-255009774*
> > >> >> >>> >>
> > >> >> >>> >> I'm not passing in any sort field anywhere. But the solr
> logs
> > >> show
> > >> >> >>> these
> > >> >> >>> >> three log entries:
> > >> >> >>> >>
> > >> >> >>> >> 2018-03-01 21:41:18.954 INFO  (qtp257513673-21) [c:tx_header s:shard1
> > >> >> >>> >> r:core_node2 x:tx_header_shard1_replica_n1] o.a.s.c.S.Request
> > >> >> >>> >> [tx_header_shard1_replica_n1]  webapp=/solr path=/select
> > >> >> >>> >> params={q=*:*&_stateVer_=tx_header:6&fl=countyname*&sort=random_-255009774+asc*&rows=100&wt=javabin&version=2}
> > >> >> >>> >> status=400 QTime=19
> > >> >> >>> >>
> > >> >> >>> >> 2018-03-01 21:41:18.966 ERROR (qtp257513673-17) [c:tx_header s:shard1
> > >> >> >>> >> r:core_node2 x:tx_header_shard1_replica_n1] o.a.s.c.s.i.CloudSolrClient
> > >> >> >>> >> Request to collection [tx_header] failed due to (400)
> > >> >> >>> >> org.apache.solr.client.solrj.impl.HttpSolrClient$RemoteSolrException:
> > >> >> >>> >> Error from server at http://192.168.13.31:8983/solr/tx_header: sort param
> > >> >> >>> >> could not be parsed as a query, and is not a field that exists in the
> > >> >> >>> >> index: random_-255009774, retry? 0
> > >> >> >>> >>
> > >> >> >>> >> 2018-03-01 21:41:18.968 ERROR (qtp257513673-17) [c:tx_header s:shard1
> > >> >> >>> >> r:core_node2 x:tx_header_shard1_replica_n1] o.a.s.c.s.i.s.ExceptionStream
> > >> >> >>> >> java.io.IOException:
> > >> >> >>> >> org.apache.solr.client.solrj.impl.HttpSolrClient$RemoteSolrException:
> > >> >> >>> >> Error from server at http://192.168.13.31:8983/solr/tx_header: sort param
> > >> >> >>> >> could not be parsed as a query, and is not a field that exists in the
> > >> >> >>> >> index: random_-255009774
> > >> >> >>> >>
> > >> >> >>> >>
> > >> >> >>> >> So basically it looks like solr is injecting the
> > "sort=random_"
> > >> >> stuff
> > >> >> >>> into
> > >> >> >>> >> my query and of course that is failing on the search since
> > that
> > >> >> >>> >> field/column doesn't exist in my schema. Everytime I run the
> > >> random
> > >> >> >>> >> function, I get a slightly different field name that it
> > >> injects, but
> > >> >> >>> they
> > >> >> >>> >> all start with "random_" etc.
> > >> >> >>> >>
> > >> >> >>> >> I have tried adding my own sort field instead, hoping solr
> > >> wouldn't
> > >> >> >>> inject
> > >> >> >>> >> one for me, but it still injected a random sort fieldname:
> > >> >> >>> >> random(tx_header, q="*:*", rows="100", fl="countyname",
> > >> >> >>> >>        sort="countyname asc")
> > >> >> >>> >>
> > >> >> >>> >>
> > >> >> >>> >> Assuming I can fix that whole problem, my second question
> is:
> > >> can I
> > >> >> >>> add
> > >> >> >>> >> multiple "fq=" parameters to the random function? I build a
> > >> pretty
> > >> >> >>> >> complicated query using many fq= fields, and then want to
> run
> > >> some
> > >> >> >>> stats
> > >> >> >>> >> on
> > >> >> >>> >> that hitlist; so somehow I have to pass in the query that
> made
> > >> up
> > >> >> the
> > >> >> >>> >> exact
> > >> >> >>> >> hitlist to these various functions, but when I used multiple
> > >> "fq="
> > >> >> >>> values
> > >> >> >>> >> it only seemed to use the last one I specified and just
> > ignored
> > >> all
> > >> >> >>> the
> > >> >> >>> >> previous fq's?
> > >> >> >>> >>
> > >> >> >>> >> Thanks in advance for any comments/suggestions...!
> > >> >> >>> >>
> > >> >> >>> >>
> > >> >> >>> >>
> > >> >> >>> >>
> > >> >> >>> >> On Fri, Feb 23, 2018 at 5:59 PM, Joel Bernstein <
> > >> joelsolr@gmail.com
> > >> >> >
> > >> >> >>> >> wrote:
> > >> >> >>> >>
> > >> >> >>> >> > This is going to be a complex answer because Solr actually
> > >> now has
> > >> >> >>> >> multiple
> > >> >> >>> >> > ways of doing regression analysis as part of the Streaming
> > >> >> >>> Expression
> > >> >> >>> >> > statistical programming library. The basic documentation
> is
> > >> here:
> > >> >> >>> >> >
> > >> >> >>> >> > https://lucene.apache.org/solr/guide/7_2/statistical-programming.html
> > >> >> >>> >> >
> > >> >> >>> >> > Here is a sample expression that performs a simple linear
> > >> >> >>> regression in
> > >> >> >>> >> > Solr 7.2:
> > >> >> >>> >> >
> > >> >> >>> >> > let(a=random(collection1, q="any query", rows="15000",
> > >> >> >>> >> >              fl="fieldA, fieldB"),
> > >> >> >>> >> >     b=col(a, fieldA),
> > >> >> >>> >> >     c=col(a, fieldB),
> > >> >> >>> >> >     d=regress(b, c))
> > >> >> >>> >> >
> > >> >> >>> >> >
> > >> >> >>> >> > The expression above takes a random sample of 15000 results from
> > >> >> >>> >> > collection1. The result set will include fieldA and fieldB in each
> > >> >> >>> >> > record. The result set is stored in variable "a".
> > >> >> >>> >> >
> > >> >> >>> >> > Then the "col" function creates arrays of numbers from the results
> > >> >> >>> >> > stored in variable a. The values in fieldA are stored in the variable
> > >> >> >>> >> > "b". The values in fieldB are stored in variable "c".
> > >> >> >>> >> >
> > >> >> >>> >> > Then the regress function performs a simple linear regression on arrays
> > >> >> >>> >> > stored in variables "b" and "c".
> > >> >> >>> >> >
> > >> >> >>> >> > The output of the regress function is a map containing the regression
> > >> >> >>> >> > result. This result includes RSquared and other attributes of the
> > >> >> >>> >> > regression model such as R (correlation), slope, y intercept etc...
> > >> >> >>> >> >
> > >> >> >>> >> >
> > >> >> >>> >> >
> > >> >> >>> >> >
> > >> >> >>> >> >
> > >> >> >>> >> >
> > >> >> >>> >> >
> > >> >> >>> >> >
> > >> >> >>> >> > Joel Bernstein
> > >> >> >>> >> > http://joelsolr.blogspot.com/
> > >> >> >>> >> >
> > >> >> >>> >> > On Fri, Feb 23, 2018 at 3:10 PM, John Smith <
> > >> localdevjs@gmail.com
> > >> >> >
> > >> >> >>> >> wrote:
> > >> >> >>> >> >
> > >> >> >>> >> > > Hi Joel, thanks for the answer. I'm not really a stats
> > guy,
> > >> but
> > >> >> >>> the
> > >> >> >>> >> end
> > >> >> >>> >> > > result of all this is supposed to be obtaining R^2. Is
> > >> there no
> > >> >> >>> way of
> > >> >> >>> >> > > obtaining this value, then (short of iterating over all
> > the
> > >> >> >>> results in
> > >> >> >>> >> > the
> > >> >> >>> >> > > hitlist and calculating it myself)?
> > >> >> >>> >> > >
> > >> >> >>> >> > > On Fri, Feb 23, 2018 at 12:26 PM, Joel Bernstein <
> > >> >> >>> joelsolr@gmail.com>
> > >> >> >>> >> > > wrote:
> > >> >> >>> >> > >
> > >> >> >>> >> > > > Typically SSE is the sum of the squared errors of the
> > >> >> >>> prediction in
> > >> >> >>> >> a
> > >> >> >>> >> > > > regression analysis. The stats component doesn't
> perform
> > >> >> >>> regression,
> > >> >> >>> >> > > > although it might be a nice feature.
> > >> >> >>> >> > > >
> > >> >> >>> >> > > >
> > >> >> >>> >> > > >
> > >> >> >>> >> > > > Joel Bernstein
> > >> >> >>> >> > > > http://joelsolr.blogspot.com/
> > >> >> >>> >> > > >
> > >> >> >>> >> > > > On Fri, Feb 23, 2018 at 12:17 PM, John Smith <
> > >> >> >>> localdevjs@gmail.com>
> > >> >> >>> >> > > wrote:
> > >> >> >>> >> > > >
> > >> >> >>> >> > > > > I'm using solr, and enabling stats as per this page:
> > >> >> >>> >> > > > > https://lucene.apache.org/solr/guide/6_6/the-stats-component.html
> > >> >> >>> >> > > > >
> > >> >> >>> >> > > > > I want to get more stat values though. Specifically
> > I'm
> > >> >> >>> looking
> > >> >> >>> >> for
> > >> >> >>> >> > > > > r-squared (coefficient of determination). This value
> > is
> > >> not
> > >> >> >>> >> present
> > >> >> >>> >> > in
> > >> >> >>> >> > > > > solr, however some of the pieces used to calculate
> r^2
> > >> are
> > >> >> in
> > >> >> >>> the
> > >> >> >>> >> > stats
> > >> >> >>> >> > > > > element, for example:
> > >> >> >>> >> > > > >
> > >> >> >>> >> > > > > <double name="min">0.0</double>
> > >> >> >>> >> > > > > <double name="max">10.0</double>
> > >> >> >>> >> > > > > <long name="count">15</long>
> > >> >> >>> >> > > > > <long name="missing">17</long>
> > >> >> >>> >> > > > > <double name="sum">85.0</double>
> > >> >> >>> >> > > > > <double name="sumOfSquares">603.0</double>
> > >> >> >>> >> > > > > <double name="mean">5.666666666666667</double>
> > >> >> >>> >> > > > > <double name="stddev">2.943920288775949</double>
> > >> >> >>> >> > > > >
> > >> >> >>> >> > > > >
> > >> >> >>> >> > > > > So I have the sumOfSquares available (SST), and
> using
> > >> this
> > >> >> >>> >> > > calculation, I
> > >> >> >>> >> > > > > can get R^2:
> > >> >> >>> >> > > > >
> > >> >> >>> >> > > > > R^2 = 1 - SSE/SST
> > >> >> >>> >> > > > >
> > >> >> >>> >> > > > > All I need then is SSE. Is there anyway I can get
> SSE
> > >> from
> > >> >> >>> those
> > >> >> >>> >> > other
> > >> >> >>> >> > > > > stats in solr?
> > >> >> >>> >> > > > >
> > >> >> >>> >> > > > > Thanks in advance!
> > >> >> >>> >> > > > >
> > >> >> >>> >> > > >
> > >> >> >>> >> > >
> > >> >> >>> >> >
> > >> >> >>> >>
> > >> >> >>> >
> > >> >> >>> >
> > >> >> >>>
> > >> >> >>
> > >> >> >>
> > >> >> >
> > >> >>
> > >>
> > >
> > >
> >
>

Re: statistics in hitlist

Posted by John Smith <lo...@gmail.com>.
Thanks for the link to the documentation, that will probably come in useful.

I didn't see a way, though, to get my avg function working. Instead of
doing a linear regression on two fields, X and Y, in a hitlist, we need to
do a linear regression on field X and the average value of X. Is that
possible? Can a function be passed to the regress function instead of a field?
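
As a sketch of why a constant average is an awkward regression target
(assuming R^2 is computed as 1 - SSE/SST, as in the original question): if the
"prediction" for every row is simply the mean of X, then SSE equals SST and
R^2 comes out as 0 regardless of the data. The function name and sample
values below are made up for illustration; this is not Solr's regress
implementation.

```python
def r_squared_from_predictions(actual, predicted):
    """R^2 = 1 - SSE/SST for an arbitrary list of predictions."""
    mean = sum(actual) / len(actual)
    # SSE: squared error between each actual value and its prediction.
    sse = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    # SST: squared deviation of each actual value from the mean.
    sst = sum((a - mean) ** 2 for a in actual)
    return 1.0 - sse / sst

# Made-up values standing in for field X.
x = [2.0, 4.0, 6.0, 8.0]
mean_x = sum(x) / len(x)

# "Predict" the average of X for every row in the hitlist.
constant_prediction = [mean_x] * len(x)
r2 = r_squared_from_predictions(x, constant_prediction)
print(r2)
```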





On Thu, Mar 15, 2018 at 10:41 PM, Joel Bernstein <jo...@gmail.com> wrote:

> I've been working on the user guide for the math expressions. Here is the
> page on regression:
>
> https://github.com/joel-bernstein/lucene-solr/blob/math_expressions_documentation/solr/solr-ref-guide/src/regression.adoc
>
> This page is part of the larger math expression documentation. The TOC is
> here:
>
> https://github.com/joel-bernstein/lucene-solr/blob/math_expressions_documentation/solr/solr-ref-guide/src/math-expressions.adoc
>
> The docs are still very rough but you can get an idea of the coverage.
>
>
>
> Joel Bernstein
> http://joelsolr.blogspot.com/
>
> On Thu, Mar 15, 2018 at 10:26 PM, Joel Bernstein <jo...@gmail.com>
> wrote:
>
> > If you want to get everything in one query you can do this:
> >
> > let(echo="d,e",
> >      a=search(tx_prod_production,
> >               q="oil_first_90_days_production:[1 TO *]",
> >               fq="isParent:true", rows="1500000",
> >               fl="id,oil_first_90_days_production,oil_last_30_days_production",
> >               sort="id asc"),
> >      b=col(a, oil_first_90_days_production),
> >      c=col(a, oil_last_30_days_production),
> >      d=regress(b, c),
> >      e=someExpression())
> >
> > The echo parameter tells the let expression which variables to output.
> >
> >
> >
> > Joel Bernstein
> > http://joelsolr.blogspot.com/
> >
> > On Thu, Mar 15, 2018 at 3:13 PM, Erick Erickson <erickerickson@gmail.com>
> > wrote:
> >
> >> What does the fq clause look like?
> >>
> >> On Thu, Mar 15, 2018 at 11:51 AM, John Smith <lo...@gmail.com>
> >> wrote:
> >> > Hi Joel, I did some more work on this statistics stuff today. Yes, we
> do
> >> > have nulls in our data; the document contains many fields, we don't
> >> always
> >> > have values for each field, but we can't set the nulls to 0 either (or
> >> any
> >> > other value, really) as that will mess up other calculations (such as
> >> when
> >> > calculating average etc); we would normally just ignore fields with
> null
> >> > values when calculating stats manually ourselves.
> >> >
> >> > Adding a check in the "q" parameter to ensure that the fields used in
> >> the
> >> > calculations are > 0 does work now. Thanks for the tip (and sorry,
> >> should
> >> > have caught that myself). But I am unable to use "fq" for these
> checks,
> >> > they have to be added to the q instead. Adding fq's doesn't have any
> >> effect.
> >> >
> >> >
> >> > Anyway, I'm trying to change this up a little. This is what I'm
> >> currently
> >> > using (switched from "random" to "search" since I actually need the
> full
> >> > hitlist not just a random subset):
> >> >
> >> > let(a=search(tx_prod_production,
> >> >              q="oil_first_90_days_production:[1 TO *]",
> >> >              fq="isParent:true", rows="1500000",
> >> >              fl="id,oil_first_90_days_production,oil_last_30_days_production",
> >> >              sort="id asc"),
> >> >      b=col(a, oil_first_90_days_production),
> >> >      c=col(a, oil_last_30_days_production),
> >> >      d=regress(b, c))
> >> >
> >> > So I have 2 fields there defined, that works great (in terms of a test
> >> and
> >> > running the query); but I need to replace the second field,
> >> > "oil_last_30_days_production" with the avg value in
> >> > oil_first_90_days_production.
> >> >
> >> > I can get the avg with this expression:
> >> > stats(tx_prod_production, q="oil_first_90_days_production:[1 TO *]",
> >> >       fq="isParent:true", rows="1500000",
> >> >       avg(oil_first_90_days_production))
> >> >
> >> > But I don't know how to push that avg value into the first streaming
> >> > expression; guessing I have to set "c=...." but that is where I'm
> >> getting
> >> > lost, since avg only returns 1 value and the first parameter, "b",
> >> returns
> >> > a list of sorts. Somehow I have to get the avg value stuffed inside a
> >> > "col", where it is the same value for every row in the hitlist...?
> >> >
> >> > Thanks for your help!
> >> >
> >> >
> >> > On Mon, Mar 5, 2018 at 10:50 PM, Joel Bernstein <jo...@gmail.com>
> >> wrote:
> >> >
> >> >> I suspect you've got nulls in your data. I just tested with null
> >> values and
> >> >> got the same error. For testing purposes try loading the data with
> >> default
> >> >> values of zero.
> >> >>
> >> >>
> >> >> Joel Bernstein
> >> >> http://joelsolr.blogspot.com/
> >> >>
> >> >> On Mon, Mar 5, 2018 at 10:12 PM, Joel Bernstein <jo...@gmail.com>
> >> >> wrote:
> >> >>
> >> >> > Let's break the expression down and build it up slowly. Let's start
> >> with:
> >> >> >
> >> >> > let(echo="true",
> >> >> >      a=random(tx_prod_production, q="*:*", fq="isParent:true",
> >> >> >               rows="15",
> >> >> >               fl="oil_first_90_days_production,oil_last_30_days_production"),
> >> >> >      b=col(a, oil_first_90_days_production))
> >> >> >
> >> >> >
> >> >> > This should return variables a and b. Let's see what the data looks
> >> >> > like. I changed the rows from 15000 to 15. If it all looks good we can
> >> >> > expand the rows and continue adding functions.
> >> >> >
> >> >> >
> >> >> >
> >> >> >
> >> >> > Joel Bernstein
> >> >> > http://joelsolr.blogspot.com/
> >> >> >
> >> >> > On Mon, Mar 5, 2018 at 4:11 PM, John Smith <lo...@gmail.com>
> >> wrote:
> >> >> >
> >> >> >> Thanks Joel for your help on this.
> >> >> >>
> >> >> >> What I've done so far:
> >> >> >> - unzip downloaded solr-7.2
> >> >> >> - modify the _default "managed-schema" to add the random field
> type
> >> and
> >> >> >> the dynamic random field
> >> >> >> - start solr7 using "solr start -c"
> >> >> >> - indexed my data using pint/pdouble/boolean field types etc
> >> >> >>
> >> >> >> I can now run the random function all by itself, it returns random
> >> >> >> results as expected. So far so good!
> >> >> >>
> >> >> >> However... now trying to get the regression stuff working:
> >> >> >>
> >> >> >> let(a=random(tx_prod_production, q="*:*", fq="isParent:true",
> >> >> >>              rows="15000",
> >> >> >>              fl="oil_first_90_days_production,oil_last_30_days_production"),
> >> >> >>     b=col(a, oil_first_90_days_production),
> >> >> >>     c=col(a, oil_last_30_days_production),
> >> >> >>     d=regress(b, c))
> >> >> >>
> >> >> >> I posted this directly into the solr admin UI. When I run the streaming
> >> >> >> expression, I get this error message:
> >> >> >> "EXCEPTION": "Failed to evaluate expression regress(b,c) - Numeric value
> >> >> >> expected but found type java.lang.String for value
> >> >> >> oil_first_90_days_production"
> >> >> >>
> >> >> >> It thinks my numeric field is defined as a string? But when I view
> >> the
> >> >> >> schema, those 2 fields are defined as ints:
> >> >> >>
> >> >> >>
> >> >> >> When I run a normal query and choose xml as output format, then it
> >> also
> >> >> >> puts "int" elements into the hitlist, so the schema appears to be
> >> >> correct
> >> >> >> it's just when using this regress function that something goes
> >> wrong and
> >> >> >> solr thinks the field is string.
> >> >> >>
> >> >> >> Any suggestions?
> >> >> >> Thanks!
> >> >> >>
> >> >> >>
> >> >> >>
> >> >> >> On Thu, Mar 1, 2018 at 9:12 PM, Joel Bernstein <
> joelsolr@gmail.com>
> >> >> >> wrote:
> >> >> >>
> >> >> >>> The field type will also need to be in the schema:
> >> >> >>>
> >> >> >>> <!-- The "RandomSortField" is not used to store or search any
> >> >> >>>      data.  You can declare fields of this type in your schema
> >> >> >>>      to generate pseudo-random orderings of your docs for sorting
> >> >> >>>      or function purposes.  The ordering is generated based on the
> >> >> >>>      field name and the version of the index. As long as the index
> >> >> >>>      version remains unchanged, and the same field name is reused,
> >> >> >>>      the ordering of the docs will be consistent.
> >> >> >>>      If you want different pseudo-random orderings of documents,
> >> >> >>>      for the same version of the index, use a dynamicField and
> >> >> >>>      change the field name in the request.
> >> >> >>>   -->
> >> >> >>> <fieldType name="random" class="solr.RandomSortField" indexed="true" />
> >> >> >>>
> >> >> >>>
> >> >> >>> Joel Bernstein
> >> >> >>> http://joelsolr.blogspot.com/
> >> >> >>>
> >> >> >>> On Thu, Mar 1, 2018 at 8:00 PM, Joel Bernstein <
> joelsolr@gmail.com
> >> >
> >> >> >>> wrote:
> >> >> >>>
> >> >> >>> > You'll need to have this field in your schema:
> >> >> >>> >
> >> >> >>> > <dynamicField name="random_*" type="random" />
> >> >> >>> >
> >> >> >>> > I'll check to see if the default schema used with solr start -c
> >> has
> >> >> >>> this
> >> >> >>> > field, if not I'll add it. Thanks for pointing this out.
> >> >> >>> >
> >> >> >>> > I checked and right now the random expression is only accepting one fq,
> >> >> >>> > but I consider this a bug. It should accept multiple. I'll create a
> >> >> >>> > ticket for getting this fixed.
> >> >> >>> >
> >> >> >>> >
> >> >> >>> >
> >> >> >>> > Joel Bernstein
> >> >> >>> > http://joelsolr.blogspot.com/
> >> >> >>> >
> >> >> >>> > On Thu, Mar 1, 2018 at 4:55 PM, John Smith <
> localdevjs@gmail.com
> >> >
> >> >> >>> wrote:
> >> >> >>> >
> >> >> >>> >> Joel, thanks for the pointers to the streaming feature. I had no idea solr
> >> >> >>> >> had that (and also just discovered the very interesting sql feature! I will
> >> >> >>> >> be sure to investigate that in more detail in the future).
> >> >> >>> >>
> >> >> >>> >> However I'm having some trouble getting basic streaming
> >> functions
> >> >> >>> working.
> >> >> >>> >> I've already figured out that I had to move to "solr cloud"
> >> instead
> >> >> of
> >> >> >>> >> "solr standalone" because I was getting errors about "cannot
> >> find zk
> >> >> >>> >> instance" or whatever which went away when using "solr start
> -c"
> >> >> >>> instead.
> >> >> >>> >>
> >> >> >>> >> But now I'm trying to use the random function since that was
> >> one of
> >> >> >>> the
> >> >> >>> >> functions used in your example.
> >> >> >>> >>
> >> >> >>> >> random(tx_header, q="*:*", rows="100", fl="countyname")
> >> >> >>> >>
> >> >> >>> >> I posted that directly in the "stream" section of the solr
> >> admin UI.
> >> >> >>> This
> >> >> >>> >> is all on linux, with solr 7.1.0 and 7.2.1 (tried several
> >> versions
> >> >> in
> >> >> >>> case
> >> >> >>> >> it was a bug in one)
> >> >> >>> >>
> >> >> >>> >> I get back an error message:
> >> >> >>> >> *sort param could not be parsed as a query, and is not a field
> >> that
> >> >> >>> exists
> >> >> >>> >> in the index: random_-255009774*
> >> >> >>> >>
> >> >> >>> >> I'm not passing in any sort field anywhere. But the solr logs
> >> show
> >> >> >>> these
> >> >> >>> >> three log entries:
> >> >> >>> >>
> >> >> >>> >> 2018-03-01 21:41:18.954 INFO  (qtp257513673-21) [c:tx_header
> >> >> s:shard1
> >> >> >>> >> r:core_node2 x:tx_header_shard1_replica_n1] o.a.s.c.S.Request
> >> >> >>> >> [tx_header_shard1_replica_n1]  webapp=/solr path=/select
> >> >> >>> >> params={q=*:*&_stateVer_=tx_header:6&fl=countyname
> >> >> >>> >> *&sort=random_-255009774+asc*&rows=100&wt=javabin&version=2}
> >> >> >>> status=400
> >> >> >>> >> QTime=19
> >> >> >>> >>
> >> >> >>> >> 2018-03-01 21:41:18.966 ERROR (qtp257513673-17) [c:tx_header
> >> >> s:shard1
> >> >> >>> >> r:core_node2 x:tx_header_shard1_replica_n1]
> >> >> >>> o.a.s.c.s.i.CloudSolrClient
> >> >> >>> >> Request to collection [tx_header] failed due to (400)
> >> >> >>> >> org.apache.solr.client.solrj.impl.HttpSolrClient$
> >> >> RemoteSolrException:
> >> >> >>> >> Error
> >> >> >>> >> from server at http://192.168.13.31:8983/solr/tx_header: sort
> >> param
> >> >> >>> could
> >> >> >>> >> not be parsed as a query, and is not a field that exists in
> the
> >> >> index:
> >> >> >>> >> random_-255009774, retry? 0
> >> >> >>> >>
> >> >> >>> >> 2018-03-01 21:41:18.968 ERROR (qtp257513673-17) [c:tx_header
> >> >> s:shard1
> >> >> >>> >> r:core_node2 x:tx_header_shard1_replica_n1]
> >> >> >>> o.a.s.c.s.i.s.ExceptionStream
> >> >> >>> >> java.io.IOException:
> >> >> >>> >> org.apache.solr.client.solrj.impl.HttpSolrClient$
> >> >> RemoteSolrException:
> >> >> >>> >> Error
> >> >> >>> >> from server at http://192.168.13.31:8983/solr/tx_header: sort
> >> param
> >> >> >>> could
> >> >> >>> >> not be parsed as a query, and is not a field that exists in
> the
> >> >> index:
> >> >> >>> >> random_-255009774
> >> >> >>> >>
> >> >> >>> >>
> >> >> >>> >> So basically it looks like solr is injecting the "sort=random_" stuff into
> >> >> >>> >> my query and of course that is failing on the search since that
> >> >> >>> >> field/column doesn't exist in my schema. Every time I run the random
> >> >> >>> >> function, I get a slightly different field name that it injects, but they
> >> >> >>> >> all start with "random_" etc.
> >> >> >>> >>
> >> >> >>> >> I have tried adding my own sort field instead, hoping solr
> >> wouldn't
> >> >> >>> inject
> >> >> >>> >> one for me, but it still injected a random sort fieldname:
> >> >> >>> >> random(tx_header, q="*:*", rows="100", fl="countyname", sort="countyname asc")
> >> >> >>> >>
> >> >> >>> >>
> >> >> >>> >> Assuming I can fix that whole problem, my second question is:
> >> can I
> >> >> >>> add
> >> >> >>> >> multiple "fq=" parameters to the random function? I build a
> >> pretty
> >> >> >>> >> complicated query using many fq= fields, and then want to run
> >> some
> >> >> >>> stats
> >> >> >>> >> on
> >> >> >>> >> that hitlist; so somehow I have to pass in the query that made
> >> up
> >> >> the
> >> >> >>> >> exact
> >> >> >>> >> hitlist to these various functions, but when I used multiple
> >> "fq="
> >> >> >>> values
> >> >> >>> >> it only seemed to use the last one I specified and just
> ignored
> >> all
> >> >> >>> the
> >> >> >>> >> previous fq's?
> >> >> >>> >>
> >> >> >>> >> Thanks in advance for any comments/suggestions...!
> >> >> >>> >>
> >> >> >>> >>
> >> >> >>> >>
> >> >> >>> >>
> >> >> >>> >> On Fri, Feb 23, 2018 at 5:59 PM, Joel Bernstein <
> >> joelsolr@gmail.com
> >> >> >
> >> >> >>> >> wrote:
> >> >> >>> >>
> >> >> >>> >> > This is going to be a complex answer because Solr actually
> >> now has
> >> >> >>> >> multiple
> >> >> >>> >> > ways of doing regression analysis as part of the Streaming
> >> >> >>> Expression
> >> >> >>> >> > statistical programming library. The basic documentation is
> >> here:
> >> >> >>> >> >
> >> >> >>> >> > https://lucene.apache.org/solr/guide/7_2/statistical-programming.html
> >> >> >>> >> >
> >> >> >>> >> > Here is a sample expression that performs a simple linear
> >> >> >>> regression in
> >> >> >>> >> > Solr 7.2:
> >> >> >>> >> >
> >> >> >>> >> > let(a=random(collection1, q="any query", rows="15000", fl="fieldA, fieldB"),
> >> >> >>> >> >     b=col(a, fieldA),
> >> >> >>> >> >     c=col(a, fieldB),
> >> >> >>> >> >     d=regress(b, c))
> >> >> >>> >> >
> >> >> >>> >> >
> >> >> >>> >> > The expression above takes a random sample of 15000 results
> >> from
> >> >> >>> >> > collection1. The result set will include fieldA and fieldB
> in
> >> each
> >> >> >>> >> record.
> >> >> >>> >> > The result set is stored in variable "a".
> >> >> >>> >> >
> >> >> >>> >> > Then the "col" function creates arrays of numbers from the
> >> results
> >> >> >>> >> stored
> >> >> >>> >> > in variable a. The values in fieldA are stored in the
> variable
> >> >> "b".
> >> >> >>> The
> >> >> >>> >> > values in fieldB are stored in variable "c".
> >> >> >>> >> >
> >> >> >>> >> > Then the regress function performs a simple linear
> regression
> >> on
> >> >> >>> arrays
> >> >> >>> >> > stored in variables "b" and "c".
> >> >> >>> >> >
> >> >> >>> >> > The output of the regress function is a map containing the
> >> >> >>> regression
> >> >> >>> >> > result. This result includes RSquared and other attributes
> of
> >> the
> >> >> >>> >> > regression model such as R (correlation), slope, y intercept
> >> >> etc...
> >> >> >>> >> >
> >> >> >>> >> >
> >> >> >>> >> >
> >> >> >>> >> >
> >> >> >>> >> >
> >> >> >>> >> >
> >> >> >>> >> >
> >> >> >>> >> >
> >> >> >>> >> >
> >> >> >>> >> > Joel Bernstein
> >> >> >>> >> > http://joelsolr.blogspot.com/
> >> >> >>> >> >
> >> >> >>> >> > On Fri, Feb 23, 2018 at 3:10 PM, John Smith <
> >> localdevjs@gmail.com
> >> >> >
> >> >> >>> >> wrote:
> >> >> >>> >> >
> >> >> >>> >> > > Hi Joel, thanks for the answer. I'm not really a stats
> guy,
> >> but
> >> >> >>> the
> >> >> >>> >> end
> >> >> >>> >> > > result of all this is supposed to be obtaining R^2. Is
> >> there no
> >> >> >>> way of
> >> >> >>> >> > > obtaining this value, then (short of iterating over all
> the
> >> >> >>> results in
> >> >> >>> >> > the
> >> >> >>> >> > > hitlist and calculating it myself)?
> >> >> >>> >> > >
> >> >> >>> >> > > On Fri, Feb 23, 2018 at 12:26 PM, Joel Bernstein <
> >> >> >>> joelsolr@gmail.com>
> >> >> >>> >> > > wrote:
> >> >> >>> >> > >
> >> >> >>> >> > > > Typically SSE is the sum of the squared errors of the
> >> >> >>> prediction in
> >> >> >>> >> a
> >> >> >>> >> > > > regression analysis. The stats component doesn't perform
> >> >> >>> regression,
> >> >> >>> >> > > > although it might be a nice feature.
> >> >> >>> >> > > >
> >> >> >>> >> > > >
> >> >> >>> >> > > >
> >> >> >>> >> > > > Joel Bernstein
> >> >> >>> >> > > > http://joelsolr.blogspot.com/
> >> >> >>> >> > > >
> >> >> >>> >> > > > On Fri, Feb 23, 2018 at 12:17 PM, John Smith <
> >> >> >>> localdevjs@gmail.com>
> >> >> >>> >> > > wrote:
> >> >> >>> >> > > >
> >> >> >>> >> > > > > I'm using solr, and enabling stats as per this page:
> >> >> >>> >> > > > > https://lucene.apache.org/solr/guide/6_6/the-stats-component.html
> >> >> >>> >> > > > >
> >> >> >>> >> > > > > I want to get more stat values though. Specifically
> I'm
> >> >> >>> looking
> >> >> >>> >> for
> >> >> >>> >> > > > > r-squared (coefficient of determination). This value
> is
> >> not
> >> >> >>> >> present
> >> >> >>> >> > in
> >> >> >>> >> > > > > solr, however some of the pieces used to calculate r^2
> >> are
> >> >> in
> >> >> >>> the
> >> >> >>> >> > stats
> >> >> >>> >> > > > > element, for example:
> >> >> >>> >> > > > >
> >> >> >>> >> > > > > <double name="min">0.0</double>
> >> >> >>> >> > > > > <double name="max">10.0</double>
> >> >> >>> >> > > > > <long name="count">15</long>
> >> >> >>> >> > > > > <long name="missing">17</long>
> >> >> >>> >> > > > > <double name="sum">85.0</double>
> >> >> >>> >> > > > > <double name="sumOfSquares">603.0</double>
> >> >> >>> >> > > > > <double name="mean">5.666666666666667</double>
> >> >> >>> >> > > > > <double name="stddev">2.943920288775949</double>
> >> >> >>> >> > > > >
> >> >> >>> >> > > > >
> >> >> >>> >> > > > > So I have the sumOfSquares available (SST), and using
> >> this
> >> >> >>> >> > > calculation, I
> >> >> >>> >> > > > > can get R^2:
> >> >> >>> >> > > > >
> >> >> >>> >> > > > > R^2 = 1 - SSE/SST
> >> >> >>> >> > > > >
> >> >> >>> >> > > > > All I need then is SSE. Is there anyway I can get SSE
> >> from
> >> >> >>> those
> >> >> >>> >> > other
> >> >> >>> >> > > > > stats in solr?
> >> >> >>> >> > > > >
> >> >> >>> >> > > > > Thanks in advance!
> >> >> >>> >> > > > >
> >> >> >>> >> > > >
> >> >> >>> >> > >
> >> >> >>> >> >
> >> >> >>> >>
> >> >> >>> >
> >> >> >>> >
> >> >> >>>
> >> >> >>
> >> >> >>
> >> >> >
> >> >>
> >>
> >
> >
>
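
A side note on the stats values quoted in the original question above: the "sumOfSquares" figure is the raw sum of the squared values, not the total sum of squares about the mean (SST). SST can be recovered from "sum", "count" and "sumOfSquares", and doing so reproduces the reported "stddev". The following is only a minimal Python sketch using the numbers from that stats output; the variable names are illustrative and not part of Solr.

```python
import math

# Values copied from the stats output quoted in the original question.
count = 15
total = 85.0            # "sum"
sum_of_squares = 603.0  # "sumOfSquares": sum of x_i^2, not centered on the mean

mean = total / count

# Total sum of squares about the mean: SST = sum(x_i^2) - (sum x_i)^2 / n
sst = sum_of_squares - total ** 2 / count

# Sample standard deviation recovered from SST (n - 1 in the denominator).
stddev = math.sqrt(sst / (count - 1))

print(mean, sst, stddev)
```

The recovered stddev agrees with the value the stats component reported, which suggests SST is available from these fields. SSE still needs a fitted model, so it cannot be derived from single-field stats alone.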

Re: statistics in hitlist

Posted by Joel Bernstein <jo...@gmail.com>.
I've been working on the user guide for the math expressions. Here is the
page on regression:

https://github.com/joel-bernstein/lucene-solr/blob/math_expressions_documentation/solr/solr-ref-guide/src/regression.adoc

This page is part of the larger math expression documentation. The TOC is
here:

https://github.com/joel-bernstein/lucene-solr/blob/math_expressions_documentation/solr/solr-ref-guide/src/math-expressions.adoc

The docs are still very rough but you can get an idea of the coverage.
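
For readers who want to see what a simple linear regression reports, here is a rough Python sketch of the arithmetic behind slope, intercept and R^2 = 1 - SSE/SST. It is not the Solr implementation; the function name simple_regression and the sample arrays are made up for illustration.

```python
def simple_regression(x, y):
    """Least-squares fit of y = intercept + slope * x, with R^2 = 1 - SSE/SST."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n

    # Centered sums used by the closed-form least-squares solution.
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))

    slope = sxy / sxx
    intercept = mean_y - slope * mean_x

    # SSE: squared errors of the predictions; SST: squared deviations of y from its mean.
    sse = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    sst = sum((yi - mean_y) ** 2 for yi in y)

    return {
        "slope": slope,
        "intercept": intercept,
        "sse": sse,
        "sst": sst,
        "r_squared": 1 - sse / sst,
    }


# Hypothetical sample data, only to exercise the function.
result = simple_regression([1, 2, 3, 4, 5], [2, 4, 5, 4, 5])
print(result)
```

Note that if every y value is identical, SST is zero and R^2 is undefined, so both arrays need values that actually vary.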



Joel Bernstein
http://joelsolr.blogspot.com/

On Thu, Mar 15, 2018 at 10:26 PM, Joel Bernstein <jo...@gmail.com> wrote:

> If you want to get everything in one query you can do this:
>
> let(echo="d,e",
>      a=search(tx_prod_production,
>               q="oil_first_90_days_production:[1 TO *]",
>               fq="isParent:true",
>               rows="1500000",
>               fl="id,oil_first_90_days_production,oil_last_30_days_production",
>               sort="id asc"),
>      b=col(a, oil_first_90_days_production),
>      c=col(a, oil_last_30_days_production),
>      d=regress(b, c),
>      e=someExpression())
>
> The echo parameter tells the let expression which variables to output.
>
>
>
> Joel Bernstein
> http://joelsolr.blogspot.com/
>

Re: statistics in hitlist

Posted by Joel Bernstein <jo...@gmail.com>.
If you want to get everything in one query you can do this:

let(echo="d,e",
     a=search(tx_prod_production,
              q="oil_first_90_days_production:[1 TO *]",
              fq="isParent:true",
              rows="1500000",
              fl="id,oil_first_90_days_production,oil_last_30_days_production",
              sort="id asc"),
     b=col(a, oil_first_90_days_production),
     c=col(a, oil_last_30_days_production),
     d=regress(b, c),
     e=someExpression())

The echo parameter tells the let expression which variables to output.
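
If you would rather send the expression from a script than paste it into the admin UI, something along these lines might work. This is only a sketch: it assumes Solr is listening at localhost:8983 and that the collection's /stream handler accepts the expression in an "expr" parameter. The helper name stream_request is made up, and only "d" is echoed because "e" above was just a placeholder.

```python
from urllib.parse import urlencode

SOLR_BASE = "http://localhost:8983/solr"  # assumption: adjust host and port
COLLECTION = "tx_prod_production"

# The regression expression from this thread, echoing only the regress result "d".
expr = (
    'let(echo="d", '
    'a=search(tx_prod_production, q="oil_first_90_days_production:[1 TO *]", '
    'fq="isParent:true", rows="1500000", '
    'fl="id,oil_first_90_days_production,oil_last_30_days_production", '
    'sort="id asc"), '
    'b=col(a, oil_first_90_days_production), '
    'c=col(a, oil_last_30_days_production), '
    'd=regress(b, c))'
)


def stream_request(base, collection, expression):
    """Build the URL and form-encoded body for a POST to the /stream handler."""
    url = f"{base}/{collection}/stream"
    body = urlencode({"expr": expression}).encode("utf-8")
    return url, body


url, body = stream_request(SOLR_BASE, COLLECTION, expr)
print(url)
# To actually run it against a live Solr:
#   import urllib.request
#   print(urllib.request.urlopen(urllib.request.Request(url, data=body)).read())
```

The request is only built here, not sent, so the snippet runs without a Solr instance.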



Joel Bernstein
http://joelsolr.blogspot.com/

On Thu, Mar 15, 2018 at 3:13 PM, Erick Erickson <er...@gmail.com>
wrote:

> What does the fq clause look like?
>
> On Thu, Mar 15, 2018 at 11:51 AM, John Smith <lo...@gmail.com> wrote:
> > Hi Joel, I did some more work on this statistics stuff today. Yes, we do
> > have nulls in our data; the document contains many fields, and we don't
> > always have values for each field, but we can't set the nulls to 0 either
> > (or any other value, really) as that will mess up other calculations (such
> > as calculating averages); we would normally just ignore fields with null
> > values when calculating stats manually ourselves.
> >
> > Adding a check in the "q" parameter to ensure that the fields used in the
> > calculations are > 0 does work now. Thanks for the tip (and sorry, should
> > have caught that myself). But I am unable to use "fq" for these checks;
> > they have to be added to the q instead. Adding fq's doesn't have any effect.
> >
> >
> > Anyway, I'm trying to change this up a little. This is what I'm currently
> > using (switched from "random" to "search" since I actually need the full
> > hitlist not just a random subset):
> >
> > let(a=search(tx_prod_production,
> >              q="oil_first_90_days_production:[1 TO *]",
> >              fq="isParent:true",
> >              rows="1500000",
> >              fl="id,oil_first_90_days_production,oil_last_30_days_production",
> >              sort="id asc"),
> >      b=col(a, oil_first_90_days_production),
> >      c=col(a, oil_last_30_days_production),
> >      d=regress(b, c))
> >
> > So I have 2 fields defined there, and that works great (in terms of a test
> > and running the query); but I need to replace the second field,
> > "oil_last_30_days_production", with the avg value of
> > oil_first_90_days_production.
> >
> > I can get the avg with this expression:
> > stats(tx_prod_production, q="oil_first_90_days_production:[1 TO *]",
> > fq="isParent:true", rows="1500000", avg(oil_first_90_days_production))
> >
> > But I don't know how to push that avg value into the first streaming
> > expression; I'm guessing I have to set "c=....", but that is where I'm
> > getting lost, since avg only returns 1 value and the first parameter, "b",
> > returns a list of sorts. Somehow I have to get the avg value stuffed inside a
> > "col", where it is the same value for every row in the hitlist...?
> >
> > Thanks for your help!
> >
> >
> > On Mon, Mar 5, 2018 at 10:50 PM, Joel Bernstein <jo...@gmail.com>
> wrote:
> >
> >> I suspect you've got nulls in your data. I just tested with null values
> and
> >> got the same error. For testing purposes try loading the data with
> default
> >> values of zero.
> >>
> >>
> >> Joel Bernstein
> >> http://joelsolr.blogspot.com/
> >>
> >> On Mon, Mar 5, 2018 at 10:12 PM, Joel Bernstein <jo...@gmail.com>
> >> wrote:
> >>
> >> > Let's break the expression down and build it up slowly. Let's start
> with:
> >> >
> >> > let(echo="true",
> >> >      a=random(tx_prod_production, q="*:*", fq="isParent:true",
> rows="15",
> >> > fl="oil_first_90_days_production,oil_last_30_days_production"),
> >> >      b=col(a, oil_first_90_days_production))
> >> >
> >> >
> >> > This should return variables a and b. Let's see what the data looks
> like.
> >> > I changed the rows from 15000 to 15. If it all looks good we can
> >> > expand the rows and continue adding functions.
> >> >
> >> >
> >> >
> >> >
> >> > Joel Bernstein
> >> > http://joelsolr.blogspot.com/
> >> >
> >> > On Mon, Mar 5, 2018 at 4:11 PM, John Smith <lo...@gmail.com>
> wrote:
> >> >
> >> >> Thanks Joel for your help on this.
> >> >>
> >> >> What I've done so far:
> >> >> - unzip downloaded solr-7.2
> >> >> - modify the _default "managed-schema" to add the random field type
> and
> >> >> the dynamic random field
> >> >> - start solr7 using "solr start -c"
> >> >> - indexed my data using pint/pdouble/boolean field types etc
> >> >>
> >> >> I can now run the random function all by itself, it returns random
> >> >> results as expected. So far so good!
> >> >>
> >> >> However... now trying to get the regression stuff working:
> >> >>
> >> >> let(a=random(tx_prod_production, q="*:*", fq="isParent:true",
> >> >> rows="15000",
> >> >> fl="oil_first_90_days_production,oil_last_30_days_production"),
> >> >>     b=col(a, oil_first_90_days_production),
> >> >>     c=col(a, oil_last_30_days_production),
> >> >>     d=regress(b, c))
> >> >>
> >> >> Posted directly into solr admin UI. Run the streaming expression and
> I
> >> >> get this error message:
> >> >> "EXCEPTION": "Failed to evaluate expression regress(b,c) - Numeric
> value
> >> >> expected but found type java.lang.String for value
> >> >> oil_first_90_days_production"
> >> >>
> >> >> It thinks my numeric field is defined as a string? But when I view
> the
> >> >> schema, those 2 fields are defined as ints:
> >> >>
> >> >>
> >> >> When I run a normal query and choose xml as output format, then it
> also
> >> >> puts "int" elements into the hitlist, so the schema appears to be
> >> correct
> >> >> it's just when using this regress function that something goes wrong
> and
> >> >> solr thinks the field is string.
> >> >>
> >> >> Any suggestions?
> >> >> Thanks!
> >> >>
> >> >>
> >> >>
> >> >> On Thu, Mar 1, 2018 at 9:12 PM, Joel Bernstein <jo...@gmail.com>
> >> >> wrote:
> >> >>
> >> >>> The field type will also need to be in the schema:
> >> >>>
> >> >>>  <!-- The "RandomSortField" is not used to store or search any
> >> >>>
> >> >>>          data.  You can declare fields of this type it in your
> schema
> >> >>>
> >> >>>          to generate pseudo-random orderings of your docs for
> sorting
> >> >>>
> >> >>>          or function purposes.  The ordering is generated based on
> the
> >> >>> field
> >> >>>
> >> >>>          name and the version of the index. As long as the index
> >> version
> >> >>>
> >> >>>          remains unchanged, and the same field name is reused,
> >> >>>
> >> >>>          the ordering of the docs will be consistent.
> >> >>>
> >> >>>          If you want different psuedo-random orderings of documents,
> >> >>>
> >> >>>          for the same version of the index, use a dynamicField and
> >> >>>
> >> >>>          change the field name in the request.
> >> >>>
> >> >>>      -->
> >> >>>
> >> >>> <fieldType name="random" class="solr.RandomSortField"
> indexed="true" />
> >> >>>
> >> >>>
> >> >>> Joel Bernstein
> >> >>> http://joelsolr.blogspot.com/
> >> >>>
> >> >>> On Thu, Mar 1, 2018 at 8:00 PM, Joel Bernstein <jo...@gmail.com>
> >> >>> wrote:
> >> >>>
> >> >>> > You'll need to have this field in your schema:
> >> >>> >
> >> >>> > <dynamicField name="random_*" type="random" />
> >> >>> >
> >> >>> > I'll check to see if the default schema used with solr start -c
> has
> >> >>> this
> >> >>> > field, if not I'll add it. Thanks for pointing this out.
> >> >>> >
> >> >>> > I checked and right now the random expression is only accepting
> one
> >> fq,
> >> >>> > but I consider this a bug. It should accept multiple. I'll create
> >> >>> ticket
> >> >>> > for getting this fixed.
> >> >>> >
> >> >>> >
> >> >>> >
> >> >>> > Joel Bernstein
> >> >>> > http://joelsolr.blogspot.com/
> >> >>> >
> >> >>> > On Thu, Mar 1, 2018 at 4:55 PM, John Smith <lo...@gmail.com>
> >> >>> wrote:
> >> >>> >
> >> >>> >> Joel, thanks for the pointers to the streaming feature. I had no
> >> idea
> >> >>> solr
> >> >>> >> had that (and also just discovered the very interesting sql
> feature!
> >> I
> >> >>> will
> >> >>> >> be sure to investigate that in more detail in the future).
> >> >>> >>
> >> >>> >> However I'm having some trouble getting basic streaming functions
> >> >>> working.
> >> >>> >> I've already figured out that I had to move to "solr cloud"
> instead
> >> of
> >> >>> >> "solr standalone" because I was getting errors about "cannot
> find zk
> >> >>> >> instance" or whatever which went away when using "solr start -c"
> >> >>> instead.
> >> >>> >>
> >> >>> >> But now I'm trying to use the random function since that was one
> of
> >> >>> the
> >> >>> >> functions used in your example.
> >> >>> >>
> >> >>> >> random(tx_header, q="*:*", rows="100", fl="countyname")
> >> >>> >>
> >> >>> >> I posted that directly in the "stream" section of the solr admin
> UI.
> >> >>> This
> >> >>> >> is all on linux, with solr 7.1.0 and 7.2.1 (tried several
> versions
> >> in
> >> >>> case
> >> >>> >> it was a bug in one)
> >> >>> >>
> >> >>> >> I get back an error message:
> >> >>> >> *sort param could not be parsed as a query, and is not a field
> that
> >> >>> exists
> >> >>> >> in the index: random_-255009774*
> >> >>> >>
> >> >>> >> I'm not passing in any sort field anywhere. But the solr logs
> show
> >> >>> these
> >> >>> >> three log entries:
> >> >>> >>
> >> >>> >> 2018-03-01 21:41:18.954 INFO  (qtp257513673-21) [c:tx_header
> >> s:shard1
> >> >>> >> r:core_node2 x:tx_header_shard1_replica_n1] o.a.s.c.S.Request
> >> >>> >> [tx_header_shard1_replica_n1]  webapp=/solr path=/select
> >> >>> >> params={q=*:*&_stateVer_=tx_header:6&fl=countyname
> >> >>> >> *&sort=random_-255009774+asc*&rows=100&wt=javabin&version=2}
> >> >>> status=400
> >> >>> >> QTime=19
> >> >>> >>
> >> >>> >> 2018-03-01 21:41:18.966 ERROR (qtp257513673-17) [c:tx_header
> >> s:shard1
> >> >>> >> r:core_node2 x:tx_header_shard1_replica_n1]
> >> >>> o.a.s.c.s.i.CloudSolrClient
> >> >>> >> Request to collection [tx_header] failed due to (400)
> >> >>> >> org.apache.solr.client.solrj.impl.HttpSolrClient$
> >> RemoteSolrException:
> >> >>> >> Error
> >> >>> >> from server at http://192.168.13.31:8983/solr/tx_header: sort
> param
> >> >>> could
> >> >>> >> not be parsed as a query, and is not a field that exists in the
> >> index:
> >> >>> >> random_-255009774, retry? 0
> >> >>> >>
> >> >>> >> 2018-03-01 21:41:18.968 ERROR (qtp257513673-17) [c:tx_header
> >> s:shard1
> >> >>> >> r:core_node2 x:tx_header_shard1_replica_n1]
> >> >>> o.a.s.c.s.i.s.ExceptionStream
> >> >>> >> java.io.IOException:
> >> >>> >> org.apache.solr.client.solrj.impl.HttpSolrClient$
> >> RemoteSolrException:
> >> >>> >> Error
> >> >>> >> from server at http://192.168.13.31:8983/solr/tx_header: sort
> param
> >> >>> could
> >> >>> >> not be parsed as a query, and is not a field that exists in the
> >> index:
> >> >>> >> random_-255009774
> >> >>> >>
> >> >>> >>
> >> >>> >> So basically it looks like solr is injecting the "sort=random_"
> >> stuff
> >> >>> into
> >> >>> >> my query and of course that is failing on the search since that
> >> >>> >> field/column doesn't exist in my schema. Everytime I run the
> random
> >> >>> >> function, I get a slightly different field name that it injects,
> but
> >> >>> they
> >> >>> >> all start with "random_" etc.
> >> >>> >>
> >> >>> >> I have tried adding my own sort field instead, hoping solr
> wouldn't
> >> >>> inject
> >> >>> >> one for me, but it still injected a random sort fieldname:
> >> >>> >> random(tx_header, q="*:*", rows="100", fl="countyname",
> >> >>> sort="countyname
> >> >>> >> asc")
> >> >>> >>
> >> >>> >>
> >> >>> >> Assuming I can fix that whole problem, my second question is:
> can I
> >> >>> add
> >> >>> >> multiple "fq=" parameters to the random function? I build a
> pretty
> >> >>> >> complicated query using many fq= fields, and then want to run
> some
> >> >>> stats
> >> >>> >> on
> >> >>> >> that hitlist; so somehow I have to pass in the query that made up
> >> the
> >> >>> >> exact
> >> >>> >> hitlist to these various functions, but when I used multiple
> "fq="
> >> >>> values
> >> >>> >> it only seemed to use the last one I specified and just ignored
> all
> >> >>> the
> >> >>> >> previous fq's?
> >> >>> >>
> >> >>> >> Thanks in advance for any comments/suggestions...!
> >> >>> >>
> >> >>> >>
> >> >>> >>
> >> >>> >>
> >> >>> >> On Fri, Feb 23, 2018 at 5:59 PM, Joel Bernstein <
> joelsolr@gmail.com
> >> >
> >> >>> >> wrote:
> >> >>> >>
> >> >>> >> > This is going to be a complex answer because Solr actually now
> has
> >> >>> >> multiple
> >> >>> >> > ways of doing regression analysis as part of the Streaming
> >> >>> Expression
> >> >>> >> > statistical programming library. The basic documentation is
> here:
> >> >>> >> >
> >> >>> >> > https://lucene.apache.org/solr/guide/7_2/statistical-program
> >> >>> ming.html
> >> >>> >> >
> >> >>> >> > Here is a sample expression that performs a simple linear
> >> >>> regression in
> >> >>> >> > Solr 7.2:
> >> >>> >> >
> >> >>> >> > let(a=random(collection1, q="any query", rows="15000",
> fl="fieldA,
> >> >>> >> > fieldB"),
> >> >>> >> >     b=col(a, fieldA),
> >> >>> >> >     c=col(a, fieldB),
> >> >>> >> >     d=regress(b, c))
> >> >>> >> >
> >> >>> >> >
> >> >>> >> > The expression above takes a random sample of 15000 results
> from
> >> >>> >> > collection1. The result set will include fieldA and fieldB in
> each
> >> >>> >> record.
> >> >>> >> > The result set is stored in variable "a".
> >> >>> >> >
> >> >>> >> > Then the "col" function creates arrays of numbers from the
> results
> >> >>> >> stored
> >> >>> >> > in variable a. The values in fieldA are stored in the variable
> >> "b".
> >> >>> The
> >> >>> >> > values in fieldB are stored in variable "c".
> >> >>> >> >
> >> >>> >> > Then the regress function performs a simple linear regression
> on
> >> >>> arrays
> >> >>> >> > stored in variables "b" and "c".
> >> >>> >> >
> >> >>> >> > The output of the regress function is a map containing the
> >> >>> regression
> >> >>> >> > result. This result includes RSquared and other attributes of
> the
> >> >>> >> > regression model such as R (correlation), slope, y intercept
> >> etc...
> >> >>> >> >
> >> >>> >> >
> >> >>> >> >
> >> >>> >> >
> >> >>> >> >
> >> >>> >> >
> >> >>> >> >
> >> >>> >> >
> >> >>> >> >
> >> >>> >> > Joel Bernstein
> >> >>> >> > http://joelsolr.blogspot.com/
> >> >>> >> >
> >> >>> >> > On Fri, Feb 23, 2018 at 3:10 PM, John Smith <
> localdevjs@gmail.com
> >> >
> >> >>> >> wrote:
> >> >>> >> >
> >> >>> >> > > Hi Joel, thanks for the answer. I'm not really a stats guy,
> but
> >> >>> the
> >> >>> >> end
> >> >>> >> > > result of all this is supposed to be obtaining R^2. Is there
> no
> >> >>> way of
> >> >>> >> > > obtaining this value, then (short of iterating over all the
> >> >>> results in
> >> >>> >> > the
> >> >>> >> > > hitlist and calculating it myself)?
> >> >>> >> > >
> >> >>> >> > > On Fri, Feb 23, 2018 at 12:26 PM, Joel Bernstein <
> >> >>> joelsolr@gmail.com>
> >> >>> >> > > wrote:
> >> >>> >> > >
> >> >>> >> > > > Typically SSE is the sum of the squared errors of the
> >> >>> prediction in
> >> >>> >> a
> >> >>> >> > > > regression analysis. The stats component doesn't perform
> >> >>> regression,
> >> >>> >> > > > although it might be a nice feature.
> >> >>> >> > > >
> >> >>> >> > > >
> >> >>> >> > > >
> >> >>> >> > > > Joel Bernstein
> >> >>> >> > > > http://joelsolr.blogspot.com/
> >> >>> >> > > >
> >> >>> >> > > > On Fri, Feb 23, 2018 at 12:17 PM, John Smith <
> >> >>> localdevjs@gmail.com>
> >> >>> >> > > wrote:
> >> >>> >> > > >
> >> >>> >> > > > > I'm using solr, and enabling stats as per this page:
> >> >>> >> > > > > https://lucene.apache.org/solr/guide/6_6/the-stats-
> >> component
> >> >>> .html
> >> >>> >> > > > >
> >> >>> >> > > > > I want to get more stat values though. Specifically I'm
> >> >>> looking
> >> >>> >> for
> >> >>> >> > > > > r-squared (coefficient of determination). This value is
> not
> >> >>> >> present
> >> >>> >> > in
> >> >>> >> > > > > solr, however some of the pieces used to calculate r^2
> are
> >> in
> >> >>> the
> >> >>> >> > stats
> >> >>> >> > > > > element, for example:
> >> >>> >> > > > >
> >> >>> >> > > > > <double name="min">0.0</double>
> >> >>> >> > > > > <double name="max">10.0</double>
> >> >>> >> > > > > <long name="count">15</long>
> >> >>> >> > > > > <long name="missing">17</long>
> >> >>> >> > > > > <double name="sum">85.0</double>
> >> >>> >> > > > > <double name="sumOfSquares">603.0</double>
> >> >>> >> > > > > <double name="mean">5.666666666666667</double>
> >> >>> >> > > > > <double name="stddev">2.943920288775949</double>
> >> >>> >> > > > >
> >> >>> >> > > > >
> >> >>> >> > > > > So I have the sumOfSquares available (SST), and using
> this
> >> >>> >> > > calculation, I
> >> >>> >> > > > > can get R^2:
> >> >>> >> > > > >
> >> >>> >> > > > > R^2 = 1 - SSE/SST
> >> >>> >> > > > >
> >> >>> >> > > > > All I need then is SSE. Is there anyway I can get SSE
> from
> >> >>> those
> >> >>> >> > other
> >> >>> >> > > > > stats in solr?
> >> >>> >> > > > >
> >> >>> >> > > > > Thanks in advance!
> >> >>> >> > > > >
> >> >>> >> > > >
> >> >>> >> > >
> >> >>> >> >
> >> >>> >>
> >> >>> >
> >> >>> >
> >> >>>
> >> >>
> >> >>
> >> >
> >>
>

Re: statistics in hitlist

Posted by Erick Erickson <er...@gmail.com>.
What does the fq clause look like?

On Thu, Mar 15, 2018 at 11:51 AM, John Smith <lo...@gmail.com> wrote:
> Hi Joel, I did some more work on this statistics stuff today. Yes, we do
> have nulls in our data; the document contains many fields, we don't always
> have values for each field, but we can't set the nulls to 0 either (or any
> other value, really) as that will mess up other calculations (such as when
> calculating average etc); we would normally just ignore fields with null
> values when calculating stats manually ourselves.
>
> Adding a check in the "q" parameter to ensure that the fields used in the
> calculations are > 0 does work now. Thanks for the tip (and sorry, should
> have caught that myself). But I am unable to use "fq" for these checks,
> they have to be added to the q instead. Adding fq's doesn't have any effect.
>
>
> Anyway, I'm trying to change this up a little. This is what I'm currently
> using (switched from "random" to "search" since I actually need the full
> hitlist not just a random subset):
>
> let(a=search(tx_prod_production,
>              q="oil_first_90_days_production:[1 TO *]",
>              fq="isParent:true",
>              rows="1500000",
>              fl="id,oil_first_90_days_production,oil_last_30_days_production",
>              sort="id asc"),
>      b=col(a, oil_first_90_days_production),
>      c=col(a, oil_last_30_days_production),
>      d=regress(b, c))
>
> So I have 2 fields there defined, that works great (in terms of a test and
> running the query); but I need to replace the second field,
> "oil_last_30_days_production" with the avg value in
> oil_first_90_days_production.
>
> I can get the avg with this expression:
> stats(tx_prod_production, q="oil_first_90_days_production:[1 TO *]",
> fq="isParent:true", rows="1500000", avg(oil_first_90_days_production))
>
> But I don't know how to push that avg value into the first streaming
> expression; guessing I have to set "c=...." but that is where I'm getting
> lost, since avg only returns 1 value and the first parameter, "b", returns
> a list of sorts. Somehow I have to get the avg value stuffed inside a
> "col", where it is the same value for every row in the hitlist...?
>
> Thanks for your help!
>
>
> On Mon, Mar 5, 2018 at 10:50 PM, Joel Bernstein <jo...@gmail.com> wrote:
>
>> I suspect you've got nulls in your data. I just tested with null values and
>> got the same error. For testing purposes try loading the data with default
>> values of zero.
>>
>>
>> Joel Bernstein
>> http://joelsolr.blogspot.com/
>>
>> On Mon, Mar 5, 2018 at 10:12 PM, Joel Bernstein <jo...@gmail.com>
>> wrote:
>>
>> > Let's break the expression down and build it up slowly. Let's start with:
>> >
>> > let(echo="true",
>> >      a=random(tx_prod_production, q="*:*", fq="isParent:true", rows="15",
>> > fl="oil_first_90_days_production,oil_last_30_days_production"),
>> >      b=col(a, oil_first_90_days_production))
>> >
>> >
>> > This should return variables a and b. Let's see what the data looks like.
>> > I changed the rows from 15000 to 15. If it all looks good we can expand
>> > the rows and continue adding functions.
>> >
>> >
>> >
>> >
>> > Joel Bernstein
>> > http://joelsolr.blogspot.com/
>> >
>> > On Mon, Mar 5, 2018 at 4:11 PM, John Smith <lo...@gmail.com> wrote:
>> >
>> >> Thanks Joel for your help on this.
>> >>
>> >> What I've done so far:
>> >> - unzip downloaded solr-7.2
>> >> - modify the _default "managed-schema" to add the random field type and
>> >> the dynamic random field
>> >> - start solr7 using "solr start -c"
>> >> - indexed my data using pint/pdouble/boolean field types etc
>> >>
>> >> I can now run the random function all by itself, it returns random
>> >> results as expected. So far so good!
>> >>
>> >> However... now trying to get the regression stuff working:
>> >>
>> >> let(a=random(tx_prod_production, q="*:*", fq="isParent:true",
>> >> rows="15000",
>> >> fl="oil_first_90_days_production,oil_last_30_days_production"),
>> >>     b=col(a, oil_first_90_days_production),
>> >>     c=col(a, oil_last_30_days_production),
>> >>     d=regress(b, c))
>> >>
>> >> Posted directly into solr admin UI. Run the streaming expression and I
>> >> get this error message:
>> >> "EXCEPTION": "Failed to evaluate expression regress(b,c) - Numeric value
>> >> expected but found type java.lang.String for value
>> >> oil_first_90_days_production"
>> >>
>> >> It thinks my numeric field is defined as a string? But when I view the
>> >> schema, those 2 fields are defined as ints:
>> >>
>> >>
>> >> When I run a normal query and choose xml as output format, then it also
>> >> puts "int" elements into the hitlist, so the schema appears to be
>> correct
>> >> it's just when using this regress function that something goes wrong and
>> >> solr thinks the field is string.
>> >>
>> >> Any suggestions?
>> >> Thanks!
>> >>
>> >>
>> >>
>> >> On Thu, Mar 1, 2018 at 9:12 PM, Joel Bernstein <jo...@gmail.com>
>> >> wrote:
>> >>
>> >>> The field type will also need to be in the schema:
>> >>>
>> >>>  <!-- The "RandomSortField" is not used to store or search any
>> >>>
>> >>>          data.  You can declare fields of this type it in your schema
>> >>>
>> >>>          to generate pseudo-random orderings of your docs for sorting
>> >>>
>> >>>          or function purposes.  The ordering is generated based on the
>> >>> field
>> >>>
>> >>>          name and the version of the index. As long as the index
>> version
>> >>>
>> >>>          remains unchanged, and the same field name is reused,
>> >>>
>> >>>          the ordering of the docs will be consistent.
>> >>>
>> >>>          If you want different psuedo-random orderings of documents,
>> >>>
>> >>>          for the same version of the index, use a dynamicField and
>> >>>
>> >>>          change the field name in the request.
>> >>>
>> >>>      -->
>> >>>
>> >>> <fieldType name="random" class="solr.RandomSortField" indexed="true" />
>> >>>
>> >>>
>> >>> Joel Bernstein
>> >>> http://joelsolr.blogspot.com/
>> >>>
>> >>> On Thu, Mar 1, 2018 at 8:00 PM, Joel Bernstein <jo...@gmail.com>
>> >>> wrote:
>> >>>
>> >>> > You'll need to have this field in your schema:
>> >>> >
>> >>> > <dynamicField name="random_*" type="random" />
>> >>> >
>> >>> > I'll check to see if the default schema used with solr start -c has
>> >>> this
>> >>> > field, if not I'll add it. Thanks for pointing this out.
>> >>> >
>> >>> > I checked and right now the random expression is only accepting one
>> fq,
>> >>> > but I consider this a bug. It should accept multiple. I'll create
>> >>> ticket
>> >>> > for getting this fixed.
>> >>> >
>> >>> >
>> >>> >
>> >>> > Joel Bernstein
>> >>> > http://joelsolr.blogspot.com/
>> >>> >
>> >>> > On Thu, Mar 1, 2018 at 4:55 PM, John Smith <lo...@gmail.com>
>> >>> wrote:
>> >>> >
>> >>> >> Joel, thanks for the pointers to the streaming feature. I had no
>> idea
>> >>> solr
>> >>> >> had that (and also just discovered the very interesting sql feature!
>> I
>> >>> will
>> >>> >> be sure to investigate that in more detail in the future).
>> >>> >>
>> >>> >> However I'm having some trouble getting basic streaming functions
>> >>> working.
>> >>> >> I've already figured out that I had to move to "solr cloud" instead
>> of
>> >>> >> "solr standalone" because I was getting errors about "cannot find zk
>> >>> >> instance" or whatever which went away when using "solr start -c"
>> >>> instead.
>> >>> >>
>> >>> >> But now I'm trying to use the random function since that was one of
>> >>> the
>> >>> >> functions used in your example.
>> >>> >>
>> >>> >> random(tx_header, q="*:*", rows="100", fl="countyname")
>> >>> >>
>> >>> >> I posted that directly in the "stream" section of the solr admin UI.
>> >>> This
>> >>> >> is all on linux, with solr 7.1.0 and 7.2.1 (tried several versions
>> in
>> >>> case
>> >>> >> it was a bug in one)
>> >>> >>
>> >>> >> I get back an error message:
>> >>> >> *sort param could not be parsed as a query, and is not a field that
>> >>> exists
>> >>> >> in the index: random_-255009774*
>> >>> >>
>> >>> >> I'm not passing in any sort field anywhere. But the solr logs show
>> >>> these
>> >>> >> three log entries:
>> >>> >>
>> >>> >> 2018-03-01 21:41:18.954 INFO  (qtp257513673-21) [c:tx_header
>> s:shard1
>> >>> >> r:core_node2 x:tx_header_shard1_replica_n1] o.a.s.c.S.Request
>> >>> >> [tx_header_shard1_replica_n1]  webapp=/solr path=/select
>> >>> >> params={q=*:*&_stateVer_=tx_header:6&fl=countyname
>> >>> >> *&sort=random_-255009774+asc*&rows=100&wt=javabin&version=2}
>> >>> status=400
>> >>> >> QTime=19
>> >>> >>
>> >>> >> 2018-03-01 21:41:18.966 ERROR (qtp257513673-17) [c:tx_header
>> s:shard1
>> >>> >> r:core_node2 x:tx_header_shard1_replica_n1]
>> >>> o.a.s.c.s.i.CloudSolrClient
>> >>> >> Request to collection [tx_header] failed due to (400)
>> >>> >> org.apache.solr.client.solrj.impl.HttpSolrClient$
>> RemoteSolrException:
>> >>> >> Error
>> >>> >> from server at http://192.168.13.31:8983/solr/tx_header: sort param
>> >>> could
>> >>> >> not be parsed as a query, and is not a field that exists in the
>> index:
>> >>> >> random_-255009774, retry? 0
>> >>> >>
>> >>> >> 2018-03-01 21:41:18.968 ERROR (qtp257513673-17) [c:tx_header
>> s:shard1
>> >>> >> r:core_node2 x:tx_header_shard1_replica_n1]
>> >>> o.a.s.c.s.i.s.ExceptionStream
>> >>> >> java.io.IOException:
>> >>> >> org.apache.solr.client.solrj.impl.HttpSolrClient$
>> RemoteSolrException:
>> >>> >> Error
>> >>> >> from server at http://192.168.13.31:8983/solr/tx_header: sort param
>> >>> could
>> >>> >> not be parsed as a query, and is not a field that exists in the
>> index:
>> >>> >> random_-255009774
>> >>> >>
>> >>> >>
>> >>> >> So basically it looks like solr is injecting the "sort=random_"
>> stuff
>> >>> into
>> >>> >> my query and of course that is failing on the search since that
>> >>> >> field/column doesn't exist in my schema. Everytime I run the random
>> >>> >> function, I get a slightly different field name that it injects, but
>> >>> they
>> >>> >> all start with "random_" etc.
>> >>> >>
>> >>> >> I have tried adding my own sort field instead, hoping solr wouldn't
>> >>> inject
>> >>> >> one for me, but it still injected a random sort fieldname:
>> >>> >> random(tx_header, q="*:*", rows="100", fl="countyname",
>> >>> sort="countyname
>> >>> >> asc")
>> >>> >>
>> >>> >>
>> >>> >> Assuming I can fix that whole problem, my second question is: can I
>> >>> add
>> >>> >> multiple "fq=" parameters to the random function? I build a pretty
>> >>> >> complicated query using many fq= fields, and then want to run some
>> >>> stats
>> >>> >> on
>> >>> >> that hitlist; so somehow I have to pass in the query that made up
>> the
>> >>> >> exact
>> >>> >> hitlist to these various functions, but when I used multiple "fq="
>> >>> values
>> >>> >> it only seemed to use the last one I specified and just ignored all
>> >>> the
>> >>> >> previous fq's?
>> >>> >>
>> >>> >> Thanks in advance for any comments/suggestions...!
>> >>> >>
>> >>> >>
>> >>> >>
>> >>> >>
>> >>> >> On Fri, Feb 23, 2018 at 5:59 PM, Joel Bernstein <joelsolr@gmail.com
>> >
>> >>> >> wrote:
>> >>> >>
>> >>> >> > This is going to be a complex answer because Solr actually now has
>> >>> >> multiple
>> >>> >> > ways of doing regression analysis as part of the Streaming
>> >>> Expression
>> >>> >> > statistical programming library. The basic documentation is here:
>> >>> >> >
>> >>> >> > https://lucene.apache.org/solr/guide/7_2/statistical-program
>> >>> ming.html
>> >>> >> >
>> >>> >> > Here is a sample expression that performs a simple linear
>> >>> regression in
>> >>> >> > Solr 7.2:
>> >>> >> >
>> >>> >> > let(a=random(collection1, q="any query", rows="15000", fl="fieldA,
>> >>> >> > fieldB"),
>> >>> >> >     b=col(a, fieldA),
>> >>> >> >     c=col(a, fieldB),
>> >>> >> >     d=regress(b, c))
>> >>> >> >
>> >>> >> >
>> >>> >> > The expression above takes a random sample of 15000 results from
>> >>> >> > collection1. The result set will include fieldA and fieldB in each
>> >>> >> record.
>> >>> >> > The result set is stored in variable "a".
>> >>> >> >
>> >>> >> > Then the "col" function creates arrays of numbers from the results
>> >>> >> stored
>> >>> >> > in variable a. The values in fieldA are stored in the variable
>> "b".
>> >>> The
>> >>> >> > values in fieldB are stored in variable "c".
>> >>> >> >
>> >>> >> > Then the regress function performs a simple linear regression on
>> >>> arrays
>> >>> >> > stored in variables "b" and "c".
>> >>> >> >
>> >>> >> > The output of the regress function is a map containing the
>> >>> regression
>> >>> >> > result. This result includes RSquared and other attributes of the
>> >>> >> > regression model such as R (correlation), slope, y intercept
>> etc...
>> >>> >> >
>> >>> >> >
>> >>> >> >
>> >>> >> >
>> >>> >> >
>> >>> >> >
>> >>> >> >
>> >>> >> >
>> >>> >> >
>> >>> >> > Joel Bernstein
>> >>> >> > http://joelsolr.blogspot.com/
>> >>> >> >
>> >>> >> > On Fri, Feb 23, 2018 at 3:10 PM, John Smith <localdevjs@gmail.com
>> >
>> >>> >> wrote:
>> >>> >> >
>> >>> >> > > Hi Joel, thanks for the answer. I'm not really a stats guy, but
>> >>> the
>> >>> >> end
>> >>> >> > > result of all this is supposed to be obtaining R^2. Is there no
>> >>> way of
>> >>> >> > > obtaining this value, then (short of iterating over all the
>> >>> results in
>> >>> >> > the
>> >>> >> > > hitlist and calculating it myself)?
>> >>> >> > >
>> >>> >> > > On Fri, Feb 23, 2018 at 12:26 PM, Joel Bernstein <
>> >>> joelsolr@gmail.com>
>> >>> >> > > wrote:
>> >>> >> > >
>> >>> >> > > > Typically SSE is the sum of the squared errors of the
>> >>> prediction in
>> >>> >> a
>> >>> >> > > > regression analysis. The stats component doesn't perform
>> >>> regression,
>> >>> >> > > > although it might be a nice feature.
>> >>> >> > > >
>> >>> >> > > >
>> >>> >> > > >
>> >>> >> > > > Joel Bernstein
>> >>> >> > > > http://joelsolr.blogspot.com/
>> >>> >> > > >
>> >>> >> > > > On Fri, Feb 23, 2018 at 12:17 PM, John Smith <
>> >>> localdevjs@gmail.com>
>> >>> >> > > wrote:
>> >>> >> > > >
>> >>> >> > > > > I'm using solr, and enabling stats as per this page:
>> >>> >> > > > > https://lucene.apache.org/solr/guide/6_6/the-stats-
>> component
>> >>> .html
>> >>> >> > > > >
>> >>> >> > > > > I want to get more stat values though. Specifically I'm
>> >>> looking
>> >>> >> for
>> >>> >> > > > > r-squared (coefficient of determination). This value is not
>> >>> >> present
>> >>> >> > in
>> >>> >> > > > > solr, however some of the pieces used to calculate r^2 are
>> in
>> >>> the
>> >>> >> > stats
>> >>> >> > > > > element, for example:
>> >>> >> > > > >
>> >>> >> > > > > <double name="min">0.0</double>
>> >>> >> > > > > <double name="max">10.0</double>
>> >>> >> > > > > <long name="count">15</long>
>> >>> >> > > > > <long name="missing">17</long>
>> >>> >> > > > > <double name="sum">85.0</double>
>> >>> >> > > > > <double name="sumOfSquares">603.0</double>
>> >>> >> > > > > <double name="mean">5.666666666666667</double>
>> >>> >> > > > > <double name="stddev">2.943920288775949</double>
>> >>> >> > > > >
>> >>> >> > > > >
>> >>> >> > > > > So I have the sumOfSquares available (SST), and using this
>> >>> >> > > calculation, I
>> >>> >> > > > > can get R^2:
>> >>> >> > > > >
>> >>> >> > > > > R^2 = 1 - SSE/SST
>> >>> >> > > > >
>> >>> >> > > > > All I need then is SSE. Is there anyway I can get SSE from
>> >>> those
>> >>> >> > other
>> >>> >> > > > > stats in solr?
>> >>> >> > > > >
>> >>> >> > > > > Thanks in advance!
>> >>> >> > > > >
>> >>> >> > > >
>> >>> >> > >
>> >>> >> >
>> >>> >>
>> >>> >
>> >>> >
>> >>>
>> >>
>> >>
>> >
>>

Re: statistics in hitlist

Posted by John Smith <lo...@gmail.com>.
Hi Joel, I did some more work on this statistics stuff today. Yes, we do
have nulls in our data. Each document contains many fields and we don't
always have a value for every one of them, but we can't set the nulls to 0
(or any other value, really) because that would skew other calculations,
such as averages. When we calculate stats manually ourselves, we normally
just ignore fields with null values.

Adding a check to the "q" parameter to ensure that the fields used in the
calculations are > 0 does work now. Thanks for the tip (and sorry, I should
have caught that myself). But I am unable to use "fq" for these checks;
they have to be added to the q instead. Adding fq's doesn't have any effect.
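
To make the "just ignore nulls" rule concrete, here is a minimal Python
sketch of how two columns could be built while skipping any document that
is missing either field, so the columns stay aligned row by row. The helper
name paired_columns and the sample documents are hypothetical; only the
field names come from this thread.

```python
def paired_columns(docs, field_x, field_y):
    """Return two aligned lists, dropping docs where either field is null."""
    xs, ys = [], []
    for doc in docs:
        x = doc.get(field_x)
        y = doc.get(field_y)
        # Skip the whole row so both columns keep the same length and order.
        if x is None or y is None:
            continue
        xs.append(float(x))
        ys.append(float(y))
    return xs, ys

# Hypothetical documents; the second and third have a missing value.
docs = [
    {"id": "1", "oil_first_90_days_production": 120,
     "oil_last_30_days_production": 30},
    {"id": "2", "oil_first_90_days_production": None,
     "oil_last_30_days_production": 12},
    {"id": "3", "oil_first_90_days_production": 80},
    {"id": "4", "oil_first_90_days_production": 200,
     "oil_last_30_days_production": 55},
]

b, c = paired_columns(docs, "oil_first_90_days_production",
                      "oil_last_30_days_production")
print(b, c)
```

Dropping the row, rather than substituting 0, leaves averages and other
stats over the remaining values untouched.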


Anyway, I'm trying to change this up a little. This is what I'm currently
using (switched from "random" to "search" since I actually need the full
hitlist not just a random subset):

let(a=search(tx_prod_production,
             q="oil_first_90_days_production:[1 TO *]",
             fq="isParent:true",
             rows="1500000",
             fl="id,oil_first_90_days_production,oil_last_30_days_production",
             sort="id asc"),
     b=col(a, oil_first_90_days_production),
     c=col(a, oil_last_30_days_production),
     d=regress(b, c))

So I have 2 fields defined there, and that works great (in terms of a test
and running the query). But I need to replace the second field,
"oil_last_30_days_production", with the avg value of
oil_first_90_days_production.

I can get the avg with this expression:
stats(tx_prod_production, q="oil_first_90_days_production:[1 TO *]",
fq="isParent:true", rows="1500000", avg(oil_first_90_days_production))

But I don't know how to push that avg value into the first streaming
expression. I'm guessing I have to set "c=....", but that is where I'm
getting lost, since avg only returns 1 value while the first parameter, "b",
is a list of sorts. Somehow I have to get the avg value stuffed inside a
"col", where it is the same value for every row in the hitlist...?

Thanks for your help!
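
For reference, here is a rough Python sketch (not a streaming expression) of
what a constant avg column would mean numerically. It assumes "c" is used as
the prediction for every row in R^2 = 1 - SSE/SST. The sample values and the
names r_squared and sst_from_stats are made up for illustration. It also
assumes the stats component's sumOfSquares is the raw sum of x^2, which is
consistent with the stddev in the stats output from the start of this
thread, so SST would be sumOfSquares - sum^2/count:

```python
def r_squared(actual, predicted):
    """R^2 = 1 - SSE/SST for a list of actual values and their predictions."""
    mean = sum(actual) / len(actual)
    # Total sum of squares: squared distance of each value from the mean.
    sst = sum((y - mean) ** 2 for y in actual)
    # Sum of squared errors: squared distance of each value from its prediction.
    sse = sum((y - p) ** 2 for y, p in zip(actual, predicted))
    return 1 - sse / sst


def sst_from_stats(sum_of_squares, total, count):
    """SST from stats-component style values: sum(x^2) - (sum(x))^2 / n."""
    return sum_of_squares - total * total / count


# Made-up values standing in for col(a, oil_first_90_days_production).
b = [120.0, 80.0, 100.0, 140.0, 60.0]
avg = sum(b) / len(b)
c = [avg] * len(b)  # the avg repeated once per row in the hitlist

# With the mean as the prediction for every row, SSE equals SST.
print(r_squared(b, c))
# sumOfSquares, sum and count taken from the stats output in this thread.
print(sst_from_stats(603.0, 85.0, 15))
```

With the avg as the prediction for every row, SSE equals SST, so R^2 comes
out as 0 whatever the data is. A constant column also has no variance, so a
simple linear regression against it has nothing to fit.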


On Mon, Mar 5, 2018 at 10:50 PM, Joel Bernstein <jo...@gmail.com> wrote:

> I suspect you've got nulls in your data. I just tested with null values and
> got the same error. For testing purposes try loading the data with default
> values of zero.
>
>
> Joel Bernstein
> http://joelsolr.blogspot.com/
>
> On Mon, Mar 5, 2018 at 10:12 PM, Joel Bernstein <jo...@gmail.com>
> wrote:
>
> > Let's break the expression down and build it up slowly. Let's start with:
> >
> > let(echo="true",
> >      a=random(tx_prod_production, q="*:*", fq="isParent:true", rows="15",
> > fl="oil_first_90_days_production,oil_last_30_days_production"),
> >      b=col(a, oil_first_90_days_production))
> >
> >
> > This should return variables a and b. Let's see what the data looks like.
> > I changed the rows from 15 to 15000. If it all looks good we can expand
> the
> > rows and continue adding functions.
> >
> >
> >
> >
> > Joel Bernstein
> > http://joelsolr.blogspot.com/
> >
> > On Mon, Mar 5, 2018 at 4:11 PM, John Smith <lo...@gmail.com> wrote:
> >
> >> Thanks Joel for your help on this.
> >>
> >> What I've done so far:
> >> - unzip downloaded solr-7.2
> >> - modify the _default "managed-schema" to add the random field type and
> >> the dynamic random field
> >> - start solr7 using "solr start -c"
> >> - indexed my data using pint/pdouble/boolean field types etc
> >>
> >> I can now run the random function all by itself, it returns random
> >> results as expected. So far so good!
> >>
> >> However... now trying to get the regression stuff working:
> >>
> >> let(a=random(tx_prod_production, q="*:*", fq="isParent:true",
> >> rows="15000", fl="oil_first_90_days_producti
> >> on,oil_last_30_days_production"),
> >>     b=col(a, oil_first_90_days_production),
> >>     c=col(a, oil_last_30_days_production),
> >>     d=regress(b, c))
> >>
> >> Posted directly into solr admin UI. Run the streaming expression and I
> >> get this error message:
> >> "EXCEPTION": "Failed to evaluate expression regress(b,c) - Numeric value
> >> expected but found type java.lang.String for value
> >> oil_first_90_days_production"
> >>
> >> It thinks my numeric field is defined as a string? But when I view the
> >> schema, those 2 fields are defined as ints:
> >>
> >>
> >> When I run a normal query and choose xml as output format, then it also
> >> puts "int" elements into the hitlist, so the schema appears to be
> correct
> >> it's just when using this regress function that something goes wrong and
> >> solr thinks the field is string.
> >>
> >> Any suggestions?
> >> Thanks!
> >> ​
> >>
> >>
> >> On Thu, Mar 1, 2018 at 9:12 PM, Joel Bernstein <jo...@gmail.com>
> >> wrote:
> >>
> >>> The field type will also need to be in the schema:
> >>>
> >>>  <!-- The "RandomSortField" is not used to store or search any
> >>>
> >>>          data.  You can declare fields of this type it in your schema
> >>>
> >>>          to generate pseudo-random orderings of your docs for sorting
> >>>
> >>>          or function purposes.  The ordering is generated based on the
> >>> field
> >>>
> >>>          name and the version of the index. As long as the index
> version
> >>>
> >>>          remains unchanged, and the same field name is reused,
> >>>
> >>>          the ordering of the docs will be consistent.
> >>>
> >>>          If you want different psuedo-random orderings of documents,
> >>>
> >>>          for the same version of the index, use a dynamicField and
> >>>
> >>>          change the field name in the request.
> >>>
> >>>      -->
> >>>
> >>> <fieldType name="random" class="solr.RandomSortField" indexed="true" />
> >>>
> >>>
> >>> Joel Bernstein
> >>> http://joelsolr.blogspot.com/
> >>>
> >>> On Thu, Mar 1, 2018 at 8:00 PM, Joel Bernstein <jo...@gmail.com>
> >>> wrote:
> >>>
> >>> > You'll need to have this field in your schema:
> >>> >
> >>> > <dynamicField name="random_*" type="random" />
> >>> >
> >>> > I'll check to see if the default schema used with solr start -c has
> >>> this
> >>> > field, if not I'll add it. Thanks for pointing this out.
> >>> >
> >>> > I checked and right now the random expression is only accepting one
> fq,
> >>> > but I consider this a bug. It should accept multiple. I'll create
> >>> ticket
> >>> > for getting this fixed.
> >>> >
> >>> >
> >>> >
> >>> > Joel Bernstein
> >>> > http://joelsolr.blogspot.com/
> >>> >
> >>> > On Thu, Mar 1, 2018 at 4:55 PM, John Smith <lo...@gmail.com>
> >>> wrote:
> >>> >
> >>> >> Joel, thanks for the pointers to the streaming feature. I had no
> idea
> >>> solr
> >>> >> had that (and also just discovered the very intersting sql feature!
> I
> >>> will
> >>> >> be sure to investigate that in more detail in the future).
> >>> >>
> >>> >> However I'm having some trouble getting basic streaming functions
> >>> working.
> >>> >> I've already figured out that I had to move to "solr cloud" instead
> of
> >>> >> "solr standalone" because I was getting errors about "cannot find zk
> >>> >> instance" or whatever which went away when using "solr start -c"
> >>> instead.
> >>> >>
> >>> >> But now I'm trying to use the random function since that was one of
> >>> the
> >>> >> functions used in your example.
> >>> >>
> >>> >> random(tx_header, q="*:*", rows="100", fl="countyname")
> >>> >>
> >>> >> I posted that directly in the "stream" section of the solr admin UI.
> >>> This
> >>> >> is all on linux, with solr 7.1.0 and 7.2.1 (tried several versions
> in
> >>> case
> >>> >> it was a bug in one)
> >>> >>
> >>> >> I get back an error message:
> >>> >> *sort param could not be parsed as a query, and is not a field that
> >>> exists
> >>> >> in the index: random_-255009774*
> >>> >>
> >>> >> I'm not passing in any sort field anywhere. But the solr logs show
> >>> these
> >>> >> three log entries:
> >>> >>
> >>> >> 2018-03-01 21:41:18.954 INFO  (qtp257513673-21) [c:tx_header
> s:shard1
> >>> >> r:core_node2 x:tx_header_shard1_replica_n1] o.a.s.c.S.Request
> >>> >> [tx_header_shard1_replica_n1]  webapp=/solr path=/select
> >>> >> params={q=*:*&_stateVer_=tx_header:6&fl=countyname
> >>> >> *&sort=random_-255009774+asc*&rows=100&wt=javabin&version=2}
> >>> status=400
> >>> >> QTime=19
> >>> >>
> >>> >> 2018-03-01 21:41:18.966 ERROR (qtp257513673-17) [c:tx_header
> s:shard1
> >>> >> r:core_node2 x:tx_header_shard1_replica_n1]
> >>> o.a.s.c.s.i.CloudSolrClient
> >>> >> Request to collection [tx_header] failed due to (400)
> >>> >> org.apache.solr.client.solrj.impl.HttpSolrClient$
> RemoteSolrException:
> >>> >> Error
> >>> >> from server at http://192.168.13.31:8983/solr/tx_header: sort param
> >>> could
> >>> >> not be parsed as a query, and is not a field that exists in the
> index:
> >>> >> random_-255009774, retry? 0
> >>> >>
> >>> >> 2018-03-01 21:41:18.968 ERROR (qtp257513673-17) [c:tx_header
> s:shard1
> >>> >> r:core_node2 x:tx_header_shard1_replica_n1]
> >>> o.a.s.c.s.i.s.ExceptionStream
> >>> >> java.io.IOException:
> >>> >> org.apache.solr.client.solrj.impl.HttpSolrClient$
> RemoteSolrException:
> >>> >> Error
> >>> >> from server at http://192.168.13.31:8983/solr/tx_header: sort param
> >>> could
> >>> >> not be parsed as a query, and is not a field that exists in the
> index:
> >>> >> random_-255009774
> >>> >>
> >>> >>
> >>> >> So basically it looks like solr is injecting the "sort=random_"
> stuff
> >>> into
> >>> >> my query and of course that is failing on the search since that
> >>> >> field/column doesn't exist in my schema. Everytime I run the random
> >>> >> function, I get a slightly different field name that it injects, but
> >>> they
> >>> >> all start with "random_" etc.
> >>> >>
> >>> >> I have tried adding my own sort field instead, hoping solr wouldn't
> >>> inject
> >>> >> one for me, but it still injected a random sort fieldname:
> >>> >> random(tx_header, q="*:*", rows="100", fl="countyname",
> >>> sort="countyname
> >>> >> asc")
> >>> >>
> >>> >>
> >>> >> Assuming I can fix that whole problem, my second question is: can I
> >>> add
> >>> >> multiple "fq=" parameters to the random function? I build a pretty
> >>> >> complicated query using many fq= fields, and then want to run some
> >>> stats
> >>> >> on
> >>> >> that hitlist; so somehow I have to pass in the query that made up
> the
> >>> >> exact
> >>> >> hitlist to these various functions, but when I used multiple "fq="
> >>> values
> >>> >> it only seemed to use the last one I specified and just ignored all
> >>> the
> >>> >> previous fq's?
> >>> >>
> >>> >> Thanks in advance for any comments/suggestions...!
> >>> >>
> >>> >>
> >>> >>
> >>> >>
> >>> >> On Fri, Feb 23, 2018 at 5:59 PM, Joel Bernstein <joelsolr@gmail.com
> >
> >>> >> wrote:
> >>> >>
> >>> >> > This is going to be a complex answer because Solr actually now has
> >>> >> multiple
> >>> >> > ways of doing regression analysis as part of the Streaming
> >>> Expression
> >>> >> > statistical programming library. The basic documentation is here:
> >>> >> >
> >>> >> > https://lucene.apache.org/solr/guide/7_2/statistical-program
> >>> ming.html
> >>> >> >
> >>> >> > Here is a sample expression that performs a simple linear
> >>> regression in
> >>> >> > Solr 7.2:
> >>> >> >
> >>> >> > let(a=random(collection1, q="any query", rows="15000", fl="fieldA,
> >>> >> > fieldB"),
> >>> >> >     b=col(a, fieldA),
> >>> >> >     c=col(a, fieldB),
> >>> >> >     d=regress(b, c))
> >>> >> >
> >>> >> >
> >>> >> > The expression above takes a random sample of 15000 results from
> >>> >> > collection1. The result set will include fieldA and fieldB in each
> >>> >> record.
> >>> >> > The result set is stored in variable "a".
> >>> >> >
> >>> >> > Then the "col" function creates arrays of numbers from the results
> >>> >> stored
> >>> >> > in variable a. The values in fieldA are stored in the variable
> "b".
> >>> The
> >>> >> > values in fieldB are stored in variable "c".
> >>> >> >
> >>> >> > Then the regress function performs a simple linear regression on
> >>> arrays
> >>> >> > stored in variables "b" and "c".
> >>> >> >
> >>> >> > The output of the regress function is a map containing the
> >>> regression
> >>> >> > result. This result includes RSquared and other attributes of the
> >>> >> > regression model such as R (correlation), slope, y intercept
> etc...
> >>> >> >
> >>> >> >
> >>> >> >
> >>> >> >
> >>> >> >
> >>> >> >
> >>> >> >
> >>> >> >
> >>> >> >
> >>> >> > Joel Bernstein
> >>> >> > http://joelsolr.blogspot.com/
> >>> >> >
> >>> >> > On Fri, Feb 23, 2018 at 3:10 PM, John Smith <localdevjs@gmail.com
> >
> >>> >> wrote:
> >>> >> >
> >>> >> > > Hi Joel, thanks for the answer. I'm not really a stats guy, but
> >>> the
> >>> >> end
> >>> >> > > result of all this is supposed to be obtaining R^2. Is there no
> >>> way of
> >>> >> > > obtaining this value, then (short of iterating over all the
> >>> results in
> >>> >> > the
> >>> >> > > hitlist and calculating it myself)?
> >>> >> > >
> >>> >> > > On Fri, Feb 23, 2018 at 12:26 PM, Joel Bernstein <
> >>> joelsolr@gmail.com>
> >>> >> > > wrote:
> >>> >> > >
> >>> >> > > > Typically SSE is the sum of the squared errors of the
> >>> prediction in
> >>> >> a
> >>> >> > > > regression analysis. The stats component doesn't perform
> >>> regression,
> >>> >> > > > although it might be a nice feature.
> >>> >> > > >
> >>> >> > > >
> >>> >> > > >
> >>> >> > > > Joel Bernstein
> >>> >> > > > http://joelsolr.blogspot.com/
> >>> >> > > >
> >>> >> > > > On Fri, Feb 23, 2018 at 12:17 PM, John Smith <
> >>> localdevjs@gmail.com>
> >>> >> > > wrote:
> >>> >> > > >
> >>> >> > > > > I'm using solr, and enabling stats as per this page:
> >>> >> > > > > https://lucene.apache.org/solr/guide/6_6/the-stats-
> component
> >>> .html
> >>> >> > > > >
> >>> >> > > > > I want to get more stat values though. Specifically I'm
> >>> looking
> >>> >> for
> >>> >> > > > > r-squared (coefficient of determination). This value is not
> >>> >> present
> >>> >> > in
> >>> >> > > > > solr, however some of the pieces used to calculate r^2 are
> in
> >>> the
> >>> >> > stats
> >>> >> > > > > element, for example:
> >>> >> > > > >
> >>> >> > > > > <double name="min">0.0</double>
> >>> >> > > > > <double name="max">10.0</double>
> >>> >> > > > > <long name="count">15</long>
> >>> >> > > > > <long name="missing">17</long>
> >>> >> > > > > <double name="sum">85.0</double>
> >>> >> > > > > <double name="sumOfSquares">603.0</double>
> >>> >> > > > > <double name="mean">5.666666666666667</double>
> >>> >> > > > > <double name="stddev">2.943920288775949</double>
> >>> >> > > > >
> >>> >> > > > >
> >>> >> > > > > So I have the sumOfSquares available (SST), and using this
> >>> >> > > calculation, I
> >>> >> > > > > can get R^2:
> >>> >> > > > >
> >>> >> > > > > R^2 = 1 - SSE/SST
> >>> >> > > > >
> >>> >> > > > > All I need then is SSE. Is there anyway I can get SSE from
> >>> those
> >>> >> > other
> >>> >> > > > > stats in solr?
> >>> >> > > > >
> >>> >> > > > > Thanks in advance!
> >>> >> > > > >
> >>> >> > > >
> >>> >> > >
> >>> >> >
> >>> >>
> >>> >
> >>> >
> >>>
> >>
> >>
> >
>

Re: statistics in hitlist

Posted by Joel Bernstein <jo...@gmail.com>.
I suspect you've got nulls in your data. I just tested with null values and
got the same error. For testing purposes try loading the data with default
values of zero.
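
An alternative to zero defaults, sketched in Python outside of Solr: skip any
record where either field is null before regressing. The record layout, the
field names "x" and "y", and the helper names are assumptions for
illustration, and the least-squares arithmetic is the textbook formula
rather than Solr's implementation.

```python
def drop_nulls(records, x_field, y_field):
    """Return paired x/y lists, skipping records where either value is missing."""
    xs, ys = [], []
    for rec in records:
        x, y = rec.get(x_field), rec.get(y_field)
        if x is None or y is None:
            continue  # ignore the whole record, as one would when doing stats by hand
        xs.append(float(x))
        ys.append(float(y))
    return xs, ys


def simple_regression(xs, ys):
    """Ordinary least squares for y = slope * x + intercept, plus R^2."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return {
        "slope": slope,
        "intercept": mean_y - slope * mean_x,
        "RSquared": (sxy * sxy) / (sxx * syy),
    }


# Made-up records; two of them have a null in one of the fields.
records = [
    {"x": 1, "y": 2},
    {"x": 2, "y": 4},
    {"x": None, "y": 5},
    {"x": 3, "y": 6},
    {"x": 4},
    {"x": 4, "y": 8},
]
xs, ys = drop_nulls(records, "x", "y")
print(simple_regression(xs, ys))
```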


Joel Bernstein
http://joelsolr.blogspot.com/

On Mon, Mar 5, 2018 at 10:12 PM, Joel Bernstein <jo...@gmail.com> wrote:

> Let's break the expression down and build it up slowly. Let's start with:
>
> let(echo="true",
>      a=random(tx_prod_production, q="*:*", fq="isParent:true", rows="15",
> fl="oil_first_90_days_production,oil_last_30_days_production"),
>      b=col(a, oil_first_90_days_production))
>
>
> This should return variables a and b. Let's see what the data looks like.
> I changed the rows from 15 to 15000. If it all looks good we can expand the
> rows and continue adding functions.
>
>
>
>
> Joel Bernstein
> http://joelsolr.blogspot.com/
>
> On Mon, Mar 5, 2018 at 4:11 PM, John Smith <lo...@gmail.com> wrote:
>
>> Thanks Joel for your help on this.
>>
>> What I've done so far:
>> - unzip downloaded solr-7.2
>> - modify the _default "managed-schema" to add the random field type and
>> the dynamic random field
>> - start solr7 using "solr start -c"
>> - indexed my data using pint/pdouble/boolean field types etc
>>
>> I can now run the random function all by itself, it returns random
>> results as expected. So far so good!
>>
>> However... now trying to get the regression stuff working:
>>
>> let(a=random(tx_prod_production, q="*:*", fq="isParent:true",
>> rows="15000", fl="oil_first_90_days_producti
>> on,oil_last_30_days_production"),
>>     b=col(a, oil_first_90_days_production),
>>     c=col(a, oil_last_30_days_production),
>>     d=regress(b, c))
>>
>> Posted directly into solr admin UI. Run the streaming expression and I
>> get this error message:
>> "EXCEPTION": "Failed to evaluate expression regress(b,c) - Numeric value
>> expected but found type java.lang.String for value
>> oil_first_90_days_production"
>>
>> It thinks my numeric field is defined as a string? But when I view the
>> schema, those 2 fields are defined as ints:
>>
>>
>> When I run a normal query and choose xml as output format, then it also
>> puts "int" elements into the hitlist, so the schema appears to be correct
>> it's just when using this regress function that something goes wrong and
>> solr thinks the field is string.
>>
>> Any suggestions?
>> Thanks!
>> ​
>>
>>
>> On Thu, Mar 1, 2018 at 9:12 PM, Joel Bernstein <jo...@gmail.com>
>> wrote:
>>
>>> The field type will also need to be in the schema:
>>>
>>>  <!-- The "RandomSortField" is not used to store or search any
>>>
>>>          data.  You can declare fields of this type it in your schema
>>>
>>>          to generate pseudo-random orderings of your docs for sorting
>>>
>>>          or function purposes.  The ordering is generated based on the
>>> field
>>>
>>>          name and the version of the index. As long as the index version
>>>
>>>          remains unchanged, and the same field name is reused,
>>>
>>>          the ordering of the docs will be consistent.
>>>
>>>          If you want different psuedo-random orderings of documents,
>>>
>>>          for the same version of the index, use a dynamicField and
>>>
>>>          change the field name in the request.
>>>
>>>      -->
>>>
>>> <fieldType name="random" class="solr.RandomSortField" indexed="true" />
>>>
>>>
>>> Joel Bernstein
>>> http://joelsolr.blogspot.com/
>>>
>>> On Thu, Mar 1, 2018 at 8:00 PM, Joel Bernstein <jo...@gmail.com>
>>> wrote:
>>>
>>> > You'll need to have this field in your schema:
>>> >
>>> > <dynamicField name="random_*" type="random" />
>>> >
>>> > I'll check to see if the default schema used with solr start -c has
>>> this
>>> > field, if not I'll add it. Thanks for pointing this out.
>>> >
>>> > I checked and right now the random expression is only accepting one fq,
>>> > but I consider this a bug. It should accept multiple. I'll create
>>> ticket
>>> > for getting this fixed.
>>> >
>>> >
>>> >
>>> > Joel Bernstein
>>> > http://joelsolr.blogspot.com/
>>> >
>>> > On Thu, Mar 1, 2018 at 4:55 PM, John Smith <lo...@gmail.com>
>>> wrote:
>>> >
>>> >> Joel, thanks for the pointers to the streaming feature. I had no idea
>>> solr
>>> >> had that (and also just discovered the very intersting sql feature! I
>>> will
>>> >> be sure to investigate that in more detail in the future).
>>> >>
>>> >> However I'm having some trouble getting basic streaming functions
>>> working.
>>> >> I've already figured out that I had to move to "solr cloud" instead of
>>> >> "solr standalone" because I was getting errors about "cannot find zk
>>> >> instance" or whatever which went away when using "solr start -c"
>>> instead.
>>> >>
>>> >> But now I'm trying to use the random function since that was one of
>>> the
>>> >> functions used in your example.
>>> >>
>>> >> random(tx_header, q="*:*", rows="100", fl="countyname")
>>> >>
>>> >> I posted that directly in the "stream" section of the solr admin UI.
>>> This
>>> >> is all on linux, with solr 7.1.0 and 7.2.1 (tried several versions in
>>> case
>>> >> it was a bug in one)
>>> >>
>>> >> I get back an error message:
>>> >> *sort param could not be parsed as a query, and is not a field that
>>> exists
>>> >> in the index: random_-255009774*
>>> >>
>>> >> I'm not passing in any sort field anywhere. But the solr logs show
>>> these
>>> >> three log entries:
>>> >>
>>> >> 2018-03-01 21:41:18.954 INFO  (qtp257513673-21) [c:tx_header s:shard1
>>> >> r:core_node2 x:tx_header_shard1_replica_n1] o.a.s.c.S.Request
>>> >> [tx_header_shard1_replica_n1]  webapp=/solr path=/select
>>> >> params={q=*:*&_stateVer_=tx_header:6&fl=countyname
>>> >> *&sort=random_-255009774+asc*&rows=100&wt=javabin&version=2}
>>> status=400
>>> >> QTime=19
>>> >>
>>> >> 2018-03-01 21:41:18.966 ERROR (qtp257513673-17) [c:tx_header s:shard1
>>> >> r:core_node2 x:tx_header_shard1_replica_n1]
>>> o.a.s.c.s.i.CloudSolrClient
>>> >> Request to collection [tx_header] failed due to (400)
>>> >> org.apache.solr.client.solrj.impl.HttpSolrClient$RemoteSolrException:
>>> >> Error
>>> >> from server at http://192.168.13.31:8983/solr/tx_header: sort param
>>> could
>>> >> not be parsed as a query, and is not a field that exists in the index:
>>> >> random_-255009774, retry? 0
>>> >>
>>> >> 2018-03-01 21:41:18.968 ERROR (qtp257513673-17) [c:tx_header s:shard1
>>> >> r:core_node2 x:tx_header_shard1_replica_n1]
>>> o.a.s.c.s.i.s.ExceptionStream
>>> >> java.io.IOException:
>>> >> org.apache.solr.client.solrj.impl.HttpSolrClient$RemoteSolrException:
>>> >> Error
>>> >> from server at http://192.168.13.31:8983/solr/tx_header: sort param
>>> could
>>> >> not be parsed as a query, and is not a field that exists in the index:
>>> >> random_-255009774
>>> >>
>>> >>
>>> >> So basically it looks like solr is injecting the "sort=random_" stuff
>>> into
>>> >> my query and of course that is failing on the search since that
>>> >> field/column doesn't exist in my schema. Everytime I run the random
>>> >> function, I get a slightly different field name that it injects, but
>>> they
>>> >> all start with "random_" etc.
>>> >>
>>> >> I have tried adding my own sort field instead, hoping solr wouldn't
>>> inject
>>> >> one for me, but it still injected a random sort fieldname:
>>> >> random(tx_header, q="*:*", rows="100", fl="countyname",
>>> sort="countyname
>>> >> asc")
>>> >>
>>> >>
>>> >> Assuming I can fix that whole problem, my second question is: can I
>>> add
>>> >> multiple "fq=" parameters to the random function? I build a pretty
>>> >> complicated query using many fq= fields, and then want to run some
>>> stats
>>> >> on
>>> >> that hitlist; so somehow I have to pass in the query that made up the
>>> >> exact
>>> >> hitlist to these various functions, but when I used multiple "fq="
>>> values
>>> >> it only seemed to use the last one I specified and just ignored all
>>> the
>>> >> previous fq's?
>>> >>
>>> >> Thanks in advance for any comments/suggestions...!
>>> >>
>>> >>
>>> >>
>>> >>
>>> >> On Fri, Feb 23, 2018 at 5:59 PM, Joel Bernstein <jo...@gmail.com>
>>> >> wrote:
>>> >>
>>> >> > This is going to be a complex answer because Solr actually now has
>>> >> multiple
>>> >> > ways of doing regression analysis as part of the Streaming
>>> Expression
>>> >> > statistical programming library. The basic documentation is here:
>>> >> >
>>> >> > https://lucene.apache.org/solr/guide/7_2/statistical-program
>>> ming.html
>>> >> >
>>> >> > Here is a sample expression that performs a simple linear
>>> regression in
>>> >> > Solr 7.2:
>>> >> >
>>> >> > let(a=random(collection1, q="any query", rows="15000", fl="fieldA,
>>> >> > fieldB"),
>>> >> >     b=col(a, fieldA),
>>> >> >     c=col(a, fieldB),
>>> >> >     d=regress(b, c))
>>> >> >
>>> >> >
>>> >> > The expression above takes a random sample of 15000 results from
>>> >> > collection1. The result set will include fieldA and fieldB in each
>>> >> record.
>>> >> > The result set is stored in variable "a".
>>> >> >
>>> >> > Then the "col" function creates arrays of numbers from the results
>>> >> stored
>>> >> > in variable a. The values in fieldA are stored in the variable "b".
>>> The
>>> >> > values in fieldB are stored in variable "c".
>>> >> >
>>> >> > Then the regress function performs a simple linear regression on
>>> arrays
>>> >> > stored in variables "b" and "c".
>>> >> >
>>> >> > The output of the regress function is a map containing the
>>> regression
>>> >> > result. This result includes RSquared and other attributes of the
>>> >> > regression model such as R (correlation), slope, y intercept etc...
>>> >> >
>>> >> >
>>> >> >
>>> >> >
>>> >> >
>>> >> >
>>> >> >
>>> >> >
>>> >> >
>>> >> > Joel Bernstein
>>> >> > http://joelsolr.blogspot.com/
>>> >> >
>>> >> > On Fri, Feb 23, 2018 at 3:10 PM, John Smith <lo...@gmail.com>
>>> >> wrote:
>>> >> >
>>> >> > > Hi Joel, thanks for the answer. I'm not really a stats guy, but
>>> the
>>> >> end
>>> >> > > result of all this is supposed to be obtaining R^2. Is there no
>>> way of
>>> >> > > obtaining this value, then (short of iterating over all the
>>> results in
>>> >> > the
>>> >> > > hitlist and calculating it myself)?
>>> >> > >
>>> >> > > On Fri, Feb 23, 2018 at 12:26 PM, Joel Bernstein <
>>> joelsolr@gmail.com>
>>> >> > > wrote:
>>> >> > >
>>> >> > > > Typically SSE is the sum of the squared errors of the
>>> prediction in
>>> >> a
>>> >> > > > regression analysis. The stats component doesn't perform
>>> regression,
>>> >> > > > although it might be a nice feature.
>>> >> > > >
>>> >> > > >
>>> >> > > >
>>> >> > > > Joel Bernstein
>>> >> > > > http://joelsolr.blogspot.com/
>>> >> > > >
>>> >> > > > On Fri, Feb 23, 2018 at 12:17 PM, John Smith <
>>> localdevjs@gmail.com>
>>> >> > > wrote:
>>> >> > > >
>>> >> > > > > I'm using solr, and enabling stats as per this page:
>>> >> > > > > https://lucene.apache.org/solr/guide/6_6/the-stats-component
>>> .html
>>> >> > > > >
>>> >> > > > > I want to get more stat values though. Specifically I'm
>>> looking
>>> >> for
>>> >> > > > > r-squared (coefficient of determination). This value is not
>>> >> present
>>> >> > in
>>> >> > > > > solr, however some of the pieces used to calculate r^2 are in
>>> the
>>> >> > stats
>>> >> > > > > element, for example:
>>> >> > > > >
>>> >> > > > > <double name="min">0.0</double>
>>> >> > > > > <double name="max">10.0</double>
>>> >> > > > > <long name="count">15</long>
>>> >> > > > > <long name="missing">17</long>
>>> >> > > > > <double name="sum">85.0</double>
>>> >> > > > > <double name="sumOfSquares">603.0</double>
>>> >> > > > > <double name="mean">5.666666666666667</double>
>>> >> > > > > <double name="stddev">2.943920288775949</double>
>>> >> > > > >
>>> >> > > > >
>>> >> > > > > So I have the sumOfSquares available (SST), and using this
>>> >> > > calculation, I
>>> >> > > > > can get R^2:
>>> >> > > > >
>>> >> > > > > R^2 = 1 - SSE/SST
>>> >> > > > >
>>> >> > > > > All I need then is SSE. Is there anyway I can get SSE from
>>> those
>>> >> > other
>>> >> > > > > stats in solr?
>>> >> > > > >
>>> >> > > > > Thanks in advance!
>>> >> > > > >
>>> >> > > >
>>> >> > >
>>> >> >
>>> >>
>>> >
>>> >
>>>
>>
>>
>

Re: statistics in hitlist

Posted by Joel Bernstein <jo...@gmail.com>.
Let's break the expression down and build it up slowly. Let's start with:

let(echo="true",
     a=random(tx_prod_production, q="*:*", fq="isParent:true", rows="15",
              fl="oil_first_90_days_production,oil_last_30_days_production"),
     b=col(a, oil_first_90_days_production))


This should return variables a and b. Let's see what the data looks like. I
changed the rows from 15000 to 15. If it all looks good we can expand the
rows and continue adding functions.
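
If it helps to run this outside the admin UI, here is a minimal Python sketch
that builds a POST to the collection's /stream handler with the expression
in the "expr" parameter. The base URL (localhost:8983) and the helper name
build_stream_request are assumptions; the request is only constructed here,
not sent.

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_stream_request(base_url, collection, expr):
    """Build (but do not send) a POST to <base_url>/<collection>/stream."""
    url = f"{base_url.rstrip('/')}/{collection}/stream"
    # The streaming expression travels form-encoded in the "expr" parameter.
    body = urlencode({"expr": expr}).encode("utf-8")
    return Request(
        url,
        data=body,
        headers={"Content-Type": "application/x-www-form-urlencoded"},
    )


# The small test expression from above, with rows kept at 15.
EXPR = (
    'let(echo="true", '
    'a=random(tx_prod_production, q="*:*", fq="isParent:true", rows="15", '
    'fl="oil_first_90_days_production,oil_last_30_days_production"), '
    'b=col(a, oil_first_90_days_production))'
)

# Assumed base URL; substitute the address of your own Solr node.
req = build_stream_request("http://localhost:8983/solr", "tx_prod_production", EXPR)
print(req.get_method(), req.full_url)
```

Passing the request to urllib.request.urlopen would send it and return the
JSON result.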




Joel Bernstein
http://joelsolr.blogspot.com/

On Mon, Mar 5, 2018 at 4:11 PM, John Smith <lo...@gmail.com> wrote:

> Thanks Joel for your help on this.
>
> What I've done so far:
> - unzip downloaded solr-7.2
> - modify the _default "managed-schema" to add the random field type and
> the dynamic random field
> - start solr7 using "solr start -c"
> - indexed my data using pint/pdouble/boolean field types etc
>
> I can now run the random function all by itself, it returns random results
> as expected. So far so good!
>
> However... now trying to get the regression stuff working:
>
> let(a=random(tx_prod_production, q="*:*", fq="isParent:true",
> rows="15000", fl="oil_first_90_days_production,oil_last_30_days_
> production"),
>     b=col(a, oil_first_90_days_production),
>     c=col(a, oil_last_30_days_production),
>     d=regress(b, c))
>
> Posted directly into solr admin UI. Run the streaming expression and I get
> this error message:
> "EXCEPTION": "Failed to evaluate expression regress(b,c) - Numeric value
> expected but found type java.lang.String for value
> oil_first_90_days_production"
>
> It thinks my numeric field is defined as a string? But when I view the
> schema, those 2 fields are defined as ints:
>
>
> When I run a normal query and choose xml as output format, then it also
> puts "int" elements into the hitlist, so the schema appears to be correct
> it's just when using this regress function that something goes wrong and
> solr thinks the field is string.
>
> Any suggestions?
> Thanks!
> ​
>
>
> On Thu, Mar 1, 2018 at 9:12 PM, Joel Bernstein <jo...@gmail.com> wrote:
>
>> The field type will also need to be in the schema:
>>
>>  <!-- The "RandomSortField" is not used to store or search any
>>
>>          data.  You can declare fields of this type it in your schema
>>
>>          to generate pseudo-random orderings of your docs for sorting
>>
>>          or function purposes.  The ordering is generated based on the
>> field
>>
>>          name and the version of the index. As long as the index version
>>
>>          remains unchanged, and the same field name is reused,
>>
>>          the ordering of the docs will be consistent.
>>
>>          If you want different psuedo-random orderings of documents,
>>
>>          for the same version of the index, use a dynamicField and
>>
>>          change the field name in the request.
>>
>>      -->
>>
>> <fieldType name="random" class="solr.RandomSortField" indexed="true" />
>>
>>
>> Joel Bernstein
>> http://joelsolr.blogspot.com/
>>
>> On Thu, Mar 1, 2018 at 8:00 PM, Joel Bernstein <jo...@gmail.com>
>> wrote:
>>
>> > You'll need to have this field in your schema:
>> >
>> > <dynamicField name="random_*" type="random" />
>> >
>> > I'll check to see if the default schema used with solr start -c has this
>> > field, if not I'll add it. Thanks for pointing this out.
>> >
>> > I checked and right now the random expression is only accepting one fq,
>> > but I consider this a bug. It should accept multiple. I'll create ticket
>> > for getting this fixed.
>> >
>> >
>> >
>> > Joel Bernstein
>> > http://joelsolr.blogspot.com/
>> >
>> > On Thu, Mar 1, 2018 at 4:55 PM, John Smith <lo...@gmail.com>
>> wrote:
>> >
>> >> Joel, thanks for the pointers to the streaming feature. I had no idea
>> solr
>> >> had that (and also just discovered the very intersting sql feature! I
>> will
>> >> be sure to investigate that in more detail in the future).
>> >>

Re: statistics in hitlist

Posted by John Smith <lo...@gmail.com>.
Thanks Joel for your help on this.

What I've done so far:
- unzipped the downloaded solr-7.2 archive
- modified the _default "managed-schema" to add the random field type and the
dynamic random field
- started Solr 7 using "solr start -c"
- indexed my data using pint/pdouble/boolean field types, etc.

I can now run the random function by itself, and it returns random results
as expected. So far so good!

However... now trying to get the regression stuff working:

let(a=random(tx_prod_production, q="*:*", fq="isParent:true", rows="15000",
fl="oil_first_90_days_production,oil_last_30_days_production"),
    b=col(a, oil_first_90_days_production),
    c=col(a, oil_last_30_days_production),
    d=regress(b, c))

I posted this directly into the Solr admin UI. When I run the streaming
expression, I get this error message:
"EXCEPTION": "Failed to evaluate expression regress(b,c) - Numeric value
expected but found type java.lang.String for value
oil_first_90_days_production"

It thinks my numeric field is defined as a string? But when I view the
schema, those 2 fields are defined as ints:


When I run a normal query and choose XML as the output format, the hitlist
also contains "int" elements, so the schema appears to be correct. It's only
when using the regress function that something goes wrong and Solr thinks
the field is a string.

Any suggestions?
Thanks!
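
For readers following along, the quantity that regress reports as RSquared can be sketched outside Solr. The snippet below is a minimal plain-Python illustration of simple linear regression and of R^2 = 1 - SSE/SST. It is not Solr's implementation, and the function names (linear_regression, sst_from_stats) and the dictionary keys are hypothetical. It also illustrates that, judging from the numbers quoted in this thread, the stats component's sumOfSquares is the raw sum of squared values rather than SST; SST is the sum of squared deviations from the mean, which works out to sumOfSquares - sum^2/count.

```python
import math


def linear_regression(xs, ys):
    """Ordinary least squares fit of ys on xs (illustrative helper, not Solr code)."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length arrays with at least 2 points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of squared deviations and cross-deviations from the means.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sst = sum((y - mean_y) ** 2 for y in ys)  # total sum of squares
    if sxx == 0 or sst == 0:
        raise ValueError("both variables must actually vary")
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # SSE: squared errors of the fitted line's predictions.
    sse = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    r_squared = 1 - sse / sst
    # For simple linear regression, R has the sign of the slope.
    r = math.copysign(math.sqrt(max(r_squared, 0.0)), slope)
    return {"slope": slope, "intercept": intercept, "r": r, "r_squared": r_squared}


def sst_from_stats(count, total, sum_of_squares):
    """Sum of squared deviations from the mean, from count, sum and sum of x^2."""
    return sum_of_squares - total * total / count
```

With the stats quoted at the start of the thread (count=15, sum=85.0, sumOfSquares=603.0), sst_from_stats gives about 121.33, and sqrt(121.33/14) reproduces the reported stddev of 2.9439. SSE still needs a second variable and a fitted line, which is why it cannot be recovered from single-field stats.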



Re: statistics in hitlist

Posted by Joel Bernstein <jo...@gmail.com>.
The field type will also need to be in the schema:

<!-- The "RandomSortField" is not used to store or search any
     data.  You can declare fields of this type in your schema
     to generate pseudo-random orderings of your docs for sorting
     or function purposes.  The ordering is generated based on the field
     name and the version of the index. As long as the index version
     remains unchanged, and the same field name is reused,
     the ordering of the docs will be consistent.
     If you want different pseudo-random orderings of documents,
     for the same version of the index, use a dynamicField and
     change the field name in the request.
-->

<fieldType name="random" class="solr.RandomSortField" indexed="true" />
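
If editing managed-schema by hand is inconvenient, the same definitions could in principle be added through Solr's Schema API. The sketch below is an illustration under stated assumptions, not something from this thread: the base URL, the collection name passed in, and the helper names are hypothetical, and it has not been run against a live cluster. Only the field type and the random_* dynamic field definitions are taken from the messages here.

```python
import json
import urllib.request


def random_sort_schema_commands():
    """Schema API commands mirroring the two schema entries from this thread."""
    return {
        "add-field-type": {
            "name": "random",
            "class": "solr.RandomSortField",
            "indexed": True,
        },
        "add-dynamic-field": {"name": "random_*", "type": "random"},
    }


def schema_request(base_url, collection, commands):
    """Build (but do not send) a POST to the collection's /schema endpoint."""
    url = f"{base_url.rstrip('/')}/{collection}/schema"
    return urllib.request.Request(
        url,
        data=json.dumps(commands).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def apply_schema_commands(base_url, collection, commands):
    """Send the commands; assumes a reachable Solr node at base_url."""
    with urllib.request.urlopen(schema_request(base_url, collection, commands)) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

A call such as apply_schema_commands("http://localhost:8983/solr", "tx_header", random_sort_schema_commands()) would be the intended usage; the host and port are placeholders.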


Joel Bernstein
http://joelsolr.blogspot.com/


Re: statistics in hitlist

Posted by Joel Bernstein <jo...@gmail.com>.
You'll need to have this field in your schema:

<dynamicField name="random_*" type="random" />

I'll check to see if the default schema used with solr start -c has this
field; if not, I'll add it. Thanks for pointing this out.

I checked and right now the random expression is only accepting one fq, but
I consider this a bug. It should accept multiple. I'll create a ticket for
getting this fixed.
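
Until multiple fq parameters are accepted, one possible workaround is to fold the filters into a single fq joined with AND, which should select the same documents (though the filters are then cached as one entry rather than separately). The helper below is only a sketch: the function names and the countyname value in the usage note are hypothetical, filters containing double quotes are rejected rather than escaped, and purely negative filters or filters using local params would need to be rewritten by hand before being combined this way.

```python
def combine_filters(filters):
    """Join several filter queries into one fq value using AND."""
    cleaned = [f.strip() for f in filters if f and f.strip()]
    if not cleaned:
        raise ValueError("at least one filter query is required")
    if any('"' in f for f in cleaned):
        # Quoting inside the expression string is deliberately not handled here.
        raise ValueError("filters containing double quotes are not supported")
    if len(cleaned) == 1:
        return cleaned[0]
    # Parenthesize each clause so its internal operators keep their meaning.
    return " AND ".join(f"({f})" for f in cleaned)


def random_expression(collection, q, filters, rows, fl):
    """Build a random(...) streaming expression with a single combined fq."""
    fq = combine_filters(filters)
    return f'random({collection}, q="{q}", fq="{fq}", rows="{rows}", fl="{fl}")'
```

For example, combine_filters(["isParent:true", "countyname:HARRIS"]) yields (isParent:true) AND (countyname:HARRIS).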



Joel Bernstein
http://joelsolr.blogspot.com/

On Thu, Mar 1, 2018 at 4:55 PM, John Smith <lo...@gmail.com> wrote:

> Joel, thanks for the pointers to the streaming feature. I had no idea solr
> had that (and also just discovered the very interesting sql feature! I will
> be sure to investigate that in more detail in the future).
>
> However I'm having some trouble getting basic streaming functions working.
> I've already figured out that I had to move to "solr cloud" instead of
> "solr standalone" because I was getting errors about "cannot find zk
> instance" or whatever which went away when using "solr start -c" instead.
>
> But now I'm trying to use the random function since that was one of the
> functions used in your example.
>
> random(tx_header, q="*:*", rows="100", fl="countyname")
>
> I posted that directly in the "stream" section of the solr admin UI. This
> is all on linux, with solr 7.1.0 and 7.2.1 (tried several versions in case
> it was a bug in one)
>
> I get back an error message:
> *sort param could not be parsed as a query, and is not a field that exists
> in the index: random_-255009774*
>
> I'm not passing in any sort field anywhere. But the solr logs show these
> three log entries:
>
> 2018-03-01 21:41:18.954 INFO  (qtp257513673-21) [c:tx_header s:shard1
> r:core_node2 x:tx_header_shard1_replica_n1] o.a.s.c.S.Request
> [tx_header_shard1_replica_n1]  webapp=/solr path=/select
> params={q=*:*&_stateVer_=tx_header:6&fl=countyname
> *&sort=random_-255009774+asc*&rows=100&wt=javabin&version=2} status=400
> QTime=19
>
> 2018-03-01 21:41:18.966 ERROR (qtp257513673-17) [c:tx_header s:shard1
> r:core_node2 x:tx_header_shard1_replica_n1] o.a.s.c.s.i.CloudSolrClient
> Request to collection [tx_header] failed due to (400)
> org.apache.solr.client.solrj.impl.HttpSolrClient$RemoteSolrException:
> Error
> from server at http://192.168.13.31:8983/solr/tx_header: sort param could
> not be parsed as a query, and is not a field that exists in the index:
> random_-255009774, retry? 0
>
> 2018-03-01 21:41:18.968 ERROR (qtp257513673-17) [c:tx_header s:shard1
> r:core_node2 x:tx_header_shard1_replica_n1] o.a.s.c.s.i.s.ExceptionStream
> java.io.IOException:
> org.apache.solr.client.solrj.impl.HttpSolrClient$RemoteSolrException:
> Error
> from server at http://192.168.13.31:8983/solr/tx_header: sort param could
> not be parsed as a query, and is not a field that exists in the index:
> random_-255009774
>
>
> So basically it looks like solr is injecting the "sort=random_" stuff into
> my query and of course that is failing on the search since that
> field/column doesn't exist in my schema. Everytime I run the random
> function, I get a slightly different field name that it injects, but they
> all start with "random_" etc.
>
> I have tried adding my own sort field instead, hoping solr wouldn't inject
> one for me, but it still injected a random sort fieldname:
> random(tx_header, q="*:*", rows="100", fl="countyname", sort="countyname
> asc")
>
>
> Assuming I can fix that whole problem, my second question is: can I add
> multiple "fq=" parameters to the random function? I build a pretty
> complicated query using many fq= fields, and then want to run some stats on
> that hitlist; so somehow I have to pass in the query that made up the exact
> hitlist to these various functions, but when I used multiple "fq=" values
> it only seemed to use the last one I specified and just ignored all the
> previous fq's?
>
> Thanks in advance for any comments/suggestions...!
>
>
>
>
> On Fri, Feb 23, 2018 at 5:59 PM, Joel Bernstein <jo...@gmail.com>
> wrote:
>
> > This is going to be a complex answer because Solr actually now has
> > multiple ways of doing regression analysis as part of the Streaming
> > Expression statistical programming library. The basic documentation is
> > here:
> >
> > https://lucene.apache.org/solr/guide/7_2/statistical-programming.html
> >
> > Here is a sample expression that performs a simple linear regression in
> > Solr 7.2:
> >
> > let(a=random(collection1, q="any query", rows="15000", fl="fieldA, fieldB"),
> >     b=col(a, fieldA),
> >     c=col(a, fieldB),
> >     d=regress(b, c))
> >
> >
> > The expression above takes a random sample of 15000 results from
> > collection1. The result set will include fieldA and fieldB in each
> > record. The result set is stored in variable "a".
> >
> > Then the "col" function creates arrays of numbers from the results stored
> > in variable a. The values in fieldA are stored in the variable "b". The
> > values in fieldB are stored in variable "c".
> >
> > Then the regress function performs a simple linear regression on arrays
> > stored in variables "b" and "c".
> >
> > The output of the regress function is a map containing the regression
> > result. This result includes RSquared and other attributes of the
> > regression model such as R (correlation), slope, y intercept etc...
> >
> >
> >
> >
> >
> >
> >
> >
> >
> > Joel Bernstein
> > http://joelsolr.blogspot.com/
> >
> > On Fri, Feb 23, 2018 at 3:10 PM, John Smith <lo...@gmail.com>
> wrote:
> >
> > > > Hi Joel, thanks for the answer. I'm not really a stats guy, but the end
> > > > result of all this is supposed to be obtaining R^2. Is there no way of
> > > > obtaining this value, then (short of iterating over all the results in
> > > > the hitlist and calculating it myself)?
> > >
> > > On Fri, Feb 23, 2018 at 12:26 PM, Joel Bernstein <jo...@gmail.com>
> > > wrote:
> > >
> > > > Typically SSE is the sum of the squared errors of the prediction in a
> > > > regression analysis. The stats component doesn't perform regression,
> > > > although it might be a nice feature.
> > > >
> > > >
> > > >
> > > > Joel Bernstein
> > > > http://joelsolr.blogspot.com/
> > > >
> > > > On Fri, Feb 23, 2018 at 12:17 PM, John Smith <lo...@gmail.com>
> > > wrote:
> > > >
> > > > > I'm using solr, and enabling stats as per this page:
> > > > > https://lucene.apache.org/solr/guide/6_6/the-stats-component.html
> > > > >
> > > > > > I want to get more stat values though. Specifically I'm looking for
> > > > > > r-squared (coefficient of determination). This value is not present
> > > > > > in solr, however some of the pieces used to calculate r^2 are in
> > > > > > the stats element, for example:
> > > > >
> > > > > <double name="min">0.0</double>
> > > > > <double name="max">10.0</double>
> > > > > <long name="count">15</long>
> > > > > <long name="missing">17</long>
> > > > > <double name="sum">85.0</double>
> > > > > <double name="sumOfSquares">603.0</double>
> > > > > <double name="mean">5.666666666666667</double>
> > > > > <double name="stddev">2.943920288775949</double>
> > > > >
> > > > >
> > > > > > So I have the sumOfSquares available (SST), and using this
> > > > > > calculation, I can get R^2:
> > > > > >
> > > > > > R^2 = 1 - SSE/SST
> > > > > >
> > > > > > All I need then is SSE. Is there anyway I can get SSE from those
> > > > > > other stats in solr?
> > > > >
> > > > > Thanks in advance!
> > > > >
> > > >
> > >
> >
>

Re: statistics in hitlist

Posted by John Smith <lo...@gmail.com>.
Joel, thanks for the pointers to the streaming feature. I had no idea Solr
had that (and I also just discovered the very interesting SQL feature! I will
be sure to investigate that in more detail in the future).

However, I'm having some trouble getting basic streaming functions working.
I've already figured out that I had to move to "solr cloud" instead of
"solr standalone" because I was getting errors about "cannot find zk
instance" or similar, which went away when using "solr start -c" instead.

But now I'm trying to use the random function since that was one of the
functions used in your example.

random(tx_header, q="*:*", rows="100", fl="countyname")

I posted that directly in the "stream" section of the Solr admin UI. This
is all on Linux, with Solr 7.1.0 and 7.2.1 (I tried several versions in case
it was a bug in one).

I get back an error message:
*sort param could not be parsed as a query, and is not a field that exists
in the index: random_-255009774*

I'm not passing in any sort field anywhere. But the solr logs show these
three log entries:

2018-03-01 21:41:18.954 INFO  (qtp257513673-21) [c:tx_header s:shard1
r:core_node2 x:tx_header_shard1_replica_n1] o.a.s.c.S.Request
[tx_header_shard1_replica_n1]  webapp=/solr path=/select
params={q=*:*&_stateVer_=tx_header:6&fl=countyname
*&sort=random_-255009774+asc*&rows=100&wt=javabin&version=2} status=400
QTime=19

2018-03-01 21:41:18.966 ERROR (qtp257513673-17) [c:tx_header s:shard1
r:core_node2 x:tx_header_shard1_replica_n1] o.a.s.c.s.i.CloudSolrClient
Request to collection [tx_header] failed due to (400)
org.apache.solr.client.solrj.impl.HttpSolrClient$RemoteSolrException: Error
from server at http://192.168.13.31:8983/solr/tx_header: sort param could
not be parsed as a query, and is not a field that exists in the index:
random_-255009774, retry? 0

2018-03-01 21:41:18.968 ERROR (qtp257513673-17) [c:tx_header s:shard1
r:core_node2 x:tx_header_shard1_replica_n1] o.a.s.c.s.i.s.ExceptionStream
java.io.IOException:
org.apache.solr.client.solrj.impl.HttpSolrClient$RemoteSolrException: Error
from server at http://192.168.13.31:8983/solr/tx_header: sort param could
not be parsed as a query, and is not a field that exists in the index:
random_-255009774


So basically it looks like Solr is injecting the "sort=random_" parameter into
my query, and of course that fails on the search since that
field/column doesn't exist in my schema. Every time I run the random
function, I get a slightly different field name that it injects, but they
all start with "random_".

I have tried adding my own sort field instead, hoping solr wouldn't inject
one for me, but it still injected a random sort fieldname:
random(tx_header, q="*:*", rows="100", fl="countyname", sort="countyname
asc")
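
One possible cause, offered as an assumption rather than a confirmed
diagnosis: the random function appears to sort on a field named
random_<seed>, and that name only resolves if the schema has a dynamic
field "random_*" backed by solr.RandomSortField (the stock configsets
include one; a hand-written schema may not). Below is a minimal sketch
of a Schema API payload that would add such a field. The field type name
"random" and the helper name are illustrative, not taken from this thread.

```python
import json


def random_sort_schema_commands(type_name="random"):
    """Build Schema API commands that add a random_* dynamic field.

    type_name is an illustrative name for the new field type.
    """
    return {
        # Field type backed by Solr's RandomSortField class.
        "add-field-type": {
            "name": type_name,
            "class": "solr.RandomSortField",
            "indexed": True,
        },
        # Any field name starting with "random_" maps to that type.
        "add-dynamic-field": {
            "name": "random_*",
            "type": type_name,
        },
    }


payload = json.dumps(random_sort_schema_commands(), indent=2)
print(payload)
```

The printed JSON would be POSTed to /solr/tx_header/schema with a
Content-Type of application/json, assuming the collection uses a managed
schema. With a hand-edited schema file, the equivalent fieldType and
dynamicField entries would go into that file instead.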


Assuming I can fix that whole problem, my second question is: can I add
multiple "fq=" parameters to the random function? I build a pretty
complicated query using many fq= fields, and then want to run some stats on
that hitlist, so somehow I have to pass the query that made up the exact
hitlist to these various functions. But when I used multiple "fq=" values,
it only seemed to use the last one I specified and ignored all the
previous ones.
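
A possible workaround, sketched under assumptions: if only one fq value
is honored, the separate filters can be folded into a single fq joined
with AND, which selects the same documents for ordinary positive
lucene-syntax clauses. The field names in the example (state, year) are
hypothetical. The sketch does not cover filters that are purely negative
(such as -field:value), that use local params, or that contain double
quotes.

```python
def combine_filters(filters):
    """Join several filter queries into one fq value using AND."""
    cleaned = [f.strip() for f in filters if f and f.strip()]
    if not cleaned:
        raise ValueError("at least one filter query is required")
    for f in cleaned:
        # Quoting inside the expression is out of scope for this sketch.
        if '"' in f:
            raise ValueError("filters containing double quotes are not handled")
    # Parenthesize each clause so its own operators stay grouped.
    return " AND ".join(f"({f})" for f in cleaned)


def random_expression(collection, q, filters, rows, fl):
    """Build a random(...) streaming expression with a single combined fq."""
    fq = combine_filters(filters)
    return f'random({collection}, q="{q}", fq="{fq}", rows="{rows}", fl="{fl}")'


expr = random_expression(
    "tx_header", "*:*", ["state:TX", "year:[2010 TO 2018]"], 100, "countyname"
)
print(expr)
```

One trade-off: a combined fq is cached as a single filter entry, so it
does not reuse the per-filter cache entries that separate fq parameters
would share with the original search.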

Thanks in advance for any comments/suggestions...!




On Fri, Feb 23, 2018 at 5:59 PM, Joel Bernstein <jo...@gmail.com> wrote:

> This is going to be a complex answer because Solr actually now has multiple
> ways of doing regression analysis as part of the Streaming Expression
> statistical programming library. The basic documentation is here:
>
> https://lucene.apache.org/solr/guide/7_2/statistical-programming.html
>
> Here is a sample expression that performs a simple linear regression in
> Solr 7.2:
>
> let(a=random(collection1, q="any query", rows="15000", fl="fieldA,
> fieldB"),
>     b=col(a, fieldA),
>     c=col(a, fieldB),
>     d=regress(b, c))
>
>
> The expression above takes a random sample of 15000 results from
> collection1. The result set will include fieldA and fieldB in each record.
> The result set is stored in variable "a".
>
> Then the "col" function creates arrays of numbers from the results stored
> in variable a. The values in fieldA are stored in the variable "b". The
> values in fieldB are stored in variable "c".
>
> Then the regress function performs a simple linear regression on arrays
> stored in variables "b" and "c".
>
> The output of the regress function is a map containing the regression
> result. This result includes RSquared and other attributes of the
> regression model such as R (correlation), slope, y intercept etc...
>
>
>
>
>
>
>
>
>
> Joel Bernstein
> http://joelsolr.blogspot.com/
>
> On Fri, Feb 23, 2018 at 3:10 PM, John Smith <lo...@gmail.com> wrote:
>
> > Hi Joel, thanks for the answer. I'm not really a stats guy, but the end
> > result of all this is supposed to be obtaining R^2. Is there no way of
> > obtaining this value, then (short of iterating over all the results in
> the
> > hitlist and calculating it myself)?
> >
> > On Fri, Feb 23, 2018 at 12:26 PM, Joel Bernstein <jo...@gmail.com>
> > wrote:
> >
> > > Typically SSE is the sum of the squared errors of the prediction in a
> > > regression analysis. The stats component doesn't perform regression,
> > > although it might be a nice feature.
> > >
> > >
> > >
> > > Joel Bernstein
> > > http://joelsolr.blogspot.com/
> > >
> > > On Fri, Feb 23, 2018 at 12:17 PM, John Smith <lo...@gmail.com>
> > wrote:
> > >
> > > > I'm using solr, and enabling stats as per this page:
> > > > https://lucene.apache.org/solr/guide/6_6/the-stats-component.html
> > > >
> > > > I want to get more stat values though. Specifically I'm looking for
> > > > r-squared (coefficient of determination). This value is not present
> in
> > > > solr, however some of the pieces used to calculate r^2 are in the
> stats
> > > > element, for example:
> > > >
> > > > <double name="min">0.0</double>
> > > > <double name="max">10.0</double>
> > > > <long name="count">15</long>
> > > > <long name="missing">17</long>
> > > > <double name="sum">85.0</double>
> > > > <double name="sumOfSquares">603.0</double>
> > > > <double name="mean">5.666666666666667</double>
> > > > <double name="stddev">2.943920288775949</double>
> > > >
> > > >
> > > > So I have the sumOfSquares available (SST), and using this
> > calculation, I
> > > > can get R^2:
> > > >
> > > > R^2 = 1 - SSE/SST
> > > >
> > > > All I need then is SSE. Is there anyway I can get SSE from those
> other
> > > > stats in solr?
> > > >
> > > > Thanks in advance!
> > > >
> > >
> >
>

Re: statistics in hitlist

Posted by Joel Bernstein <jo...@gmail.com>.
This is going to be a complex answer because Solr actually now has multiple
ways of doing regression analysis as part of the Streaming Expression
statistical programming library. The basic documentation is here:

https://lucene.apache.org/solr/guide/7_2/statistical-programming.html

Here is a sample expression that performs a simple linear regression in
Solr 7.2:

let(a=random(collection1, q="any query", rows="15000", fl="fieldA, fieldB"),
    b=col(a, fieldA),
    c=col(a, fieldB),
    d=regress(b, c))


The expression above takes a random sample of 15000 results from
collection1. The result set will include fieldA and fieldB in each record.
The result set is stored in variable "a".

Then the "col" function creates arrays of numbers from the results stored
in variable a. The values in fieldA are stored in the variable "b". The
values in fieldB are stored in variable "c".

Then the regress function performs a simple linear regression on arrays
stored in variables "b" and "c".

The output of the regress function is a map containing the regression
result. This result includes RSquared and other attributes of the
regression model, such as R (correlation), slope and y-intercept.
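
To make those attributes concrete, here is a minimal sketch of the
arithmetic behind a simple linear regression. It is a plain
least-squares fit written for illustration and is not Solr's
implementation; the function name and dictionary keys are assumptions,
not the keys of the map that regress returns.

```python
import math


def simple_linear_regression(x, y):
    """Least-squares fit of y = intercept + slope * x."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sums of squared deviations and cross-deviations from the means.
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sst = sum((yi - mean_y) ** 2 for yi in y)  # total sum of squares
    if sxx == 0 or sst == 0:
        raise ValueError("x and y must each contain at least two distinct values")
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # SSE: squared errors between observed y and the fitted line.
    sse = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    return {
        "slope": slope,
        "intercept": intercept,
        "r": sxy / math.sqrt(sxx * sst),
        "r_squared": 1 - sse / sst,
        "sse": sse,
        "sst": sst,
    }


result = simple_linear_regression([1, 2, 3], [1, 3, 2])
print(result)
```

Note that both SSE and SST are computed from the second variable and
the fitted line, which is why two arrays are needed.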









Joel Bernstein
http://joelsolr.blogspot.com/

On Fri, Feb 23, 2018 at 3:10 PM, John Smith <lo...@gmail.com> wrote:

> Hi Joel, thanks for the answer. I'm not really a stats guy, but the end
> result of all this is supposed to be obtaining R^2. Is there no way of
> obtaining this value, then (short of iterating over all the results in the
> hitlist and calculating it myself)?
>
> On Fri, Feb 23, 2018 at 12:26 PM, Joel Bernstein <jo...@gmail.com>
> wrote:
>
> > Typically SSE is the sum of the squared errors of the prediction in a
> > regression analysis. The stats component doesn't perform regression,
> > although it might be a nice feature.
> >
> >
> >
> > Joel Bernstein
> > http://joelsolr.blogspot.com/
> >
> > On Fri, Feb 23, 2018 at 12:17 PM, John Smith <lo...@gmail.com>
> wrote:
> >
> > > I'm using solr, and enabling stats as per this page:
> > > https://lucene.apache.org/solr/guide/6_6/the-stats-component.html
> > >
> > > I want to get more stat values though. Specifically I'm looking for
> > > r-squared (coefficient of determination). This value is not present in
> > > solr, however some of the pieces used to calculate r^2 are in the stats
> > > element, for example:
> > >
> > > <double name="min">0.0</double>
> > > <double name="max">10.0</double>
> > > <long name="count">15</long>
> > > <long name="missing">17</long>
> > > <double name="sum">85.0</double>
> > > <double name="sumOfSquares">603.0</double>
> > > <double name="mean">5.666666666666667</double>
> > > <double name="stddev">2.943920288775949</double>
> > >
> > >
> > > So I have the sumOfSquares available (SST), and using this
> calculation, I
> > > can get R^2:
> > >
> > > R^2 = 1 - SSE/SST
> > >
> > > All I need then is SSE. Is there anyway I can get SSE from those other
> > > stats in solr?
> > >
> > > Thanks in advance!
> > >
> >
>

Re: statistics in hitlist

Posted by John Smith <lo...@gmail.com>.
Hi Joel, thanks for the answer. I'm not really a stats guy, but the end
result of all this is supposed to be obtaining R^2. Is there no way of
obtaining this value, then (short of iterating over all the results in the
hitlist and calculating it myself)?

On Fri, Feb 23, 2018 at 12:26 PM, Joel Bernstein <jo...@gmail.com> wrote:

> Typically SSE is the sum of the squared errors of the prediction in a
> regression analysis. The stats component doesn't perform regression,
> although it might be a nice feature.
>
>
>
> Joel Bernstein
> http://joelsolr.blogspot.com/
>
> On Fri, Feb 23, 2018 at 12:17 PM, John Smith <lo...@gmail.com> wrote:
>
> > I'm using solr, and enabling stats as per this page:
> > https://lucene.apache.org/solr/guide/6_6/the-stats-component.html
> >
> > I want to get more stat values though. Specifically I'm looking for
> > r-squared (coefficient of determination). This value is not present in
> > solr, however some of the pieces used to calculate r^2 are in the stats
> > element, for example:
> >
> > <double name="min">0.0</double>
> > <double name="max">10.0</double>
> > <long name="count">15</long>
> > <long name="missing">17</long>
> > <double name="sum">85.0</double>
> > <double name="sumOfSquares">603.0</double>
> > <double name="mean">5.666666666666667</double>
> > <double name="stddev">2.943920288775949</double>
> >
> >
> > So I have the sumOfSquares available (SST), and using this calculation, I
> > can get R^2:
> >
> > R^2 = 1 - SSE/SST
> >
> > All I need then is SSE. Is there anyway I can get SSE from those other
> > stats in solr?
> >
> > Thanks in advance!
> >
>

Re: statistics in hitlist

Posted by Joel Bernstein <jo...@gmail.com>.
Typically SSE is the sum of the squared errors of the prediction in a
regression analysis. The stats component doesn't perform regression,
although it might be a nice feature.
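
A side note on the numbers in the question, as a sketch rather than a
statement about Solr internals: the sumOfSquares value looks like the raw
sum of x^2, not SST. Under that assumption, SST (the sum of squared
deviations from the mean) can still be derived from count, sum and
sumOfSquares, and the result reproduces the reported stddev when divided
by count - 1. SSE cannot be derived this way, because it depends on the
predictions of a fitted model.

```python
import math

# Values copied from the stats output quoted in this thread.
count = 15
total = 85.0            # "sum"
sum_of_squares = 603.0  # "sumOfSquares"
reported_stddev = 2.943920288775949


def total_sum_of_squares(n, s, ss):
    """SST = sum((x - mean)^2) = sum(x^2) - (sum(x))^2 / n."""
    return ss - (s * s) / n


sst = total_sum_of_squares(count, total, sum_of_squares)
# Sample standard deviation implied by SST (divides by n - 1).
sample_stddev = math.sqrt(sst / (count - 1))
print(sst, sample_stddev, reported_stddev)
```

Here SST comes out to roughly 121.33 rather than 603.0, so plugging
sumOfSquares directly into R^2 = 1 - SSE/SST would give a misleading
value even if SSE were available.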



Joel Bernstein
http://joelsolr.blogspot.com/

On Fri, Feb 23, 2018 at 12:17 PM, John Smith <lo...@gmail.com> wrote:

> I'm using solr, and enabling stats as per this page:
> https://lucene.apache.org/solr/guide/6_6/the-stats-component.html
>
> I want to get more stat values though. Specifically I'm looking for
> r-squared (coefficient of determination). This value is not present in
> solr, however some of the pieces used to calculate r^2 are in the stats
> element, for example:
>
> <double name="min">0.0</double>
> <double name="max">10.0</double>
> <long name="count">15</long>
> <long name="missing">17</long>
> <double name="sum">85.0</double>
> <double name="sumOfSquares">603.0</double>
> <double name="mean">5.666666666666667</double>
> <double name="stddev">2.943920288775949</double>
>
>
> So I have the sumOfSquares available (SST), and using this calculation, I
> can get R^2:
>
> R^2 = 1 - SSE/SST
>
> All I need then is SSE. Is there anyway I can get SSE from those other
> stats in solr?
>
> Thanks in advance!
>