Posted to commits@lucene.apache.org by jb...@apache.org on 2019/06/15 23:09:49 UTC
[lucene-solr] branch SOLR-13105-visual updated: SOLR-13105: Add search-sample.adoc

This is an automated email from the ASF dual-hosted git repository.

jbernste pushed a commit to branch SOLR-13105-visual
in repository https://gitbox.apache.org/repos/asf/lucene-solr.git


The following commit(s) were added to refs/heads/SOLR-13105-visual by this push:
     new 25606c0  SOLR-13105: Add search-sample.adoc
25606c0 is described below

commit 25606c061473ca2e9341e9fbe53bf4ecb4d05402
Author: Joel Bernstein <jb...@apache.org>
AuthorDate: Sat Jun 15 19:09:30 2019 -0400

    SOLR-13105: Add search-sample.adoc
---
 solr/solr-ref-guide/src/math-expressions.adoc |   6 +-
 solr/solr-ref-guide/src/search-sample.adoc    |  33 +++
 solr/solr-ref-guide/src/vectorization.adoc    | 383 --------------------------
 3 files changed, 36 insertions(+), 386 deletions(-)

diff --git a/solr/solr-ref-guide/src/math-expressions.adoc b/solr/solr-ref-guide/src/math-expressions.adoc
index 3f7d46f..8d664ed 100644
--- a/solr/solr-ref-guide/src/math-expressions.adoc
+++ b/solr/solr-ref-guide/src/math-expressions.adoc
@@ -1,5 +1,5 @@
 = Math Expressions
-:page-children: visualization, math-start, scalar-math, vector-math, variables, matrix-math, vectorization, term-vectors, statistics, probability-distributions, simulations, time-series, regression, numerical-analysis, curve-fitting, dsp, machine-learning, computational-geometry
+:page-children: visualization, math-start, scalar-math, vector-math, variables, matrix-math, search-sample, term-vectors, statistics, probability-distributions, simulations, time-series, regression, numerical-analysis, curve-fitting, dsp, machine-learning, computational-geometry
 
 // Licensed to the Apache Software Foundation (ASF) under one
 // or more contributor license agreements.  See the NOTICE file
@@ -27,9 +27,9 @@ image::images/math-expressions/curve-fitting.png[]
 
 *<<visualization.adoc#visualization,Visualizations>>*: Gallery of Streaming Expressions and Math Expressions visualizations.
 
-*<<math-start.adoc#getting-started,Getting Started>>*: Getting started with Streaming Expressions, Math Expressions and Visualization.
+*<<math-start.adoc#getting-started,Getting Started>>*: Getting started with Streaming Expressions, Math Expressions and visualization.
 
-*<<matrix-math.adoc#search-sample,Searching, Sampling and Aggregation>>*:  Searching, random sampling, aggregation and visualization of result sets.
+*<<search-sample.adoc#search-sample,Searching, Sampling and Aggregation>>*: Searching, random sampling and aggregation with Streaming Expressions, and visualization of the results.
 
 *<<scalar-math.adoc#scalar-math,Scalar Math>>*: Math functions and visualization applied to numbers.
 
diff --git a/solr/solr-ref-guide/src/search-sample.adoc b/solr/solr-ref-guide/src/search-sample.adoc
new file mode 100644
index 0000000..e341557
--- /dev/null
+++ b/solr/solr-ref-guide/src/search-sample.adoc
@@ -0,0 +1,33 @@
+= Searching, Sampling and Aggregation
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+This section of the user guide explores techniques
+for retrieving streams of data from Solr and vectorizing the
+numeric fields.
+
+See the section <<term-vectors.adoc#term-vectors,Text Analysis and Term Vectors>> which describes how to
+vectorize text fields.
+
+== Searching
+
+
+== Sampling
+
+
+== Aggregations
+
diff --git a/solr/solr-ref-guide/src/vectorization.adoc b/solr/solr-ref-guide/src/vectorization.adoc
deleted file mode 100644
index 5c08a58..0000000
--- a/solr/solr-ref-guide/src/vectorization.adoc
+++ /dev/null
@@ -1,383 +0,0 @@
-= Streams and Vectorization
-// Licensed to the Apache Software Foundation (ASF) under one
-// or more contributor license agreements.  See the NOTICE file
-// distributed with this work for additional information
-// regarding copyright ownership.  The ASF licenses this file
-// to you under the Apache License, Version 2.0 (the
-// "License"); you may not use this file except in compliance
-// with the License.  You may obtain a copy of the License at
-//
-//   http://www.apache.org/licenses/LICENSE-2.0
-//
-// Unless required by applicable law or agreed to in writing,
-// software distributed under the License is distributed on an
-// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
-// KIND, either express or implied.  See the License for the
-// specific language governing permissions and limitations
-// under the License.
-
-This section of the user guide explores techniques
-for retrieving streams of data from Solr and vectorizing the
-numeric fields.
-
-See the section <<term-vectors.adoc#term-vectors,Text Analysis and Term Vectors>> which describes how to
-vectorize text fields.
-
-== Streams
-
-Streaming Expressions has a wide range of stream sources that can be used to
-retrieve data from Solr Cloud collections. Math expressions can be used
-to vectorize and analyze the result sets.
-
-Below are some of the key stream sources:
-
-* *`facet`*: Multi-dimensional aggregations are a powerful tool for generating
-co-occurrence counts for categorical data. The `facet` function uses the JSON facet API
-under the covers to provide fast, distributed, multi-dimension aggregations. With math expressions
-the aggregated results can be pivoted into a co-occurrence matrix which can be mined for
-correlations and hidden similarities within the data.
-
-* *`random`*: Random sampling is widely used in statistics, probability and machine learning.
-The `random` function returns a random sample of search results that match a
-query. The random samples can be vectorized and operated on by math expressions and the results
-can be used to describe and make inferences about the entire population.
-
-* *`timeseries`*: The `timeseries`
-expression provides fast distributed time series aggregations, which can be
-vectorized and analyzed with math expressions.
-
-* *`knnSearch`*: K-nearest neighbor is a core machine learning algorithm. The `knnSearch`
-function is a specialized knn algorithm optimized to find the k-nearest neighbors of a document in
-a distributed index. Once the nearest neighbors are retrieved they can be vectorized
-and operated on by machine learning and text mining algorithms.
-
-* *`sql`*: SQL is the primary query language used by data scientists. The `sql` function supports
-data retrieval using a subset of SQL which includes both full text search and
-fast distributed aggregations. The result sets can then be vectorized and operated
-on by math expressions.
-
-* *`jdbc`*: The `jdbc` function allows data from any JDBC compliant data source to be combined with
-streams originating from Solr. Result sets from outside data sources can be vectorized and operated
-on by math expressions in the same manner as result sets originating from Solr.
-
-* *`topic`*: Messaging is an important foundational technology for large scale computing. The `topic`
-function provides publish/subscribe messaging capabilities by treating
-Solr Cloud as a distributed message queue. Topics are extremely powerful
-because they allow subscription by query. Topics can be used to support a broad set of
-use cases including bulk text mining operations and AI alerting.
-
-* *`nodes`*: Graph queries are frequently used by recommendation engines and are an important
-machine learning tool. The `nodes` function provides fast, distributed, breadth
-first graph traversal over documents in a Solr Cloud collection. The node sets collected
-by the `nodes` function can be operated on by statistical and machine learning expressions to
-gain more insight into the graph.
-
-* *`search`*: Ranked search results are a powerful tool for finding the most relevant
-documents from a large document corpus. The `search` expression
-returns the top N ranked search results that match any
-Solr query, including geo-spatial queries. The smaller set of relevant
-documents can then be explored with statistical, machine learning and
-text mining expressions to gather insights about the data set.
-
-== Assigning Streams to Variables
-
-The output of any streaming expression can be set to a variable.
-Below is a very simple example using the `random` function to fetch
-three random samples from collection1. The random samples are returned
-as tuples which contain name/value pairs.
-
-
-[source,text]
-----
-let(a=random(collection1, q="*:*", rows="3", fl="price_f"))
-----
-
-When this expression is sent to the `/stream` handler it responds with:
-
-[source,json]
-----
-{
-  "result-set": {
-    "docs": [
-      {
-        "a": [
-          {
-            "price_f": 0.7927976
-          },
-          {
-            "price_f": 0.060795486
-          },
-          {
-            "price_f": 0.55128294
-          }
-        ]
-      },
-      {
-        "EOF": true,
-        "RESPONSE_TIME": 11
-      }
-    ]
-  }
-}
-----
-
-== Creating a Vector with the col Function
-
-The `col` function iterates over a list of tuples and copies the values
-from a specific column into an array.
-
-The output of the `col` function is a numeric array that can be set to a
-variable and operated on by math expressions.
-
-Below is an example of the `col` function:
-
-[source,text]
-----
-let(a=random(collection1, q="*:*", rows="3", fl="price_f"),
-    b=col(a, price_f))
-----
-
-[source,json]
-----
-{
-  "result-set": {
-    "docs": [
-      {
-        "b": [
-          0.42105234,
-          0.85237443,
-          0.7566981
-        ]
-      },
-      {
-        "EOF": true,
-        "RESPONSE_TIME": 9
-      }
-    ]
-  }
-}
-----
-
-== Applying Math Expressions to the Vector
-
-Once a vector has been created any math expression that operates on vectors
-can be applied. In the example below the `mean` function is applied to
-the vector assigned to variable *`b`*.
-
-[source,text]
-----
-let(a=random(collection1, q="*:*", rows="15000", fl="price_f"),
-    b=col(a, price_f),
-    c=mean(b))
-----
-
-When this expression is sent to the `/stream` handler it responds with:
-
-[source,json]
-----
-{
-  "result-set": {
-    "docs": [
-      {
-        "c": 0.5016035594638814
-      },
-      {
-        "EOF": true,
-        "RESPONSE_TIME": 306
-      }
-    ]
-  }
-}
-----
-
-== Creating Matrices
-
-Matrices can be created by vectorizing multiple numeric fields
-and adding them to a matrix. The matrices can then be operated on by
-any math expression that operates on matrices.
-
-[TIP]
-====
-Note that this section deals with the creation of matrices
-from numeric data. The section <<term-vectors.adoc#term-vectors,Text Analysis and Term Vectors>> describes how to build TF-IDF term vector matrices from text fields.
-====
-
-Below is a simple example where four random samples are taken
-from different sub-populations in the data. The `price_f` field of
-each random sample is
-vectorized and the vectors are added as rows to a matrix.
-Then the `sumRows`
-function is applied to the matrix to return a vector containing
-the sum of each row.
-
-[source,text]
-----
-let(a=random(collection1, q="market:A", rows="5000", fl="price_f"),
-    b=random(collection1, q="market:B", rows="5000", fl="price_f"),
-    c=random(collection1, q="market:C", rows="5000", fl="price_f"),
-    d=random(collection1, q="market:D", rows="5000", fl="price_f"),
-    e=col(a, price_f),
-    f=col(b, price_f),
-    g=col(c, price_f),
-    h=col(d, price_f),
-    i=matrix(e, f, g, h),
-    j=sumRows(i))
-----
-
-When this expression is sent to the `/stream` handler it responds with:
-
-[source,json]
-----
-{
-  "result-set": {
-    "docs": [
-      {
-        "j": [
-          154390.1293375,
-          167434.89453,
-          159293.258493,
-          149773.42769,
-        ]
-      },
-      {
-        "EOF": true,
-        "RESPONSE_TIME": 9
-      }
-    ]
-  }
-}
-----
-
-== Facet Co-occurrence Matrices
-
-The `facet` function can be used to quickly perform multi-dimension aggregations of categorical data from
-records stored in a Solr Cloud collection. These multi-dimension aggregations can represent co-occurrence
-counts for the values in the dimensions. The `pivot` function can be used to move two dimensional
-aggregations into a co-occurrence matrix. The co-occurrence matrix can then be clustered or analyzed for
-correlations to learn about the hidden connections within the data.
-
-In the example below the `facet` expression is used to generate a two dimensional faceted aggregation.
-The first dimension is the US State that a car was purchased in and the second dimension is the car model.
-This two dimensional facet generates the co-occurrence counts for the number of times a particular car model
-was purchased in a particular state.
-
-
-[source,text]
-----
-facet(collection1, q="*:*", buckets="state, model", bucketSorts="count(*) desc", rows=5, count(*))
-----
-
-When this expression is sent to the `/stream` handler it responds with:
-
-[source,json]
-----
-{
-  "result-set": {
-    "docs": [
-      {
-        "state": "NY",
-        "model": "camry",
-        "count(*)": 13342
-      },
-      {
-        "state": "NJ",
-        "model": "accord",
-        "count(*)": 13002
-      },
-      {
-        "state": "NY",
-        "model": "civic",
-        "count(*)": 12901
-      },
-      {
-        "state": "CA",
-        "model": "focus",
-        "count(*)": 12892
-      },
-      {
-        "state": "TX",
-        "model": "f150",
-        "count(*)": 12871
-      },
-      {
-        "EOF": true,
-        "RESPONSE_TIME": 171
-      }
-    ]
-  }
-}
-----
-
-The `pivot` function can be used to move the facet results into a co-occurrence matrix. In the example below
-the `pivot` function is used to create a matrix where the rows of the matrix are the US States (state) and the
-columns of the matrix are the car models (model). The values in the matrix are the co-occurrence counts (count(*))
-from the facet results. Once the co-occurrence matrix has been created the US States can be clustered
-by car model, or the matrix can be transposed and car models can be clustered by the US States
-where they were bought.
-
-[source,text]
-----
-let(a=facet(collection1, q="*:*", buckets="state, model", bucketSorts="count(*) desc", rows="-1", count(*)),
-    b=pivot(a, state, model, count(*)),
-    c=kmeans(b, 7))
-----
-
-== Latitude / Longitude Vectors
-
-The `latlonVectors` function wraps a list of tuples and parses a lat/lon location field into
-a matrix of lat/lon vectors. Each row in the matrix is a vector that contains the lat/lon
-pair for the corresponding tuple in the list. The row labels for the matrix are
-automatically set to the `id` field in the tuples. The lat/lon matrix can then be operated
-on by distance-based machine learning functions using the `haversineMeters` distance measure.
-
-The `latlonVectors` function takes two parameters: a list of tuples and a named parameter called
-`field`, which tells the `latlonVectors` function which field to parse the lat/lon
-vectors from.
-
-Below is an example of the `latlonVectors` function.
-
-[source,text]
-----
-let(a=random(collection1, q="*:*", fl="id, loc_p", rows="5"),
-    b=latlonVectors(a, field="loc_p"))
-----
-
-When this expression is sent to the `/stream` handler it responds with:
-
-[source,json]
-----
-{
-  "result-set": {
-    "docs": [
-      {
-        "b": [
-          [
-            42.87183530723629,
-            76.74102353397778
-          ],
-          [
-            42.91372904094898,
-            76.72874889228416
-          ],
-          [
-            42.911528804897564,
-            76.70537292977619
-          ],
-          [
-            42.91143870500213,
-            76.74749913047408
-          ],
-          [
-            42.904666267479705,
-            76.73933236046092
-          ]
-        ]
-      },
-      {
-        "EOF": true,
-        "RESPONSE_TIME": 21
-      }
-    ]
-  }
-}
-----
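
The removed vectorization.adoc page describes `col` as copying one field from a list of tuples into an array, and `pivot` as moving two-dimensional facet results into a co-occurrence matrix. The Python sketch below mimics those two steps on the client side, using tuples copied from the sample responses in the diff above. It is only an illustration of the described behavior, not Solr's implementation: the helper names mirror the expression names for readability, and the sorted label order and the zero fill for missing cells are assumptions.

```python
def col(tuples, field):
    """Copy the values of one field from a list of tuples (dicts) into a list."""
    return [t[field] for t in tuples]


def mean(vector):
    """Arithmetic mean of a non-empty numeric list."""
    return sum(vector) / len(vector)


def pivot(tuples, row_field, col_field, value_field):
    """Build (row_labels, col_labels, matrix) from two-dimensional facet tuples.

    Assumptions for this sketch: labels are sorted alphabetically and
    cells with no matching tuple are filled with 0.
    """
    row_labels = sorted({t[row_field] for t in tuples})
    col_labels = sorted({t[col_field] for t in tuples})
    row_index = {label: i for i, label in enumerate(row_labels)}
    col_index = {label: j for j, label in enumerate(col_labels)}
    matrix = [[0] * len(col_labels) for _ in row_labels]
    for t in tuples:
        matrix[row_index[t[row_field]]][col_index[t[col_field]]] = t[value_field]
    return row_labels, col_labels, matrix


# Tuples copied from the sample `random` response in the removed page.
samples = [
    {"price_f": 0.7927976},
    {"price_f": 0.060795486},
    {"price_f": 0.55128294},
]

# Tuples copied from the sample `facet` response in the removed page.
facets = [
    {"state": "NY", "model": "camry", "count(*)": 13342},
    {"state": "NJ", "model": "accord", "count(*)": 13002},
    {"state": "NY", "model": "civic", "count(*)": 12901},
    {"state": "CA", "model": "focus", "count(*)": 12892},
    {"state": "TX", "model": "f150", "count(*)": 12871},
]

prices = col(samples, "price_f")          # vectorize one numeric field
average = mean(prices)                    # apply a math function to the vector
states, models, counts = pivot(facets, "state", "model", "count(*)")
```

Each row of `counts` corresponds to a state and each column to a car model, which is the layout the removed text feeds to `kmeans`.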