You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@lucene.apache.org by ct...@apache.org on 2021/01/07 04:43:32 UTC

[lucene-solr] 01/02: Remove old doc left behind in branch merge; fix children list to pass the build

This is an automated email from the ASF dual-hosted git repository.

ctargett pushed a commit to branch jira/solr-13105-toMerge
in repository https://gitbox.apache.org/repos/asf/lucene-solr.git

commit eb6298091b0705c53b148ebeb149d2b21ddfa652
Author: Cassandra Targett <ct...@apache.org>
AuthorDate: Wed Jan 6 21:43:46 2021 -0600

    Remove old doc left behind in branch merge; fix children list to pass the build
---
 solr/solr-ref-guide/src/math-expressions.adoc |   2 +-
 solr/solr-ref-guide/src/vectorization.adoc    | 383 --------------------------
 2 files changed, 1 insertion(+), 384 deletions(-)

diff --git a/solr/solr-ref-guide/src/math-expressions.adoc b/solr/solr-ref-guide/src/math-expressions.adoc
index 3554c90..343696e 100644
--- a/solr/solr-ref-guide/src/math-expressions.adoc
+++ b/solr/solr-ref-guide/src/math-expressions.adoc
@@ -1,5 +1,5 @@
 = Streaming Expressions and Math Expressions
-:page-children: visualization, math-start, loading, search-sample, transform, scalar-math, vector-math, variables, matrix-math, term-vectors, statistics, probability-distributions, simulations, time-series, regression, numerical-analysis, curve-fitting, dsp, machine-learning, computational-geometry
+:page-children: visualization, math-start, loading, search-sample, transform, scalar-math, vector-math, variables, matrix-math, term-vectors, statistics, probability-distributions, simulations, time-series, regression, numerical-analysis, curve-fitting, dsp, machine-learning, computational-geometry, logs
 
 // Licensed to the Apache Software Foundation (ASF) under one
 // or more contributor license agreements.  See the NOTICE file
diff --git a/solr/solr-ref-guide/src/vectorization.adoc b/solr/solr-ref-guide/src/vectorization.adoc
deleted file mode 100644
index 26a6f60..0000000
--- a/solr/solr-ref-guide/src/vectorization.adoc
+++ /dev/null
@@ -1,383 +0,0 @@
-= Streams and Vectorization
-// Licensed to the Apache Software Foundation (ASF) under one
-// or more contributor license agreements.  See the NOTICE file
-// distributed with this work for additional information
-// regarding copyright ownership.  The ASF licenses this file
-// to you under the Apache License, Version 2.0 (the
-// "License"); you may not use this file except in compliance
-// with the License.  You may obtain a copy of the License at
-//
-//   http://www.apache.org/licenses/LICENSE-2.0
-//
-// Unless required by applicable law or agreed to in writing,
-// software distributed under the License is distributed on an
-// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
-// KIND, either express or implied.  See the License for the
-// specific language governing permissions and limitations
-// under the License.
-
-This section of the user guide explores techniques
-for retrieving streams of data from Solr and vectorizing the
-numeric fields.
-
-See the section <<term-vectors.adoc#term-vectors,Text Analysis and Term Vectors>> which describes how to
-vectorize text fields.
-
-== Streams
-
-Streaming Expressions has a wide range of stream sources that can be used to
-retrieve data from SolrCloud collections. Math expressions can be used
-to vectorize and analyze the results sets.
-
-Below are some of the key stream sources:
-
-* *`facet`*: Multi-dimensional aggregations are a powerful tool for generating
-co-occurrence counts for categorical data. The `facet` function uses the JSON facet API
-under the covers to provide fast, distributed, multi-dimension aggregations. With math expressions
-the aggregated results can be pivoted into a co-occurance matrix which can be mined for
-correlations and hidden similarities within the data.
-
-* *`random`*: Random sampling is widely used in statistics, probability and machine learning.
-The `random` function returns a random sample of search results that match a
-query. The random samples can be vectorized and operated on by math expressions and the results
-can be used to describe and make inferences about the entire population.
-
-* *`timeseries`*: The `timeseries`
-expression provides fast distributed time series aggregations, which can be
-vectorized and analyzed with math expressions.
-
-* *`knnSearch`*: K-nearest neighbor is a core machine learning algorithm. The `knnSearch`
-function is a specialized knn algorithm optimized to find the k-nearest neighbors of a document in
-a distributed index. Once the nearest neighbors are retrieved they can be vectorized
-and operated on by machine learning and text mining algorithms.
-
-* *`sql`*: SQL is the primary query language used by data scientists. The `sql` function supports
-data retrieval using a subset of SQL which includes both full text search and
-fast distributed aggregations. The result sets can then be vectorized and operated
-on by math expressions.
-
-* *`jdbc`*: The `jdbc` function allows data from any JDBC compliant data source to be combined with
-streams originating from Solr. Result sets from outside data sources can be vectorized and operated
-on by math expressions in the same manner as result sets originating from Solr.
-
-* *`topic`*: Messaging is an important foundational technology for large scale computing. The `topic`
-function provides publish/subscribe messaging capabilities by treating
-SolrCloud as a distributed message queue. Topics are extremely powerful
-because they allow subscription by query. Topics can be use to support a broad set of
-use cases including bulk text mining operations and AI alerting.
-
-* *`nodes`*: Graph queries are frequently used by recommendation engines and are an important
-machine learning tool. The `nodes` function provides fast, distributed, breadth
-first graph traversal over documents in a SolrCloud collection. The node sets collected
-by the `nodes` function can be operated on by statistical and machine learning expressions to
-gain more insight into the graph.
-
-* *`search`*: Ranked search results are a powerful tool for finding the most relevant
-documents from a large document corpus. The `search` expression
-returns the top N ranked search results that match any
-Solr query, including geo-spatial queries. The smaller set of relevant
-documents can then be explored with statistical, machine learning and
-text mining expressions to gather insights about the data set.
-
-== Assigning Streams to Variables
-
-The output of any streaming expression can be set to a variable.
-Below is a very simple example using the `random` function to fetch
-three random samples from collection1. The random samples are returned
-as tuples which contain name/value pairs.
-
-
-[source,text]
-----
-let(a=random(collection1, q="*:*", rows="3", fl="price_f"))
-----
-
-When this expression is sent to the `/stream` handler it responds with:
-
-[source,json]
-----
-{
-  "result-set": {
-    "docs": [
-      {
-        "a": [
-          {
-            "price_f": 0.7927976
-          },
-          {
-            "price_f": 0.060795486
-          },
-          {
-            "price_f": 0.55128294
-          }
-        ]
-      },
-      {
-        "EOF": true,
-        "RESPONSE_TIME": 11
-      }
-    ]
-  }
-}
-----
-
-== Creating a Vector with the col Function
-
-The `col` function iterates over a list of tuples and copies the values
-from a specific column into an array.
-
-The output of the `col` function is an numeric array that can be set to a
-variable and operated on by math expressions.
-
-Below is an example of the `col` function:
-
-[source,text]
-----
-let(a=random(collection1, q="*:*", rows="3", fl="price_f"),
-    b=col(a, price_f))
-----
-
-[source,json]
-----
-{
-  "result-set": {
-    "docs": [
-      {
-        "b": [
-          0.42105234,
-          0.85237443,
-          0.7566981
-        ]
-      },
-      {
-        "EOF": true,
-        "RESPONSE_TIME": 9
-      }
-    ]
-  }
-}
-----
-
-== Applying Math Expressions to the Vector
-
-Once a vector has been created any math expression that operates on vectors
-can be applied. In the example below the `mean` function is applied to
-the vector assigned to variable *`b`*.
-
-[source,text]
-----
-let(a=random(collection1, q="*:*", rows="15000", fl="price_f"),
-    b=col(a, price_f),
-    c=mean(b))
-----
-
-When this expression is sent to the `/stream` handler it responds with:
-
-[source,json]
-----
-{
-  "result-set": {
-    "docs": [
-      {
-        "c": 0.5016035594638814
-      },
-      {
-        "EOF": true,
-        "RESPONSE_TIME": 306
-      }
-    ]
-  }
-}
-----
-
-== Creating Matrices
-
-Matrices can be created by vectorizing multiple numeric fields
-and adding them to a matrix. The matrices can then be operated on by
-any math expression that operates on matrices.
-
-[TIP]
-====
-Note that this section deals with the creation of matrices
-from numeric data. The section <<term-vectors.adoc#term-vectors,Text Analysis and Term Vectors>> describes how to build TF-IDF term vector matrices from text fields.
-====
-
-Below is a simple example where four random samples are taken
-from different sub-populations in the data. The `price_f` field of
-each random sample is
-vectorized and the vectors are added as rows to a matrix.
-Then the `sumRows`
-function is applied to the matrix to return a vector containing
-the sum of each row.
-
-[source,text]
-----
-let(a=random(collection1, q="market:A", rows="5000", fl="price_f"),
-    b=random(collection1, q="market:B", rows="5000", fl="price_f"),
-    c=random(collection1, q="market:C", rows="5000", fl="price_f"),
-    d=random(collection1, q="market:D", rows="5000", fl="price_f"),
-    e=col(a, price_f),
-    f=col(b, price_f),
-    g=col(c, price_f),
-    h=col(d, price_f),
-    i=matrix(e, f, g, h),
-    j=sumRows(i))
-----
-
-When this expression is sent to the `/stream` handler it responds with:
-
-[source,json]
-----
-{
-  "result-set": {
-    "docs": [
-      {
-        "j": [
-          154390.1293375,
-          167434.89453,
-          159293.258493,
-          149773.42769,
-        ]
-      },
-      {
-        "EOF": true,
-        "RESPONSE_TIME": 9
-      }
-    ]
-  }
-}
-----
-
-== Facet Co-occurrence Matrices
-
-The `facet` function can be used to quickly perform multi-dimension aggregations of categorical data from
-records stored in a SolrCloud collection. These multi-dimension aggregations can represent co-occurrence
-counts for the values in the dimensions. The `pivot` function can be used to move two dimensional
-aggregations into a co-occurrence matrix. The co-occurrence matrix can then be clustered or analyzed for
-correlations to learn about the hidden connections within the data.
-
-In the example below the `facet` expression is used to generate a two dimensional faceted aggregation.
-The first dimension is the US State that a car was purchased in and the second dimension is the car model.
-This two dimensional facet generates the co-occurrence counts for the number of times a particular car model
-was purchased in a particular state.
-
-
-[source,text]
-----
-facet(collection1, q="*:*", buckets="state, model", bucketSorts="count(*) desc", rows=5, count(*))
-----
-
-When this expression is sent to the `/stream` handler it responds with:
-
-[source,json]
-----
-{
-  "result-set": {
-    "docs": [
-      {
-        "state": "NY",
-        "model": "camry",
-        "count(*)": 13342
-      },
-      {
-        "state": "NJ",
-        "model": "accord",
-        "count(*)": 13002
-      },
-      {
-        "state": "NY",
-        "model": "civic",
-        "count(*)": 12901
-      },
-      {
-        "state": "CA",
-        "model": "focus",
-        "count(*)": 12892
-      },
-      {
-        "state": "TX",
-        "model": "f150",
-        "count(*)": 12871
-      },
-      {
-        "EOF": true,
-        "RESPONSE_TIME": 171
-      }
-    ]
-  }
-}
-----
-
-The `pivot` function can be used to move the facet results into a co-occurrence matrix. In the example below
-The `pivot` function is used to create a matrix where the rows of the matrix are the US States (state) and the
-columns of the matrix are the car models (model). The values in the matrix are the co-occurrence counts (count(*))
- from the facet results.  Once the co-occurrence matrix has been created the US States can be clustered
-by car model, or the matrix can be transposed and car models can be clustered by the US States
-where they were bought.
-
-[source,text]
-----
-let(a=facet(collection1, q="*:*", buckets="state, model", bucketSorts="count(*) desc", rows="-1", count(*)),
-    b=pivot(a, state, model, count(*)),
-    c=kmeans(b, 7))
-----
-
-== Latitude / Longitude Vectors
-
-The `latlonVectors` function wraps a list of tuples and parses a lat/lon location field into
-a matrix of lat/long vectors. Each row in the matrix is a vector that contains the lat/long
-pair for the corresponding tuple in the list. The row labels for the matrix are
-automatically set to the `id` field in the tuples. The lat/lon matrix can then be operated
-on by distance-based machine learning functions using the `haversineMeters` distance measure.
-
-The `latlonVectors` function takes two parameters: a list of tuples and a named parameter called
-`field`, which tells the `latlonVectors` function which field to parse the lat/lon
-vectors from.
-
-Below is an example of the `latlonVectors`.
-
-[source,text]
-----
-let(a=random(collection1, q="*:*", fl="id, loc_p", rows="5"),
-    b=latlonVectors(a, field="loc_p"))
-----
-
-When this expression is sent to the `/stream` handler it responds with:
-
-[source,json]
-----
-{
-  "result-set": {
-    "docs": [
-      {
-        "b": [
-          [
-            42.87183530723629,
-            76.74102353397778
-          ],
-          [
-            42.91372904094898,
-            76.72874889228416
-          ],
-          [
-            42.911528804897564,
-            76.70537292977619
-          ],
-          [
-            42.91143870500213,
-            76.74749913047408
-          ],
-          [
-            42.904666267479705,
-            76.73933236046092
-          ]
-        ]
-      },
-      {
-        "EOF": true,
-        "RESPONSE_TIME": 21
-      }
-    ]
-  }
-}
-----