Posted to commits@lucene.apache.org by da...@apache.org on 2018/11/12 11:55:30 UTC

[21/50] [abbrv] lucene-solr:jira/http2: SOLR-12913: Add new facet expression and pivot docs

SOLR-12913: Add new facet expression and pivot docs


Project: http://git-wip-us.apache.org/repos/asf/lucene-solr/repo
Commit: http://git-wip-us.apache.org/repos/asf/lucene-solr/commit/531b1663
Tree: http://git-wip-us.apache.org/repos/asf/lucene-solr/tree/531b1663
Diff: http://git-wip-us.apache.org/repos/asf/lucene-solr/diff/531b1663

Branch: refs/heads/jira/http2
Commit: 531b16633acc8c398a20ca2a52b7ded3901702e6
Parents: ff1df8a
Author: Joel Bernstein <jb...@apache.org>
Authored: Wed Nov 7 15:07:21 2018 -0500
Committer: Joel Bernstein <jb...@apache.org>
Committed: Wed Nov 7 15:07:46 2018 -0500

----------------------------------------------------------------------
 .../src/stream-source-reference.adoc            | 14 +++-
 solr/solr-ref-guide/src/vectorization.adoc      | 80 ++++++++++++++++++++
 2 files changed, 90 insertions(+), 4 deletions(-)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/lucene-solr/blob/531b1663/solr/solr-ref-guide/src/stream-source-reference.adoc
----------------------------------------------------------------------
diff --git a/solr/solr-ref-guide/src/stream-source-reference.adoc b/solr/solr-ref-guide/src/stream-source-reference.adoc
index c31639a..c83991e 100644
--- a/solr/solr-ref-guide/src/stream-source-reference.adoc
+++ b/solr/solr-ref-guide/src/stream-source-reference.adoc
@@ -130,8 +130,12 @@ The `facet` function provides aggregations that are rolled up over buckets. Unde
 * `collection`: (Mandatory) Collection the facets will be aggregated from.
 * `q`: (Mandatory) The query to build the aggregations from.
 * `buckets`: (Mandatory) Comma separated list of fields to rollup over. The comma separated list represents the dimensions in a multi-dimensional rollup.
-* `bucketSorts`: Comma separated list of sorts to apply to each dimension in the buckets parameters. Sorts can be on the computed metrics or on the bucket values.
-* `bucketSizeLimit`: The number of buckets to include. This value is applied to each dimension. '-1' will fetch all the buckets.
+* `bucketSorts`: (Mandatory) Comma separated list of sorts to apply to each dimension in the buckets parameters. Sorts can be on the computed metrics or on the bucket values.
+* `rows`: (Default 10) The number of rows to return. '-1' will return all rows.
+* `offset`: (Default 0) The offset in the result set to start from.
+* `overfetch`: (Default 150) Over-fetching is used to provide accurate aggregations over high cardinality fields.
+* `method`: The JSON facet API aggregation method.
+* `bucketSizeLimit`: Sets the absolute number of rows to fetch. This is incompatible with `rows`, `offset` and `overfetch`. This value is applied to each dimension. '-1' will fetch all the buckets.
 * `metrics`: List of metrics to compute for the buckets. Currently supported metrics are `sum(col)`, `avg(col)`, `min(col)`, `max(col)`, `count(*)`.
 
 === facet Syntax
@@ -144,7 +148,7 @@ facet(collection1,
       q="*:*",
       buckets="a_s",
       bucketSorts="sum(a_i) desc",
-      bucketSizeLimit=100,
+      rows=100,
       sum(a_i),
       sum(a_f),
       min(a_i),
@@ -166,7 +170,8 @@ facet(collection1,
       q="*:*",
       buckets="year_i, month_i, day_i",
       bucketSorts="year_i desc, month_i desc, day_i desc",
-      bucketSizeLimit=100,
+      rows=10,
+      offset=20,
       sum(a_i),
       sum(a_f),
       min(a_i),
@@ -179,6 +184,7 @@ facet(collection1,
 ----
 
 The example above shows a facet function with rollups over three buckets, where the buckets are returned in descending order by bucket value.
+The `rows` parameter returns 10 rows and the `offset` parameter starts returning rows at offset 20 in the result set.
 
 == features
 

http://git-wip-us.apache.org/repos/asf/lucene-solr/blob/531b1663/solr/solr-ref-guide/src/vectorization.adoc
----------------------------------------------------------------------
diff --git a/solr/solr-ref-guide/src/vectorization.adoc b/solr/solr-ref-guide/src/vectorization.adoc
index 5fdfadc..acd56ec 100644
--- a/solr/solr-ref-guide/src/vectorization.adoc
+++ b/solr/solr-ref-guide/src/vectorization.adoc
@@ -31,6 +31,12 @@ to vectorize and analyze the results sets.
 
 Below are some of the key stream sources:
 
+* *`facet`*: Multi-dimensional aggregations are a powerful tool for generating
+co-occurrence counts for categorical data. The `facet` function uses the JSON facet API
+under the covers to provide fast, distributed, multi-dimension aggregations. With math expressions,
+the aggregated results can be pivoted into a co-occurrence matrix, which can be mined for
+correlations and hidden similarities within the data.
+
 * *`random`*: Random sampling is widely used in statistics, probability and machine learning.
 The `random` function returns a random sample of search results that match a
 query. The random samples can be vectorized and operated on by math expressions and the results
@@ -242,6 +248,80 @@ When this expression is sent to the `/stream` handler it responds with:
 }
 ----
 
+== Facet Co-Occurrence Matrices
+
+The `facet` function can be used to quickly perform multi-dimension aggregations of categorical data from
+records stored in a SolrCloud collection. These multi-dimension aggregations can represent co-occurrence
+counts for the values in the dimensions. The `pivot` function can be used to move two-dimensional
+aggregations into a co-occurrence matrix. The co-occurrence matrix can then be clustered or analyzed for
+correlations to learn about the hidden connections within the data.
+
+In the example below, the `facet` expression is used to generate a two-dimensional faceted aggregation.
+The first dimension is the US State that a car was purchased in and the second dimension is the car model.
+The two-dimensional facet generates the co-occurrence counts for the number of times a particular car model
+was purchased in a particular state.
+
+
+[source,text]
+----
+facet(collection1, q="*:*", buckets="state, model", bucketSorts="count(*) desc", rows=5, count(*))
+----
+
+When this expression is sent to the `/stream` handler it responds with:
+
+[source,json]
+----
+{
+  "result-set": {
+    "docs": [
+      {
+        "state": "NY",
+        "model": "camry",
+        "count(*)": 13342
+      },
+      {
+        "state": "NJ",
+        "model": "accord",
+        "count(*)": 13002
+      },
+      {
+        "state": "NY",
+        "model": "civic",
+        "count(*)": 12901
+      },
+      {
+        "state": "CA",
+        "model": "focus",
+        "count(*)": 12892
+      },
+      {
+        "state": "TX",
+        "model": "f150",
+        "count(*)": 12871
+      },
+      {
+        "EOF": true,
+        "RESPONSE_TIME": 171
+      }
+    ]
+  }
+}
+----
+
+The `pivot` function can be used to move the facet results into a co-occurrence matrix. In the example below,
+the `pivot` function is used to create a matrix where the rows of the matrix are the US States (`state`) and the
+columns of the matrix are the car models (`model`). The values in the matrix are the co-occurrence counts (`count(*)`)
+from the facet results. Once the co-occurrence matrix has been created, the US States can be clustered
+by car model, or the matrix can be transposed and car models can be clustered by the US States
+where they were bought.
+
+[source,text]
+----
+let(a=facet(collection1, q="*:*", buckets="state, model", bucketSorts="count(*) desc", rows="-1", count(*)),
+    b=pivot(a, state, model, count(*)),
+    c=kmeans(b, 7))
+----
+
 == Latitude / Longitude Vectors
 
 The `latlonVectors` function wraps a list of tuples and parses a lat/lon location field into