Posted to commits@hivemall.apache.org by my...@apache.org on 2018/11/03 07:39:10 UTC

[2/2] incubator-hivemall git commit: Fixed term vector space tutorial

Fixed term vector space tutorial


Project: http://git-wip-us.apache.org/repos/asf/incubator-hivemall/repo
Commit: http://git-wip-us.apache.org/repos/asf/incubator-hivemall/commit/62a97798
Tree: http://git-wip-us.apache.org/repos/asf/incubator-hivemall/tree/62a97798
Diff: http://git-wip-us.apache.org/repos/asf/incubator-hivemall/diff/62a97798

Branch: refs/heads/master
Commit: 62a97798bbab688d0f24f5126c755c67209f31af
Parents: b97af4f
Author: Makoto Yui <my...@apache.org>
Authored: Sat Nov 3 16:38:47 2018 +0900
Committer: Makoto Yui <my...@apache.org>
Committed: Sat Nov 3 16:38:47 2018 +0900

----------------------------------------------------------------------
 docs/gitbook/SUMMARY.md                    |  5 +++--
 docs/gitbook/ft_engineering/bm25.md        | 24 +++++++++++++++++++++++-
 docs/gitbook/ft_engineering/term_vector.md |  3 +++
 3 files changed, 29 insertions(+), 3 deletions(-)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/incubator-hivemall/blob/62a97798/docs/gitbook/SUMMARY.md
----------------------------------------------------------------------
diff --git a/docs/gitbook/SUMMARY.md b/docs/gitbook/SUMMARY.md
index 3484bfb..31a0311 100644
--- a/docs/gitbook/SUMMARY.md
+++ b/docs/gitbook/SUMMARY.md
@@ -65,8 +65,9 @@
 * [Feature Transformation](ft_engineering/ft_trans.md)
     * [Feature vectorization](ft_engineering/vectorization.md)
     * [Quantify non-number features](ft_engineering/quantify.md)
-* [TF-IDF Calculation](ft_engineering/tfidf.md)
-* [BM25](ft_engineering/bm25.md)
+* [Term Vector Model](ft_engineering/term_vector.md)
+    * [TF-IDF Term Weighting](ft_engineering/tfidf.md)
+    * [Okapi BM25 Term Weighting](ft_engineering/bm25.md)
 
 ## Part IV - Evaluation
 

http://git-wip-us.apache.org/repos/asf/incubator-hivemall/blob/62a97798/docs/gitbook/ft_engineering/bm25.md
----------------------------------------------------------------------
diff --git a/docs/gitbook/ft_engineering/bm25.md b/docs/gitbook/ft_engineering/bm25.md
index 4ca029f..b70ecfe 100644
--- a/docs/gitbook/ft_engineering/bm25.md
+++ b/docs/gitbook/ft_engineering/bm25.md
@@ -139,7 +139,29 @@ from
 ;
 ```
 
-## Show important terms
+## Hyperparameters
+
+`bm25()`'s function signature and hyperparameters are as follows:
+
+```sql
+hive> select bm25();
+FAILED: SemanticException Line 1:7 Wrong arguments 'bm25':
+
+#arguments must be greater than or equal to 5: 0
+
+usage: bm25(double termFrequency, int docLength, double avgDocLength, int
+       numDocs, int numDocsWithTerm [, const string options]) - Return an
+       Okapi BM25 score in double [-b <arg>] [-d <arg>] [-k1 <arg>]
+       [-min_idf <arg>]
+ -b <arg>                   Hyperparameter with type double in range 0.0
+                            and 1.0 [default: 0.75]
+ -d,--delta <arg>           Hyperparameter delta of BM25+ [default: 0.0]
+ -k1 <arg>                  Hyperparameter with type double, usually in
+                            range 1.2 and 2.0 [default: 1.2]
+ -min_idf,--epsilon <arg>   Hyperparameter delta of BM25+ [default: 1e-8]
+```
+
+## Show important terms for each document
 
 ```sql
 select
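
Note on the hyperparameters listed in the bm25.md hunk above: the sketch below shows how `-k1`, `-b`, `-d` and `-min_idf` would typically enter an Okapi BM25 / BM25+ score. It follows the textbook formula and is not taken from the Hivemall source, so the exact IDF variant and logarithm base used by `bm25()` may differ. The function name `bm25_plus` and its parameter names are hypothetical. The help text describes `-min_idf` with the same wording as `-d`; the sketch assumes, from the option name only, that it is a lower bound applied to the IDF term.

```python
import math


def bm25_plus(tf, doc_len, avg_doc_len, num_docs, num_docs_with_term,
              k1=1.2, b=0.75, delta=0.0, min_idf=1e-8):
    """Textbook BM25 / BM25+ score for one (term, document) pair.

    Defaults mirror the values printed in the usage text; the formula
    itself is an assumption, not Hivemall's confirmed implementation.
    """
    # Robertson/Sparck Jones style IDF (natural log assumed).
    idf = math.log((num_docs - num_docs_with_term + 0.5)
                   / (num_docs_with_term + 0.5))
    # Assumed meaning of -min_idf: floor the IDF so that very common
    # terms do not produce a negative or zero weight.
    idf = max(idf, min_idf)

    # b controls how strongly the document length is normalized:
    # b = 0 disables normalization, b = 1 applies it fully.
    length_norm = 1.0 - b + b * (doc_len / avg_doc_len)

    # k1 controls how quickly the term-frequency contribution saturates.
    tf_part = tf * (k1 + 1.0) / (tf + k1 * length_norm)

    # delta > 0 gives the BM25+ variant; delta = 0 is plain BM25.
    return idf * (tf_part + delta)


if __name__ == "__main__":
    # Same term statistics, two document lengths: with b > 0 the
    # longer document receives the lower score.
    print(bm25_plus(2.0, 100, 100.0, 1000, 50))
    print(bm25_plus(2.0, 300, 100.0, 1000, 50))
```

In this sketch, raising `b` penalizes long documents more strongly, raising `k1` lets repeated occurrences of a term keep adding to the score for longer, and a positive `delta` guarantees a minimum contribution for any matching term regardless of document length.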

http://git-wip-us.apache.org/repos/asf/incubator-hivemall/blob/62a97798/docs/gitbook/ft_engineering/term_vector.md
----------------------------------------------------------------------
diff --git a/docs/gitbook/ft_engineering/term_vector.md b/docs/gitbook/ft_engineering/term_vector.md
new file mode 100644
index 0000000..ff8c61f
--- /dev/null
+++ b/docs/gitbook/ft_engineering/term_vector.md
@@ -0,0 +1,3 @@
+Term vector model or [Vector space model](https://en.wikipedia.org/wiki/Vector_space_model) is an algebraic model for representing text documents (and any objects, in general) as vectors of identifiers.
+
+It is used in information filtering, information retrieval, relevancy rankings, and machine learning.