Posted to commits@madlib.apache.org by ri...@apache.org on 2018/02/20 23:28:16 UTC

[3/3] madlib git commit: Doc: Updates for multiple modules

Doc: Updates for multiple modules

User doc updates for Term frequency, PageRank, Matrix Ops and Test-train


Project: http://git-wip-us.apache.org/repos/asf/madlib/repo
Commit: http://git-wip-us.apache.org/repos/asf/madlib/commit/7ffad038
Tree: http://git-wip-us.apache.org/repos/asf/madlib/tree/7ffad038
Diff: http://git-wip-us.apache.org/repos/asf/madlib/diff/7ffad038

Branch: refs/heads/master
Commit: 7ffad038878956ed5437dccbdf07fb371796a3c8
Parents: 90fcfab
Author: Frank McQuillan <fm...@pivotal.io>
Authored: Thu Feb 15 11:45:44 2018 -0800
Committer: Rahul Iyer <ri...@apache.org>
Committed: Tue Feb 20 15:27:39 2018 -0800

----------------------------------------------------------------------
 .../postgres/modules/graph/pagerank.sql_in      |   2 +-
 .../postgres/modules/linalg/matrix_ops.sql_in   |  98 +++---
 .../modules/sample/train_test_split.sql_in      |   3 +-
 .../modules/utilities/text_utilities.sql_in     | 342 +++++++++++--------
 4 files changed, 256 insertions(+), 189 deletions(-)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/madlib/blob/7ffad038/src/ports/postgres/modules/graph/pagerank.sql_in
----------------------------------------------------------------------
diff --git a/src/ports/postgres/modules/graph/pagerank.sql_in b/src/ports/postgres/modules/graph/pagerank.sql_in
index a6858bd..9b0ed2c 100644
--- a/src/ports/postgres/modules/graph/pagerank.sql_in
+++ b/src/ports/postgres/modules/graph/pagerank.sql_in
@@ -55,8 +55,8 @@ pagerank( vertex_table,
           vertex_id,
           edge_table,
           edge_args,
-          damping_factor,
           out_table,
+          damping_factor,
           max_iter,
           threshold,
           grouping_cols
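
Note on the hunk above: it only moves damping_factor so that it follows out_table in the documented pagerank() signature. For readers unfamiliar with the three tuning arguments, the following is a minimal Python sketch of power-iteration PageRank showing the role each one plays. It is not MADlib's implementation: the function name, the default values, and the even redistribution of rank from vertices with no outgoing edges are assumptions made for illustration.

```python
def pagerank_sketch(vertices, edges, damping_factor=0.85, max_iter=100,
                    threshold=1e-6):
    """Power-iteration PageRank over (src, dest) edge pairs.

    Defaults are illustrative assumptions, not values taken from the commit.
    """
    n = len(vertices)
    out_degree = {v: 0 for v in vertices}
    for src, _ in edges:
        out_degree[src] += 1

    rank = {v: 1.0 / n for v in vertices}
    for _ in range(max_iter):
        # Assumption: rank held by vertices without outgoing edges is
        # spread evenly over all vertices so the ranks keep summing to 1.
        dangling = sum(rank[v] for v in vertices if out_degree[v] == 0)
        base = (1.0 - damping_factor) / n + damping_factor * dangling / n
        new_rank = {v: base for v in vertices}
        for src, dest in edges:
            new_rank[dest] += damping_factor * rank[src] / out_degree[src]

        # Stop early once no vertex changed by more than the threshold.
        delta = max(abs(new_rank[v] - rank[v]) for v in vertices)
        rank = new_rank
        if delta < threshold:
            break
    return rank


cycle_ranks = pagerank_sketch([0, 1, 2], [(0, 1), (1, 2), (2, 0)])
star_ranks = pagerank_sketch([0, 1, 2], [(1, 0), (2, 0)])
```

In the sketch, damping_factor weights rank that flows along edges against the uniform baseline, max_iter caps the number of passes, and threshold ends the loop early once the ranks stop changing.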

http://git-wip-us.apache.org/repos/asf/madlib/blob/7ffad038/src/ports/postgres/modules/linalg/matrix_ops.sql_in
----------------------------------------------------------------------
diff --git a/src/ports/postgres/modules/linalg/matrix_ops.sql_in b/src/ports/postgres/modules/linalg/matrix_ops.sql_in
index 2329ded..bd59826 100644
--- a/src/ports/postgres/modules/linalg/matrix_ops.sql_in
+++ b/src/ports/postgres/modules/linalg/matrix_ops.sql_in
@@ -17,7 +17,7 @@ m4_include(`SQLCommon.m4')
 <ul>
 <li class="level1"><a href="#description">Description</a></li>
 <li class="level1"><a href="#operations">Matrix Operations</a></li>
-<li class="level1"><a href="#glossary">Glossary of arguments</a></li>
+<li class="level1"><a href="#arguments">Arguments</a></li>
 <li class="level1"><a href="#examples">Examples</a></li>
 <li class="level1"><a href="#related">Related Topics</a></li>
 </ul>
@@ -29,7 +29,7 @@ m4_include(`SQLCommon.m4')
 This module provides a set of basic matrix operations for matrices that are
 too big to fit in memory. We provide two storage formats for a matrix:
 
-- Dense: The matrix is represented as a distributed collection of 1-D arrays.
+- <b>Dense</b>: The matrix is represented as a distributed collection of 1-D arrays.
 An example 3x10 matrix would be the table below:
 <pre>
  row_id |         row_vec
@@ -45,7 +45,7 @@ above)  provides each row as an array. <b>The <em>row</em> column should contain
 series of integers from 1 to <em>N</em> with no duplicates,
 where <em>N</em> is the row dimensionality</b>.
 
-- Sparse: The matrix is represented using the row and column indices for each
+- <b>Sparse</b>: The matrix is represented using the row and column indices for each
 non-zero entry of the matrix. This representation is useful for sparse matrices,
 containing multiple zero elements. Given below is an example of a sparse 4x7 matrix
 with just 6 out of 28 entries being non-zero. The dimensionality of the matrix is
@@ -92,7 +92,7 @@ including SMALLINT, INTEGER, BIGINT, DOUBLE PRECISION (FLOAT8), NUMERIC
 
 Below are the supported matrix operations. The meaning of the arguments
 and other terms are common to all functions and are provided at the end of the list
-in the glossary.
+in the glossary of arguments.
 
 - \b Representation
 <pre class="syntax">
@@ -190,7 +190,7 @@ operations applicable to smaller matrices since the calculation is not distribut
 -- Matrix generic inverse
 &nbsp; <b>matrix_pinv</b>( matrix_in, in_args, matrix_out, out_args)
 &nbsp;
--- Matrix eigen extraction
+-- Matrix eigenvalue extraction
 &nbsp; <b>matrix_eigen</b>( matrix_in, in_args, matrix_out, out_args)
 &nbsp;
 -- Matrix Cholesky decomposition
@@ -209,10 +209,10 @@ operations applicable to smaller matrices since the calculation is not distribut
 &nbsp; <b>matrix_rank</b>( matrix_in, in_args)
 </pre>
 
-@anchor glossary
-\b Glossary
+@anchor arguments
+\b Arguments
 
-The table below provides a glossary of the terms used in the matrix operations.
+The table below provides a glossary of the arguments used in the matrix operations.
 
 <dl class="arglist">
 <dt>matrix_in, matrix_a, matrix_b</dt>
@@ -371,12 +371,12 @@ Other supported norms for this string argument:
 </dl>
 
 @anchor examples
-@par Examples
+@par Examples (Dense Format)
 
 Here are some examples of matrix operations in dense format.
 Later on this page we will show examples of matrix operations in sparse format.
 
-- First let’s create example data tables in dense format:
+-# First let’s create example data tables in dense format:
 <pre class="syntax">
 CREATE TABLE "mat_A" (
     	row_id integer,
@@ -409,7 +409,7 @@ INSERT INTO "mat_B" (row_id, vector) VALUES (9, '{6,5,1,7,2,7,10,6,0,6}');
 INSERT INTO "mat_B" (row_id, vector) VALUES (10, '{1,4,4,4,8,5,2,8,5,5}');
 </pre>
 
-- Transpose a matrix
+-# Transpose a matrix
 <pre class="syntax">
 SELECT madlib.matrix_trans('"mat_B"', 'row=row_id, val=vector',
                            'mat_r');
@@ -432,7 +432,7 @@ SELECT * FROM mat_r ORDER BY row_id;
 (10 rows)
 </pre>
 
-- Extract main diagonal of a matrix
+-# Extract main diagonal of a matrix
 <pre class="syntax">
 SELECT madlib.matrix_extract_diag('"mat_B"', 'row=row_id, val=vector');
 </pre>
@@ -444,7 +444,7 @@ SELECT madlib.matrix_extract_diag('"mat_B"', 'row=row_id, val=vector');
 (1 row)
 </pre>
 
-- Add two matrices
+-# Add two matrices
 <pre class="syntax">
 SELECT madlib.matrix_add('"mat_A"', 'row=row_id, val=row_vec',
                          '"mat_B"', 'row=row_id, val=vector',
@@ -467,7 +467,7 @@ SELECT * FROM mat_r ORDER BY row_id;
 (10 rows)
 </pre>
 
-- Multiply two matrices
+-# Multiply two matrices
 <pre class="syntax">
 SELECT madlib.matrix_mult('"mat_A"', 'row=row_id, val=row_vec',
                           '"mat_B"', 'row=row_id, val=vector, trans=true',
@@ -490,7 +490,7 @@ SELECT * FROM mat_r ORDER BY row_id;
 (10 rows)
 </pre>
 
-- Create a diagonal matrix
+-# Create a diagonal matrix
 <pre class="syntax">
 SELECT madlib.matrix_diag(array[9,6,3,10],
                           'mat_r', 'row=row_id, col=col_id, val=val');
@@ -506,7 +506,7 @@ SELECT * FROM mat_r ORDER BY row_id::bigint;
 (11 rows)
 </pre>
 
-- Create an identity matrix
+-# Create an identity matrix
 <pre class="syntax">
 SELECT madlib.matrix_identity(4, 'mat_r', 'row=row_id,col=col_id,val=val');
 SELECT * FROM mat_r ORDER BY row_id;
@@ -521,7 +521,7 @@ SELECT * FROM mat_r ORDER BY row_id;
 (5 rows)
 </pre>
 
-- Extract row and column from a matrix by specifying index
+-# Extract row and column from a matrix by specifying index
 <pre class="syntax">
 SELECT madlib.matrix_extract_row('"mat_A"', 'row=row_id, val=row_vec', 2) as row,
        madlib.matrix_extract_col('"mat_A"', 'row=row_id, val=row_vec', 3) as col;
@@ -533,7 +533,7 @@ SELECT madlib.matrix_extract_row('"mat_A"', 'row=row_id, val=row_vec', 2) as row
 (1 rows)
 </pre>
 
-- Get min and max values along a specific dimension, as well as the corresponding indices.
+-# Get min and max values along a specific dimension, as well as the corresponding indices.
 Note that in this example <em>dim=2</em> implies that the min and max is computed on each row,
 returning a column vector i.e. the column (dim=2) is flattened.
 <pre class="syntax">
@@ -554,7 +554,7 @@ SELECT * from mat_min_r;
 (1 rows)
 </pre>
 
-- Initialize matrix with zeros in sparse format
+-# Initialize matrix with zeros in sparse format
 <pre class="syntax">
 SELECT madlib.matrix_zeros(5, 4, 'mat_r', 'row=row_id, col=col_id, val=entry');
 SELECT * FROM mat_r;
@@ -566,7 +566,7 @@ SELECT * FROM mat_r;
 (1 rows)
 </pre>
 
-- Initialize matrix with zeros in dense format
+-# Initialize matrix with zeros in dense format
 </pre>
 <pre class="syntax">
 SELECT madlib.matrix_zeros(5, 4, 'mat_r', 'fmt=dense');
@@ -583,7 +583,7 @@ SELECT * FROM mat_r ORDER BY row;
 (5 rows)
 </pre>
 
-- Initialize matrix with ones
+-# Initialize matrix with ones
 </pre>
 <pre class="syntax">
 SELECT madlib.matrix_ones(5, 4, 'mat_r', 'row=row,col=col, val=val');
@@ -615,7 +615,7 @@ SELECT * FROM mat_r;
 (20 rows)
 </pre>
 
-- Initialize matrix with ones in dense format
+-# Initialize matrix with ones in dense format
 </pre>
 <pre class="syntax">
 SELECT madlib.matrix_ones(5, 4, 'mat_r', 'fmt=dense');
@@ -632,7 +632,7 @@ SELECT * FROM mat_r ORDER BY row;
 (5 rows)
 </pre>
 
-- Element-wise multiplication between two matrices
+-# Element-wise multiplication between two matrices
 <pre class="syntax">
 SELECT madlib.matrix_elem_mult('"mat_A"', 'row=row_id, val=row_vec',
                                '"mat_B"', 'row=row_id, val=vector',
@@ -654,7 +654,7 @@ SELECT * FROM mat_r ORDER BY row_id;
      10 | {4,24,12,8,48,20,2,16,15,40}
 </pre>
 
-- Get sum values along a dimension. In this example, the sum is
+-# Get sum values along a dimension. In this example, the sum is
 computed for each row (i.e. column is flattened since dim=2).
 <pre class="syntax">
 SELECT madlib.matrix_sum('"mat_A"', 'row=row_id, val=row_vec', 2);
@@ -666,7 +666,7 @@ SELECT madlib.matrix_sum('"mat_A"', 'row=row_id, val=row_vec', 2);
 (1 rows)
 </pre>
 
-- Get mean values along dimension
+-# Get mean values along dimension
 <pre class="syntax">
 SELECT madlib.matrix_mean('"mat_A"', 'row=row_id, val=row_vec', 2);
 </pre>
@@ -677,7 +677,7 @@ SELECT madlib.matrix_mean('"mat_A"', 'row=row_id, val=row_vec', 2);
 (1 rows)
 </pre>
 
-- Compute matrix norm.  In this example, we ask for the Euclidean norm:
+-# Compute matrix norm.  In this example, we ask for the Euclidean norm:
 <pre class="syntax">
 SELECT madlib.matrix_norm('"mat_A"', 'row=row_id, val=row_vec', '2');
 </pre>
@@ -688,7 +688,7 @@ SELECT madlib.matrix_norm('"mat_A"', 'row=row_id, val=row_vec', '2');
 (1 row)
 </pre>
 
-- Multiply matrix with scalar
+-# Multiply matrix with scalar
 <pre class="syntax">
 SELECT madlib.matrix_scalar_mult('"mat_A"', 'row=row_id, val=row_vec', 3, 'mat_r');
 SELECT * FROM mat_r ORDER BY row_id;
@@ -709,7 +709,7 @@ SELECT * FROM mat_r ORDER BY row_id;
 (10 rows)
 </pre>
 
-- Get the row dimension and column dimension of matrix
+-# Get the row dimension and column dimension of matrix
 <pre class="syntax">
 SELECT madlib.matrix_ndims('"mat_A"', 'row=row_id, val=row_vec');
 </pre>
@@ -720,7 +720,7 @@ SELECT madlib.matrix_ndims('"mat_A"', 'row=row_id, val=row_vec');
 (1 row)
 </pre>
 
-- Multiply matrix with vector
+-# Multiply matrix with vector
 <pre class="syntax">
 SELECT madlib.matrix_vec_mult('"mat_A"', 'row=row_id, val=row_vec',
                               array[1,2,3,4,5,6,7,8,9,10]);
@@ -732,25 +732,25 @@ SELECT madlib.matrix_vec_mult('"mat_A"', 'row=row_id, val=row_vec',
 (10 rows)
 </pre>
 
-- Inverse of matrix
+-# Inverse of matrix
 <pre class="syntax">
 SELECT madlib.matrix_inverse('"mat_A"', 'row=row_id, val=row_vec', 'mat_r');
 SELECT row_vec FROM mat_r ORDER BY row_id;
 </pre>
 
-- Generic inverse of matrix
+-# Generic inverse of matrix
 <pre class="syntax">
 SELECT madlib.matrix_pinv('"mat_A"', 'row=row_id, val=row_vec', 'mat_r');
 SELECT row_vec FROM mat_r ORDER BY row_id;
 </pre>
 
-- Eigen values of matrix (note default column name of eigenvalues)
+-# Eigenvalues of matrix (note default column name of eigenvalues)
 <pre class="syntax">
 SELECT madlib.matrix_eigen('"mat_A"', 'row=row_id, val=row_vec', 'mat_r');
 SELECT eigen_values FROM mat_r ORDER BY row_id;
 </pre>
 
-- Cholesky decomposition of matrix
+-# Cholesky decomposition of matrix
 <pre class="syntax">
 SELECT madlib.matrix_cholesky('"mat_A"', 'row=row_id, val=row_vec', 'matrix_out_prefix');
 SELECT row_vec FROM matrix_out_prefix_p ORDER BY row_id;
@@ -758,14 +758,14 @@ SELECT row_vec FROM matrix_out_prefix_l ORDER BY row_id;
 SELECT row_vec FROM matrix_out_prefix_d ORDER BY row_id;
 </pre>
 
-- QR decomposition of matrix
+-# QR decomposition of matrix
 <pre class="syntax">
 SELECT madlib.matrix_qr('"mat_A"', 'row=row_id, val=row_vec', 'matrix_out_prefix');
 SELECT row_vec FROM matrix_out_prefix_q ORDER BY row_id;
 SELECT row_vec FROM matrix_out_prefix_r ORDER BY row_id;
 </pre>
 
-- LU decomposition of matrix
+-# LU decomposition of matrix
 <pre class="syntax">
 SELECT madlib.matrix_lu('"mat_A"', 'row=row_id, val=row_vec', 'matrix_out_prefix');
 SELECT row_vec FROM matrix_out_prefix_l ORDER BY row_id;
@@ -774,7 +774,7 @@ SELECT row_vec FROM matrix_out_prefix_p ORDER BY row_id;
 SELECT row_vec FROM matrix_out_prefix_q ORDER BY row_id;
 </pre>
 
-- Nuclear norm of matrix
+-# Nuclear norm of matrix
 <pre class="syntax">
 SELECT madlib.matrix_nuclear_norm('"mat_A"', 'row=row_id, val=row_vec');
 </pre>
@@ -785,7 +785,7 @@ SELECT madlib.matrix_nuclear_norm('"mat_A"', 'row=row_id, val=row_vec');
 (1 row)
 </pre>
 
-- Rank of matrix
+-# Rank of matrix
 <pre class="syntax">
 SELECT madlib.matrix_rank('"mat_A"', 'row=row_id, val=row_vec');
 </pre>
@@ -796,16 +796,18 @@ SELECT madlib.matrix_rank('"mat_A"', 'row=row_id, val=row_vec');
 (1 row)
 </pre>
 
+@par Examples (Sparse Format)
+
 Below are some examples of matrix operations in sparse format.
 
-- Convert a matrix from dense to sparse format
+-# Convert a matrix from dense to sparse format
 <pre class="syntax">
 SELECT madlib.matrix_sparsify('"mat_B"', 'row=row_id, val=vector',
                               '"mat_B_sparse"', 'col=col_id, val=val');
 SELECT * FROM "mat_B_sparse" ORDER BY row_id, col_id;
 </pre>
 
-- Create a matrix in sparse format.
+-# Create a matrix in sparse format.
 <pre class="syntax">
 CREATE TABLE "mat_A_sparse" (
     "rowNum" integer,
@@ -830,7 +832,7 @@ INSERT INTO "mat_A_sparse" ("rowNum", col_num, entry) VALUES (9, 2, 3);
 INSERT INTO "mat_A_sparse" ("rowNum", col_num, entry) VALUES (10, 10, 0);
 </pre>
 
-- Get the row_dims and col_dims of a matrix in sparse format
+-# Get the row_dims and col_dims of a matrix in sparse format
 <pre class="syntax">
 SELECT madlib.matrix_ndims('"mat_A_sparse"', 'row="rowNum", val=entry')
 </pre>
@@ -841,7 +843,7 @@ SELECT madlib.matrix_ndims('"mat_A_sparse"', 'row="rowNum", val=entry')
 (1 row)
 </pre>
 
-- Transpose a matrix in sparse format
+-# Transpose a matrix in sparse format
 <pre class="syntax">
 -- Note the double quotes for "rowNum" are required as per PostgreSQL rules since “N” is capitalized
 SELECT madlib.matrix_trans('"mat_A_sparse"', 'row="rowNum", val=entry',
@@ -870,7 +872,7 @@ SELECT "rowNum", col_num, entry FROM matrix_r_sparse ORDER BY col_num;
 (16 rows)
 </pre>
 
-- Main diagonal of a matrix in sparse format
+-# Main diagonal of a matrix in sparse format
 <pre class="syntax">
 SELECT madlib.matrix_extract_diag('"mat_A_sparse"', 'row="rowNum", val=entry');
 </pre>
@@ -881,7 +883,7 @@ SELECT madlib.matrix_extract_diag('"mat_A_sparse"', 'row="rowNum", val=entry');
 (1 row)
 </pre>
 
-- Add two sparse matrices then convert to dense format
+-# Add two sparse matrices then convert to dense format
 <pre class="syntax">
 SELECT madlib.matrix_add('"mat_A_sparse"', 'row="rowNum", val=entry',
                          '"mat_B_sparse"', 'row=row_id, col=col_id, val=val',
@@ -906,7 +908,7 @@ SELECT * FROM matrix_r ORDER BY "rowNum";
 (10 rows)
 </pre>
 
-- Multiply two sparse matrices
+-# Multiply two sparse matrices
 <pre class="syntax">
 SELECT madlib.matrix_mult('"mat_A_sparse"', 'row="rowNum", col=col_num, val=entry',
                           '"mat_B_sparse"', 'row=row_id, col=col_id, val=val, trans=true',
@@ -929,7 +931,7 @@ SELECT * FROM matrix_r ORDER BY "rowNum";
 (10 rows)
 </pre>
 
-- Initialize matrix with ones
+-# Initialize matrix with ones
 </pre>
 <pre class="syntax">
 SELECT madlib.matrix_ones(5, 4, 'mat_r', 'row=row,col=col, val=val');
@@ -961,7 +963,7 @@ SELECT * FROM mat_r ORDER BY row, col;
 (20 rows)
 </pre>
 
-- Initialize matrix with zeros in sparse format
+-# Initialize matrix with zeros in sparse format
 </pre>
 <pre class="syntax">
 SELECT madlib.matrix_zeros(5, 4, 'mat_r', 'row=row_id, col=col_id, val=entry');
@@ -974,7 +976,7 @@ SELECT * FROM mat_r;
 (1 rows)
 </pre>
 
-- Compute matrix norm on sparse matrix.  In this example, we ask for the Euclidean norm:
+-# Compute matrix norm on sparse matrix.  In this example, we ask for the Euclidean norm:
 <pre class="syntax">
 SELECT madlib.matrix_norm('"mat_A_sparse"', 'row="rowNum", col=col_num, val=entry', '2');
 </pre>
@@ -2696,7 +2698,7 @@ $$ LANGUAGE plpgsql IMMUTABLE
 m4_ifdef(`__HAS_FUNCTION_PROPERTIES__', `CONTAINS SQL', `');
 
 /**
- * @brief Calculate eigen values of matrix. It requires that all the values are NON-NULL.
+ * @brief Calculate eigenvalues of matrix. It requires that all the values are NON-NULL.
  *
  * @param matrix_in Name of the table containing the input matrix
  * @param matrix_out Name of the table where to output the result matrix
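
Note on the matrix_ops documentation above: the two storage formats it describes can be sketched in a few lines of Python. This is an illustration only, not how matrix_sparsify is implemented. The function names are hypothetical, indices are 1-based as the documentation requires for the row column, and keeping an explicit zero at the bottom-right corner (mirroring the (10, 10, 0) row in the "mat_A_sparse" example data) is an assumption about how dimensions are recovered from the largest indices.

```python
def dense_to_sparse(dense_rows):
    """Convert [(row_id, row_vec), ...] into (row, col, val) triples."""
    n_rows = max(row_id for row_id, _ in dense_rows)
    n_cols = len(dense_rows[0][1])
    triples = []
    for row_id, row_vec in sorted(dense_rows):
        for col_id, val in enumerate(row_vec, start=1):
            if val != 0:
                triples.append((row_id, col_id, val))
    # Assumption: keep one explicit zero at (n_rows, n_cols) so the
    # dimensions can be read back from the largest indices.
    if not triples or triples[-1][:2] != (n_rows, n_cols):
        triples.append((n_rows, n_cols, 0))
    return triples


def sparse_to_dense(triples):
    """Rebuild [(row_id, row_vec), ...] from (row, col, val) triples."""
    n_rows = max(r for r, _, _ in triples)
    n_cols = max(c for _, c, _ in triples)
    rows = [[0] * n_cols for _ in range(n_rows)]
    for r, c, v in triples:
        rows[r - 1][c - 1] = v
    return [(i + 1, row) for i, row in enumerate(rows)]


dense = [(1, [0, 2, 0]), (2, [0, 0, 0]), (3, [5, 0, 0])]
sparse = dense_to_sparse(dense)
round_trip = sparse_to_dense(sparse)
```

Here a 3x3 matrix with two non-zero entries is stored as three triples instead of nine values, and converting back yields the original dense rows.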

http://git-wip-us.apache.org/repos/asf/madlib/blob/7ffad038/src/ports/postgres/modules/sample/train_test_split.sql_in
----------------------------------------------------------------------
diff --git a/src/ports/postgres/modules/sample/train_test_split.sql_in b/src/ports/postgres/modules/sample/train_test_split.sql_in
index d5d0913..2f40015 100644
--- a/src/ports/postgres/modules/sample/train_test_split.sql_in
+++ b/src/ports/postgres/modules/sample/train_test_split.sql_in
@@ -56,7 +56,8 @@ train_test_split(   source_table,
                     test_proportion,
                     grouping_cols,
                     target_cols,
-                    with_replacement
+                    with_replacement,
+                    separate_output_tables
                 )
 </pre>
 

http://git-wip-us.apache.org/repos/asf/madlib/blob/7ffad038/src/ports/postgres/modules/utilities/text_utilities.sql_in
----------------------------------------------------------------------
diff --git a/src/ports/postgres/modules/utilities/text_utilities.sql_in b/src/ports/postgres/modules/utilities/text_utilities.sql_in
index 5e039e6..970dba7 100644
--- a/src/ports/postgres/modules/utilities/text_utilities.sql_in
+++ b/src/ports/postgres/modules/utilities/text_utilities.sql_in
@@ -16,21 +16,24 @@ m4_include(`SQLCommon.m4')
 
 <div class="toc"><b>Contents</b>
   <ul>
-    <li><a href="#term_frequency">Term Frequency</a></li>
+    <li><a href="#function_syntax">Function Syntax</a></li>
     <li><a href="#examples">Examples</a></li>
-    <li><a href="#rel;ated">Related Topics</a></li>
+    <li><a href="#related">Related Topics</a></li>
   </ul>
 </div>
 
-@brief Provides a collection of user-defined functions for performing common
-tasks related to text.
+@brief Provides a collection of functions for performing common
+tasks related to text analytics.
 
-@anchor term_frequency
-@par Term frequency
-    Term frequency \c tf(t,d) is to the raw frequency of a word/term in a document,
-    i.e. the number of times that word/term \c t occurs in document \c d. For
-    this function, 'word' and 'term' are used interchangeably.
-    <b>Note:</b> the term frequency is not normalized by the document length.
+Term frequency computes the number of times that a word 
+or term occurs in a document.  Term frequency is often 
+used as part of a larger text processing pipeline, which may 
+include operations such as stemming, stop word removal
+and topic modelling.
+
+@anchor function_syntax
+@par Function Syntax
+ 
 <pre class="syntax">
     term_frequency(input_table,
                    doc_id_col,
@@ -43,206 +46,267 @@ tasks related to text.
 <dl class="arglist">
     <dt>input_table</dt>
     <dd>TEXT.
-    The name of the table storing the documents.
-    Each row is in the form &lt;doc_id, word_vector&gt; where \c doc_id is an id,
+    The name of the table containing the documents, with one
+    document per row.
+    Each row is in the form &lt;doc_id, word_vector&gt; where \c doc_id is an id
     unique to each document, and  \c word_vector is a text array containing the
     words in the document. The \c word_vector should contain multiple entries of
     a word if the document contains multiple occurrences of that word.
     </dd>
 
-    <dt>id_col</dt>
+    <dt>doc_id_col</dt>
     <dd>TEXT.
     The name of the column containing the document id. </dd>
 
     <dt>word_col</dt>
     <dd>TEXT.
     The name of the column containing the vector of words/terms in the
-    document. This column should of type that can be cast to TEXT[].</dd>
+    document. This column should be of a type that can be cast to TEXT[].</dd>
 
     <dt>output_table</dt>
     <dd>TEXT.
     The name of the table to store the term frequency output.
     The output table contains the following columns:
-        - \c id_col: This the document id column (same as the one provided as input).
-        - \c word: A word/term present in a document. This is either the original
-        word present in \c word_col or an id representing the word (depending on the value of compute_vocab below).
+        - \c doc_id_col: This is the document id column
+        (name will be same as the one provided as input).
+        - \c word: Word/term present in a document. Depending on the value 
+        of \c compute_vocab below, this is either the original
+        word as it appears in \c word_col, or an id representing the word.
+        Note that word ids start from 0, not 1.
         - \c count: The number of times this word is found in the document.
     </dd>
 
     <dt>compute_vocab</dt>
     <dd>BOOLEAN. (Optional, Default=FALSE)
-    Flag to indicate if a vocabulary is to be created. If TRUE, an additional
+    Flag to indicate if a vocabulary table is to be created. If TRUE, an additional
     output table is created containing the vocabulary of all words, with an id
-    assigned to each word. The table is called <em>output_table</em>_vocabulary
-    (suffix added to the <em>output_table</em> name) and contains the
+    assigned to each word in alphabetical order. 
+    The table is called <em>output_table</em>_vocabulary
+    (i.e., suffix added to the <em>output_table</em> name) and contains the
     following columns:
-        - \c wordid: An id assignment for each word
-        - \c word: The word/term
+        - \c wordid: An id for each word in alphabetical order.
+        - \c word: The word/term corresponding to the id.
     </dd>
 </dl>
 
 @anchor examples
 @par Examples
 
--# Prepare datasets with some example documents
+-# First we create a document table with one document per row:
 <pre class="example">
 DROP TABLE IF EXISTS documents;
-CREATE TABLE documents(docid INTEGER, doc_contents TEXT);
+CREATE TABLE documents(docid INT4, contents TEXT);
 INSERT INTO documents VALUES
-(1, 'I like to eat broccoli and banana. I ate a banana and spinach smoothie for breakfast.'),
-(2, 'Chinchillas and kittens are cute.'),
-(3, 'My sister adopted two kittens yesterday'),
-(4, 'Look at this cute hamster munching on a piece of broccoli');
+(0, 'I like to eat broccoli and bananas. I ate a banana and spinach smoothie for breakfast.'),
+(1, 'Chinchillas and kittens are cute.'),
+(2, 'My sister adopted two kittens yesterday.'),
+(3, 'Look at this cute hamster munching on a piece of broccoli.');
 </pre>
-
--# Add a new column containing the words (lower-cased) in a text array
+You can apply stemming, stop word removal and tokenization at this point 
+in order to prepare the documents for text processing. 
+Depending upon your database version, various tools are 
+available. Databases based on more recent versions of 
+PostgreSQL may do something like:
+<pre class="example">
+SELECT tsvector_to_array(to_tsvector('english',contents)) from documents;
+</pre>
+<pre class="result">
+                    tsvector_to_array                     
++----------------------------------------------------------
+ {ate,banana,breakfast,broccoli,eat,like,smoothi,spinach}
+ {chinchilla,cute,kitten}
+ {adopt,kitten,sister,two,yesterday}
+ {broccoli,cute,hamster,look,munch,piec}
+(4 rows)
+</pre>
+In this example, we assume a database based on an older 
+version of PostgreSQL and just perform basic punctuation 
+removal and tokenization. The array of words is added as 
+a new column to the documents table:
 <pre class="example">
 ALTER TABLE documents ADD COLUMN words TEXT[];
-UPDATE documents SET words = regexp_split_to_array(lower(doc_contents), E'[\\\\s+\\\\.]');
+UPDATE documents SET words = 
+    regexp_split_to_array(lower(
+    regexp_replace(contents, E'[,.;\\']','', 'g')
+    ), E'[\\\\s+]');
+\\x on   
+SELECT * FROM documents ORDER BY docid;
+</pre>
+<pre class="result">
+-[ RECORD 1 ]------------------------------------------------------------------------------------
+docid    | 0
+contents | I like to eat broccoli and bananas. I ate a banana and spinach smoothie for breakfast.
+words    | {i,like,to,eat,broccoli,and,bananas,i,ate,a,banana,and,spinach,smoothie,for,breakfast}
+-[ RECORD 2 ]------------------------------------------------------------------------------------
+docid    | 1
+contents | Chinchillas and kittens are cute.
+words    | {chinchillas,and,kittens,are,cute}
+-[ RECORD 3 ]------------------------------------------------------------------------------------
+docid    | 2
+contents | My sister adopted two kittens yesterday.
+words    | {my,sister,adopted,two,kittens,yesterday}
+-[ RECORD 4 ]------------------------------------------------------------------------------------
+docid    | 3
+contents | Look at this cute hamster munching on a piece of broccoli.
+words    | {look,at,this,cute,hamster,munching,on,a,piece,of,broccoli}
 </pre>
 
--# Compute the frequency of each word in each document
+-# Compute the frequency of each word in each document:
 <pre class="example">
-DROP TABLE IF EXISTS documents_tf;
-SELECT madlib.term_frequency('documents', 'docid', 'words', 'documents_tf');
-SELECT * FROM documents_tf order by docid;
+DROP TABLE IF EXISTS documents_tf, documents_tf_vocabulary;
+SELECT madlib.term_frequency('documents',    -- input table
+                             'docid',        -- document id column
+                             'words',        -- vector of words in document
+                             'documents_tf'  -- output table
+                            );
+\\x off
+SELECT * FROM documents_tf ORDER BY docid;
 </pre>
 <pre class="result">
- docid |    word    | count
--------+------------+-------
-     1 | ate        |     1
-     1 | like       |     1
-     1 | breakfast  |     1
-     1 | to         |     1
-     1 | broccoli   |     1
-     1 | spinach    |     1
-     1 | i          |     2
-     1 | and        |     2
-     1 | a          |     1
-     1 |            |     2
-     1 | smoothie   |     1
-     1 | eat        |     1
-     1 | banana     |     2
-     1 | for        |     1
-     2 | cute       |     1
-     2 | are        |     1
-     2 | kitten     |     1
-     2 | and        |     1
-     2 | chinchilla |     1
-     3 | kitten     |     1
-     3 | my         |     1
-     3 | a          |     1
-     3 | sister     |     1
-     3 | adopted    |     1
-     3 | yesterday  |     1
-     4 | at         |     1
-     4 | of         |     1
-     4 | piece      |     1
-     4 | this       |     1
-     4 | a          |     1
-     4 | broccoli   |     1
-     4 | hamster    |     1
-     4 | munching   |     1
-     4 | cute       |     1
-     4 | look       |     1
-(35 rows)
+ docid |    word     | count 
+-------+-------------+-------
+     0 | a           |     1
+     0 | breakfast   |     1
+     0 | banana      |     1
+     0 | and         |     2
+     0 | eat         |     1
+     0 | smoothie    |     1
+     0 | to          |     1
+     0 | like        |     1
+     0 | broccoli    |     1
+     0 | bananas     |     1
+     0 | spinach     |     1
+     0 | i           |     2
+     0 | ate         |     1
+     0 | for         |     1
+     1 | are         |     1
+     1 | cute        |     1
+     1 | kittens     |     1
+     1 | chinchillas |     1
+     1 | and         |     1
+     2 | two         |     1
+     2 | yesterday   |     1
+     2 | kittens     |     1
+     2 | sister      |     1
+     2 | my          |     1
+     2 | adopted     |     1
+     3 | this        |     1
+     3 | at          |     1
+     3 | a           |     1
+     3 | broccoli    |     1
+     3 | of          |     1
+     3 | look        |     1
+     3 | hamster     |     1
+     3 | on          |     1
+     3 | piece       |     1
+     3 | cute        |     1
+     3 | munching    |     1
+(36 rows)
 </pre>
 
--# We also can create a vocabulary of the words and store a wordid in the output
-table instead of the actual word.
+-# Next we create a vocabulary of the words 
+and store a wordid in the output table instead of the 
+actual word:
 <pre class="example">
-DROP TABLE IF EXISTS documents_tf;
-DROP TABLE IF EXISTS documents_tf_vocabulary;
-SELECT madlib.term_frequency('documents', 'docid', 'words', 'documents_tf', TRUE);
--- Output with wordid instead of the actual words
-SELECT * FROM documents_tf order by docid;
+DROP TABLE IF EXISTS documents_tf, documents_tf_vocabulary;
+SELECT madlib.term_frequency('documents',    -- input table
+                             'docid',        -- document id column
+                             'words',        -- vector of words in document
+                             'documents_tf', -- output table
+                             TRUE            -- compute vocabulary
+                            );
+SELECT * FROM documents_tf ORDER BY docid;
 </pre>
 \nbsp
 <pre class="result">
- docid | wordid | count
+ docid | wordid | count 
 -------+--------+-------
-     1 |      0 |     2
-     1 |      1 |     1
-     1 |      3 |     2
-     1 |      6 |     1
-     1 |      7 |     2
-     1 |      8 |     1
-     1 |      9 |     1
-     1 |     12 |     1
-     1 |     13 |     1
-     1 |     15 |     2
-     1 |     17 |     1
-     1 |     24 |     1
-     1 |     25 |     1
-     1 |     27 |     1
+     0 |     17 |     1
+     0 |      9 |     1
+     0 |     25 |     1
+     0 |     12 |     1
+     0 |     13 |     1
+     0 |     15 |     2
+     0 |      0 |     1
+     0 |      2 |     2
+     0 |     28 |     1
+     0 |      5 |     1
+     0 |      6 |     1
+     0 |      7 |     1
+     0 |      8 |     1
+     0 |     26 |     1
+     1 |     16 |     1
+     1 |     11 |     1
+     1 |     10 |     1
+     1 |      2 |     1
+     1 |      3 |     1
+     2 |     30 |     1
+     2 |      1 |     1
      2 |     16 |     1
-     2 |      3 |     1
-     2 |      4 |     1
-     2 |     10 |     1
-     2 |     11 |     1
-     3 |      1 |     1
-     3 |     16 |     1
-     3 |     28 |     1
+     2 |     20 |     1
+     2 |     24 |     1
+     2 |     29 |     1
+     3 |      4 |     1
+     3 |     21 |     1
+     3 |     22 |     1
      3 |     23 |     1
-     3 |      2 |     1
-     3 |     20 |     1
-     4 |      9 |     1
-     4 |     11 |     1
-     4 |     22 |     1
-     4 |     14 |     1
-     4 |     26 |     1
-     4 |      1 |     1
-     4 |      5 |     1
-     4 |     18 |     1
-     4 |     19 |     1
-     4 |     21 |     1
-(35 rows)
+     3 |      0 |     1
+     3 |     11 |     1
+     3 |      9 |     1
+     3 |     27 |     1
+     3 |     14 |     1
+     3 |     18 |     1
+     3 |     19 |     1
+(36 rows)
 </pre>
 \nbsp
+Note above that wordids start
+at 0, not 1.  The vocabulary table maps wordid to the actual word:
 <pre class="example">
--- Vocabulary
-SELECT * FROM documents_tf_vocabulary order by wordid;
+SELECT * FROM documents_tf_vocabulary ORDER BY wordid;
 </pre>
 <pre class="result">
- wordid |    word
---------+------------
-      0 |
-      1 | a
-      2 | adopted
-      3 | and
-      4 | are
-      5 | at
-      6 | ate
-      7 | banana
+ wordid |    word     
+--------+-------------
+      0 | a
+      1 | adopted
+      2 | and
+      3 | are
+      4 | at
+      5 | ate
+      6 | banana
+      7 | bananas
       8 | breakfast
       9 | broccoli
-     10 | chinchilla
+     10 | chinchillas
      11 | cute
      12 | eat
      13 | for
      14 | hamster
      15 | i
-     16 | kitten
+     16 | kittens
      17 | like
      18 | look
      19 | munching
      20 | my
      21 | of
-     22 | piece
-     23 | sister
-     24 | smoothie
-     25 | spinach
-     26 | this
-     27 | to
-     28 | yesterday
-(29 rows)
+     22 | on
+     23 | piece
+     24 | sister
+     25 | smoothie
+     26 | spinach
+     27 | this
+     28 | to
+     29 | two
+     30 | yesterday
+(31 rows)
 </pre>
 
 @anchor related
 @par Related Topics
 
-File text_utilities.sql_in documenting the SQL functions.
-File utilities.sql_in documenting the utility functions for DB administration.
+See text_utilities.sql_in for the term frequency SQL function definition 
+and porter_stemmer.sql_in for the stemmer function.
 
 */
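
Note on the term frequency documentation above: the behaviour it describes (one count per word per document, plus an optional vocabulary whose ids follow alphabetical order and start at 0) can be reproduced outside the database with a short Python sketch. This is not the MADlib implementation. The names tokenize and term_frequency_sketch are hypothetical, the regular expressions mirror the regexp_replace and regexp_split_to_array calls in the example, and Python's default string ordering is assumed to match the database's alphabetical ordering for these lower-case ASCII words.

```python
import re
from collections import Counter


def tokenize(contents):
    """Strip , . ; ' then lower-case and split, as in the SQL example."""
    no_punct = re.sub(r"[,.;']", "", contents)
    return re.split(r"[\s+]", no_punct.lower())


def term_frequency_sketch(documents, compute_vocab=False):
    """documents: list of (doc_id, words). Returns (rows, vocabulary)."""
    vocabulary = None
    if compute_vocab:
        all_words = sorted({w for _, words in documents for w in words})
        # Word ids follow alphabetical order and start at 0, not 1.
        vocabulary = {word: wordid for wordid, word in enumerate(all_words)}
    rows = []
    for doc_id, words in documents:
        for word, count in sorted(Counter(words).items()):
            key = vocabulary[word] if compute_vocab else word
            rows.append((doc_id, key, count))
    return rows, vocabulary


contents = [
    (0, "I like to eat broccoli and bananas. "
        "I ate a banana and spinach smoothie for breakfast."),
    (1, "Chinchillas and kittens are cute."),
    (2, "My sister adopted two kittens yesterday."),
    (3, "Look at this cute hamster munching on a piece of broccoli."),
]
documents = [(doc_id, tokenize(text)) for doc_id, text in contents]
word_rows, _ = term_frequency_sketch(documents)
id_rows, vocabulary = term_frequency_sketch(documents, compute_vocab=True)
```

Run on the four example documents, the sketch yields the same 36 rows and 31 vocabulary entries shown in the result blocks above, although the row order within a document differs because the database output is only ordered by docid.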