Posted to commits@madlib.apache.org by ok...@apache.org on 2017/12/28 22:51:50 UTC
[16/51] [abbrv] [partial] madlib-site git commit: Additional updates for 1.13 release

http://git-wip-us.apache.org/repos/asf/madlib-site/blob/6c103d3e/docs/v1.13/group__grp__summary.html
----------------------------------------------------------------------
diff --git a/docs/v1.13/group__grp__summary.html b/docs/v1.13/group__grp__summary.html
new file mode 100644
index 0000000..8582ba0
--- /dev/null
+++ b/docs/v1.13/group__grp__summary.html
@@ -0,0 +1,465 @@
+<!-- HTML header for doxygen 1.8.4-->
+<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
+<html xmlns="http://www.w3.org/1999/xhtml">
+<head>
+<meta http-equiv="Content-Type" content="text/xhtml;charset=UTF-8"/>
+<meta http-equiv="X-UA-Compatible" content="IE=9"/>
+<meta name="generator" content="Doxygen 1.8.13"/>
+<meta name="keywords" content="madlib,postgres,greenplum,machine learning,data mining,deep learning,ensemble methods,data science,market basket analysis,affinity analysis,pca,lda,regression,elastic net,huber white,proportional hazards,k-means,latent dirichlet allocation,bayes,support vector machines,svm"/>
+<title>MADlib: Summary</title>
+<link href="tabs.css" rel="stylesheet" type="text/css"/>
+<script type="text/javascript" src="jquery.js"></script>
+<script type="text/javascript" src="dynsections.js"></script>
+<link href="navtree.css" rel="stylesheet" type="text/css"/>
+<script type="text/javascript" src="resize.js"></script>
+<script type="text/javascript" src="navtreedata.js"></script>
+<script type="text/javascript" src="navtree.js"></script>
+<script type="text/javascript">
+  $(document).ready(initResizable);
+</script>
+<link href="search/search.css" rel="stylesheet" type="text/css"/>
+<script type="text/javascript" src="search/searchdata.js"></script>
+<script type="text/javascript" src="search/search.js"></script>
+<script type="text/javascript">
+  $(document).ready(function() { init_search(); });
+</script>
+<script type="text/x-mathjax-config">
+  MathJax.Hub.Config({
+    extensions: ["tex2jax.js", "TeX/AMSmath.js", "TeX/AMSsymbols.js"],
+    jax: ["input/TeX","output/HTML-CSS"],
+});
+</script><script type="text/javascript" src="http://cdn.mathjax.org/mathjax/latest/MathJax.js"></script>
+<!-- hack in the navigation tree -->
+<script type="text/javascript" src="eigen_navtree_hacks.js"></script>
+<link href="doxygen.css" rel="stylesheet" type="text/css" />
+<link href="madlib_extra.css" rel="stylesheet" type="text/css"/>
+<!-- google analytics -->
+<script>
+  (function(i,s,o,g,r,a,m){i['GoogleAnalyticsObject']=r;i[r]=i[r]||function(){
+  (i[r].q=i[r].q||[]).push(arguments)},i[r].l=1*new Date();a=s.createElement(o),
+  m=s.getElementsByTagName(o)[0];a.async=1;a.src=g;m.parentNode.insertBefore(a,m)
+  })(window,document,'script','//www.google-analytics.com/analytics.js','ga');
+  ga('create', 'UA-45382226-1', 'madlib.apache.org');
+  ga('send', 'pageview');
+</script>
+</head>
+<body>
+<div id="top"><!-- do not remove this div, it is closed by doxygen! -->
+<div id="titlearea">
+<table cellspacing="0" cellpadding="0">
+ <tbody>
+ <tr style="height: 56px;">
+  <td id="projectlogo"><a href="http://madlib.apache.org"><img alt="Logo" src="madlib.png" height="50" style="padding-left:0.5em;" border="0"/ ></a></td>
+  <td style="padding-left: 0.5em;">
+   <div id="projectname">
+   <span id="projectnumber">1.13</span>
+   </div>
+   <div id="projectbrief">User Documentation for MADlib</div>
+  </td>
+   <td>        <div id="MSearchBox" class="MSearchBoxInactive">
+        <span class="left">
+          <img id="MSearchSelect" src="search/mag_sel.png"
+               onmouseover="return searchBox.OnSearchSelectShow()"
+               onmouseout="return searchBox.OnSearchSelectHide()"
+               alt=""/>
+          <input type="text" id="MSearchField" value="Search" accesskey="S"
+               onfocus="searchBox.OnSearchFieldFocus(true)" 
+               onblur="searchBox.OnSearchFieldFocus(false)" 
+               onkeyup="searchBox.OnSearchFieldChange(event)"/>
+          </span><span class="right">
+            <a id="MSearchClose" href="javascript:searchBox.CloseResultsWindow()"><img id="MSearchCloseImg" border="0" src="search/close.png" alt=""/></a>
+          </span>
+        </div>
+</td>
+ </tr>
+ </tbody>
+</table>
+</div>
+<!-- end header part -->
+<!-- Generated by Doxygen 1.8.13 -->
+<script type="text/javascript">
+var searchBox = new SearchBox("searchBox", "search",false,'Search');
+</script>
+</div><!-- top -->
+<div id="side-nav" class="ui-resizable side-nav-resizable">
+  <div id="nav-tree">
+    <div id="nav-tree-contents">
+      <div id="nav-sync" class="sync"></div>
+    </div>
+  </div>
+  <div id="splitbar" style="-moz-user-select:none;" 
+       class="ui-resizable-handle">
+  </div>
+</div>
+<script type="text/javascript">
+$(document).ready(function(){initNavTree('group__grp__summary.html','');});
+</script>
+<div id="doc-content">
+<!-- window showing the filter options -->
+<div id="MSearchSelectWindow"
+     onmouseover="return searchBox.OnSearchSelectShow()"
+     onmouseout="return searchBox.OnSearchSelectHide()"
+     onkeydown="return searchBox.OnSearchSelectKey(event)">
+</div>
+
+<!-- iframe showing the search results (closed by default) -->
+<div id="MSearchResultsWindow">
+<iframe src="javascript:void(0)" frameborder="0" 
+        name="MSearchResults" id="MSearchResults">
+</iframe>
+</div>
+
+<div class="header">
+  <div class="headertitle">
+<div class="title">Summary<div class="ingroups"><a class="el" href="group__grp__stats.html">Statistics</a> &raquo; <a class="el" href="group__grp__desc__stats.html">Descriptive Statistics</a></div></div>  </div>
+</div><!--header-->
+<div class="contents">
+<div class="toc"><b>Contents</b> <ul>
+<li>
+<a href="#usage">Summary Function Syntax</a> </li>
+<li>
+<a href="#examples">Examples</a> </li>
+<li>
+<a href="#notes">Notes</a> </li>
+<li>
+<a href="#related">Related Topics</a> </li>
+</ul>
+</div><p>The MADlib <b><a class="el" href="summary_8sql__in.html#a4be51e88a1df45191a1692b95429af36">summary()</a></b> function produces summary statistics for any data table. The function invokes various methods from the MADlib library to provide the data overview.</p>
+<p><a class="anchor" id="usage"></a></p><dl class="section user"><dt>Summary Function Syntax</dt><dd>The <b><a class="el" href="summary_8sql__in.html#a4be51e88a1df45191a1692b95429af36">summary()</a></b> function has the following syntax:</dd></dl>
+<pre class="syntax">
+summary ( source_table,
+          output_table,
+          target_cols,
+          grouping_cols,
+          get_distinct,
+          get_quartiles,
+          ntile_array,
+          how_many_mfv,
+          get_estimates,
+          n_cols_per_run
+        )
+</pre><p> The <b><a class="el" href="summary_8sql__in.html#a4be51e88a1df45191a1692b95429af36">summary()</a></b> function returns a composite type containing three fields: </p><table class="output">
+<tr>
+<th>output_table </th><td>TEXT. The name of the output table.  </td></tr>
+<tr>
+<th>row_count </th><td>INTEGER. The number of rows in the output table.  </td></tr>
+<tr>
+<th>duration </th><td>FLOAT8. The time taken (in seconds) to compute the summary.  </td></tr>
+</table>
+<p><b>Arguments</b> </p><dl class="arglist">
+<dt>source_table </dt>
+<dd><p class="startdd">TEXT. Name of the table containing the input data.</p>
+<p class="enddd"></p>
+</dd>
+<dt>output_table </dt>
+<dd><p class="startdd">TEXT. Name of the table for the output summary statistics. This table contains the following columns: </p><table class="output">
+<tr>
+<th>group_by </th><td>Group-by column name. NULL if none provided.  </td></tr>
+<tr>
+<th>group_by_value </th><td>Value of the group-by column. NULL if there is no grouping.  </td></tr>
+<tr>
+<th>target_column </th><td>Name of the target column for which the summary is requested.  </td></tr>
+<tr>
+<th>column_number </th><td>Physical column number for the target column, as described in <em>pg_attribute</em>  catalog.  </td></tr>
+<tr>
+<th>data_type </th><td>Data type of the target column. Standard GPDB type descriptors are displayed.  </td></tr>
+<tr>
+<th>row_count </th><td>Number of rows for the target column.  </td></tr>
+<tr>
+<th>distinct_values </th><td>Number of distinct values in the target column. If the <a class="el" href="summary_8sql__in.html#a4be51e88a1df45191a1692b95429af36">summary()</a> function is called with the <em>get_estimates</em> argument set to TRUE (default), then this is an estimated statistic based on the Flajolet-Martin distinct count estimator. If the <em>get_estimates</em> argument is set to FALSE, the exact count is computed using PostgreSQL COUNT DISTINCT.  </td></tr>
+<tr>
+<th>missing_values </th><td>Number of missing values in the target column.  </td></tr>
+<tr>
+<th>blank_values </th><td>Number of blank values. Blanks are defined by this regular expression:<pre class="fragment">'^\w*$'</pre>  </td></tr>
+<tr>
+<th>fraction_missing </th><td>Fraction of total rows that are missing, as a decimal value, e.g. 0.3.  </td></tr>
+<tr>
+<th>fraction_blank </th><td>Fraction of total rows that are blank, as a decimal value, e.g. 0.3.  </td></tr>
+<tr>
+<th>mean </th><td>Mean value of target column if target is numeric, otherwise NULL.  </td></tr>
+<tr>
+<th>variance </th><td>Variance of target column if target is numeric, otherwise NULL.  </td></tr>
+<tr>
+<th>min </th><td>Minimum value of target column. For strings this is the length of the shortest string.  </td></tr>
+<tr>
+<th>max </th><td>Maximum value of target column. For strings this is the length of the longest string.  </td></tr>
+<tr>
+<th>first_quartile </th><td>First quartile (25th percentile), only for numeric columns. (Unavailable for PostgreSQL 9.3 or lower.)  </td></tr>
+<tr>
+<th>median </th><td>Median value of target column, if target is numeric, otherwise NULL. (Unavailable for PostgreSQL 9.3 or lower.)  </td></tr>
+<tr>
+<th>third_quartile </th><td>Third quartile (75th percentile), only for numeric columns. (Unavailable for PostgreSQL 9.3 or lower.)  </td></tr>
+<tr>
+<th>quantile_array </th><td>Percentile values corresponding to <em>ntile_array</em>. (Unavailable for PostgreSQL 9.3 or lower.)  </td></tr>
+<tr>
+<th>most_frequent_values </th><td>An array containing the most frequently occurring values. The <em>how_many_mfv</em> argument determines the length of the array, which is 10 by default. If the <a class="el" href="summary_8sql__in.html#a4be51e88a1df45191a1692b95429af36">summary()</a> function is called with the <em>get_estimates</em> argument set to TRUE (default), the frequent values computation is performed using a parallel aggregation method that is faster, but in some cases may fail to detect the exact most frequent values.  </td></tr>
+<tr>
+<th>mfv_frequencies </th><td>Array containing the frequency count for each of the most frequent values.   </td></tr>
+</table>
+<p class="enddd"></p>
+</dd>
+<dt>target_cols (optional) </dt>
+<dd><p class="startdd">TEXT, default NULL. A comma-separated list of columns to summarize. If NULL, summaries are produced for all columns.</p>
+<p class="enddd"></p>
+</dd>
+<dt>grouping_cols (optional) </dt>
+<dd>TEXT, default: NULL. A comma-separated list of columns on which to group results. If NULL, summaries are produced for the complete table. <dl class="section note"><dt>Note</dt><dd>Summary statistics are calculated for each grouping column independently. That is, grouping columns are not combined as they would be in a regular PostgreSQL GROUP BY clause. (This avoids the long run times and very large output tables that would otherwise result for large input tables with many grouping_cols and target_cols specified.)</dd></dl>
+</dd>
+<dt>get_distinct (optional) </dt>
+<dd><p class="startdd">BOOLEAN, default TRUE. If true, distinct values are counted. The method for computing distinct values depends on the setting of the 'get_estimates' parameter below.</p>
+<p class="enddd"></p>
+</dd>
+<dt>get_quartiles (optional) </dt>
+<dd><p class="startdd">BOOLEAN, default TRUE. If TRUE, quartiles are computed.</p>
+<p class="enddd"></p>
+</dd>
+<dt>ntile_array (optional) </dt>
+<dd>FLOAT8[], default NULL. An array of quantile values to compute. If NULL, quantile values are not computed. <dl class="section note"><dt>Note</dt><dd>Quartile and quantile functions are not available in PostgreSQL 9.3 or lower. If you are using PostgreSQL 9.3 or lower, the output table will not contain these values, even if you set 'get_quartiles' = TRUE or provide an array of quantile values for the parameter 'ntile_array'.</dd></dl>
+</dd>
+<dt>how_many_mfv (optional) </dt>
+<dd><p class="startdd">INTEGER, default: 10. The number of most-frequent-values to compute. The method for computing MFV depends on the setting of the 'get_estimates' parameter below.</p>
+<p class="enddd"></p>
+</dd>
+<dt>get_estimates (optional) </dt>
+<dd><p class="startdd">BOOLEAN, default TRUE. If TRUE, estimated values are produced for distinct values and most frequent values. If FALSE, exact values are calculated, which takes longer to run; the impact depends on data size.</p>
+<p class="enddd"></p>
+</dd>
+<dt>n_cols_per_run (optional) </dt>
+<dd>INTEGER, default: 15. The number of columns for which summary statistics are collected in one pass of the data. This parameter determines the number of passes through the data. For example, with a total of 40 columns to summarize and 'n_cols_per_run = 15', there will be 3 passes through the data, with each pass summarizing a maximum of 15 columns. <dl class="section note"><dt>Note</dt><dd>This parameter should be used with caution. Increasing this parameter could decrease the total run time (if the number of passes decreases), but will increase the memory consumption during each run. Since PostgreSQL limits the memory available for a single aggregate run, this increased memory consumption could result in an out-of-memory termination error.</dd></dl>
+</dd>
+</dl>
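+<p>The pass count in the <em>n_cols_per_run</em> example above is consistent with a ceiling division of the number of columns to summarize by the batch size. The plain SQL below is only a sketch of that arithmetic; it assumes columns are batched this way and does not call any MADlib function. </p><pre class="example">
+-- 40 columns to summarize, at most 15 columns per pass (assumed batching rule)
+SELECT ceil(40 / 15.0)::INTEGER AS passes;
+</pre>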
+<p><a class="anchor" id="examples"></a></p><dl class="section user"><dt>Examples</dt><dd></dd></dl>
+<ol type="1">
+<li>View online help for the <a class="el" href="summary_8sql__in.html#a4be51e88a1df45191a1692b95429af36">summary()</a> function. <pre class="example">
+SELECT * FROM madlib.summary();
+</pre></li>
+<li>Create an input data table using part of the well-known iris data set. <pre class="example">
+DROP TABLE IF EXISTS iris;
+CREATE TABLE iris (id INT, sepal_length FLOAT, sepal_width FLOAT,
+                    petal_length FLOAT, petal_width FLOAT, 
+                   class_name text);                        
+INSERT INTO iris VALUES 
+(1,5.1,3.5,1.4,0.2,'Iris-setosa'),
+(2,4.9,3.0,1.4,0.2,'Iris-setosa'),
+(3,4.7,3.2,1.3,0.2,'Iris-setosa'),
+(4,4.6,3.1,1.5,0.2,'Iris-setosa'),
+(5,5.0,3.6,1.4,0.2,'Iris-setosa'),
+(6,5.4,3.9,1.7,0.4,'Iris-setosa'),
+(7,4.6,3.4,1.4,0.3,'Iris-setosa'),
+(8,5.0,3.4,1.5,0.2,'Iris-setosa'),
+(9,4.4,2.9,1.4,0.2,'Iris-setosa'),
+(10,4.9,3.1,1.5,0.1,'Iris-setosa'),
+(11,7.0,3.2,4.7,1.4,'Iris-versicolor'),
+(12,6.4,3.2,4.5,1.5,'Iris-versicolor'),
+(13,6.9,3.1,4.9,1.5,'Iris-versicolor'),
+(14,5.5,2.3,4.0,1.3,'Iris-versicolor'),
+(15,6.5,2.8,4.6,1.5,'Iris-versicolor'),
+(16,5.7,2.8,4.5,1.3,'Iris-versicolor'),
+(17,6.3,3.3,4.7,1.6,'Iris-versicolor'),
+(18,4.9,2.4,3.3,1.0,'Iris-versicolor'),
+(19,6.6,2.9,4.6,1.3,'Iris-versicolor'),
+(20,5.2,2.7,3.9,1.4,'Iris-versicolor'),
+(21,6.3,3.3,6.0,2.5,'Iris-virginica'),
+(22,5.8,2.7,5.1,1.9,'Iris-virginica'),
+(23,7.1,3.0,5.9,2.1,'Iris-virginica'),
+(24,6.3,2.9,5.6,1.8,'Iris-virginica'),
+(25,6.5,3.0,5.8,2.2,'Iris-virginica'),
+(26,7.6,3.0,6.6,2.1,'Iris-virginica'),
+(27,4.9,2.5,4.5,1.7,'Iris-virginica'),
+(28,7.3,2.9,6.3,1.8,'Iris-virginica'),
+(29,6.7,2.5,5.8,1.8,'Iris-virginica'),
+(30,7.2,3.6,6.1,2.5,'Iris-virginica');
+</pre></li>
+<li>Run the <b><a class="el" href="summary_8sql__in.html#a4be51e88a1df45191a1692b95429af36">summary()</a></b> function using all defaults. <pre class="example">
+DROP TABLE IF EXISTS iris_summary;
+SELECT * FROM madlib.summary( 'iris',            -- Source table
+                              'iris_summary'     -- Output table
+                            );
+</pre> Result: <pre class="result">
+ output_table | row_count |      duration       
+--------------+-----------+---------------------
+ iris_summary |         6 | 0.00712704658508301
+(1 row)
+</pre> View the summary data. <pre class="example">
+-- Turn on expanded display for readability.
+\x on
+SELECT * FROM iris_summary;
+</pre> Result (partial): <pre class="result">
+...
+&#160;-[ RECORD 2 ]-------+-----------------------------------
+group_by             | 
+group_by_value       | 
+target_column        | sepal_length
+column_number        | 2
+data_type            | float8
+row_count            | 30
+distinct_values      | 22
+missing_values       | 0
+blank_values         | 
+fraction_missing     | 0
+fraction_blank       | 
+mean                 | 5.84333333333333
+variance             | 0.9294367816092
+min                  | 4.4
+max                  | 7.6
+first_quartile       | 4.925
+median               | 5.75
+third_quartile       | 6.575
+most_frequent_values | {4.9,6.3,6.5,4.6,5,6.9,5.4,4.4,7,6.4}
+mfv_frequencies      | {4,3,2,2,2,1,1,1,1,1}  
+...
+&#160;-[ RECORD 6 ]-------+-----------------------------------
+group_by             | 
+group_by_value       | 
+target_column        | class_name
+column_number        | 6
+data_type            | text
+row_count            | 30
+distinct_values      | 3
+missing_values       | 0
+blank_values         | 0
+fraction_missing     | 0
+fraction_blank       | 0
+mean                 | 
+variance             | 
+min                  | 11
+max                  | 15
+first_quartile       | 
+median               | 
+third_quartile       | 
+most_frequent_values | {Iris-setosa,Iris-versicolor,Iris-virginica}
+mfv_frequencies      | {10,10,10}
+</pre> Note that for the text column in record 6, some statistics are not applicable and are left empty, and the min and max values represent the lengths of the shortest and longest strings, respectively.</li>
+<li>Now group by the class of iris: <pre class="example">
+DROP TABLE IF EXISTS iris_summary;
+SELECT * FROM madlib.summary( 'iris',                       -- Source table
+                              'iris_summary',               -- Output table
+                              'sepal_length, sepal_width',  -- Columns to summarize
+                              'class_name'                  -- Grouping column
+                            );
+SELECT * FROM iris_summary;
+</pre> Result (partial): <pre class="result">
+&#160;-[ RECORD 1 ]-------+-----------------------------------
+group_by             | class_name
+group_by_value       | Iris-setosa
+target_column        | sepal_length
+column_number        | 2
+data_type            | float8
+row_count            | 10
+distinct_values      | 7
+missing_values       | 0
+blank_values         | 
+fraction_missing     | 0
+fraction_blank       | 
+mean                 | 4.86
+variance             | 0.0848888888888976
+min                  | 4.4
+max                  | 5.4
+first_quartile       | 4.625
+median               | 4.9
+third_quartile       | 5
+most_frequent_values | {4.6,4.9,5,5.1,4.4,5.4,4.7}
+mfv_frequencies      | {2,2,2,1,1,1,1}
+...
+&#160;-[ RECORD 3 ]-------+-----------------------------------
+group_by             | class_name
+group_by_value       | Iris-versicolor
+target_column        | sepal_length
+column_number        | 2
+data_type            | float8
+row_count            | 10
+distinct_values      | 10
+missing_values       | 0
+blank_values         | 
+fraction_missing     | 0
+fraction_blank       | 
+mean                 | 6.1
+variance             | 0.528888888888893
+min                  | 4.9
+max                  | 7
+first_quartile       | 5.55
+median               | 6.35
+third_quartile       | 6.575
+most_frequent_values | {7,6.4,6.9,5.5,6.5,5.7,6.3,4.9,6.6,5.2}
+mfv_frequencies      | {1,1,1,1,1,1,1,1,1,1}
+...
+</pre></li>
+<li>Try some other parameters: <pre class="example">
+DROP TABLE IF EXISTS iris_summary;
+SELECT * FROM madlib.summary( 'iris',                       -- Source table
+                              'iris_summary',               -- Output table
+                              'sepal_length, sepal_width',  -- Columns to summarize
+                               NULL,                        -- No grouping
+                               TRUE,                        -- Get distinct values
+                               FALSE,                       -- Don't get quartiles
+                               ARRAY[0.33, 0.66],           -- Get ntiles
+                               3,                           -- Number of MFV to compute
+                               FALSE                        -- Get exact values
+                            );
+SELECT * FROM iris_summary;
+</pre> Result: <pre class="result">
+&#160;-[ RECORD 1 ]-------+-----------------------------------
+group_by             | 
+group_by_value       | 
+target_column        | sepal_length
+column_number        | 2
+data_type            | float8
+row_count            | 30
+distinct_values      | 22
+missing_values       | 0
+blank_values         | 
+fraction_missing     | 0
+fraction_blank       | 
+mean                 | 5.84333333333333
+variance             | 0.9294367816092
+min                  | 4.4
+max                  | 7.6
+quantile_array       | {5.057,6.414}
+most_frequent_values | {4.9,6.3,5}
+mfv_frequencies      | {4,3,2}
+&#160;-[ RECORD 2 ]-------+-----------------------------------
+group_by             | 
+group_by_value       | 
+target_column        | sepal_width
+column_number        | 3
+data_type            | float8
+row_count            | 30
+distinct_values      | 14
+missing_values       | 0
+blank_values         | 
+fraction_missing     | 0
+fraction_blank       | 
+mean                 | 3.04
+variance             | 0.13903448275862
+min                  | 2.3
+max                  | 3.9
+quantile_array       | {2.9,3.2}
+most_frequent_values | {3,2.9,3.2}
+mfv_frequencies      | {4,4,3}
+</pre></li>
+</ol>
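+<p>As a sanity check, several of the statistics reported for <em>sepal_length</em> can be recomputed with standard PostgreSQL aggregates. This is an illustrative sketch that assumes the <em>iris</em> table created in the examples above; it is not part of the summary() function. Note that <em>distinct_values</em> in the summary output is an estimate when <em>get_estimates</em> is TRUE, so it is not guaranteed to match the exact count computed here. </p><pre class="example">
+SELECT count(*)                       AS row_count,
+       count(DISTINCT sepal_length)   AS distinct_values,   -- exact count
+       count(*) - count(sepal_length) AS missing_values,    -- rows where the column is NULL
+       avg(sepal_length)              AS mean,
+       variance(sepal_length)         AS variance,          -- sample variance
+       min(sepal_length)              AS min,
+       max(sepal_length)              AS max
+FROM iris;
+</pre>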
+<p><a class="anchor" id="notes"></a></p><dl class="section user"><dt>Notes</dt><dd><ul>
+<li>Table names can optionally be schema-qualified (if no schema name is provided, the schemas returned by current_schemas() are searched). Table and column names should follow the case-sensitivity and quoting rules of the database. (For instance, 'mytable' and 'MyTable' both resolve to the same entity, i.e. 'mytable'. If mixed-case or multi-byte characters are desired for entity names, then the string should be double-quoted; in this case the input would be '"MyTable"'.)</li>
+<li>The <em>get_estimates</em> parameter controls computation for both distinct count and most frequent values:<ul>
+<li>If <em>get_estimates</em> is TRUE then the distinct value computation is estimated using Flajolet-Martin. MFV is computed using a fast method that performs parallel aggregation in Greenplum Database, at the cost of possibly missing some of the most frequent values.</li>
+<li>If <em>get_estimates</em> is FALSE then the distinct values are computed with a slower but exact method using PostgreSQL COUNT DISTINCT. MFV is computed using a faithful implementation that preserves the approximation guarantees of the Cormode/Muthukrishnan method (more information at <a class="el" href="group__grp__mfvsketch.html">MFV (Most Frequent Values)</a>).</li>
+</ul>
+</li>
+</ul>
+</dd></dl>
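+<p>One way to observe the effect of <em>get_estimates</em> is to run summary() twice and compare the two outputs side by side. The sketch below assumes the <em>iris</em> table from the examples above; the output table names <em>iris_summary_est</em> and <em>iris_summary_exact</em> are hypothetical. On a table this small the estimated and exact counts may well coincide. </p><pre class="example">
+DROP TABLE IF EXISTS iris_summary_est, iris_summary_exact;
+-- Arguments: source, output, target_cols, grouping_cols, get_distinct,
+--            get_quartiles, ntile_array, how_many_mfv, get_estimates
+SELECT * FROM madlib.summary('iris', 'iris_summary_est',   NULL, NULL, TRUE, TRUE, NULL, 10, TRUE);
+SELECT * FROM madlib.summary('iris', 'iris_summary_exact', NULL, NULL, TRUE, TRUE, NULL, 10, FALSE);
+-- Compare estimated and exact distinct counts per column
+SELECT e.target_column,
+       e.distinct_values AS estimated,
+       x.distinct_values AS exact
+FROM iris_summary_est e
+JOIN iris_summary_exact x ON e.target_column = x.target_column
+ORDER BY e.column_number;
+</pre>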
+<p><a class="anchor" id="related"></a></p><dl class="section user"><dt>Related Topics</dt><dd>File <a class="el" href="summary_8sql__in.html" title="Summary function for descriptive statistics. ">summary.sql_in</a> documenting the <b><a class="el" href="summary_8sql__in.html#a4be51e88a1df45191a1692b95429af36">summary()</a></b> function</dd></dl>
+<p><a class="el" href="group__grp__fmsketch.html">FM (Flajolet-Martin)</a> <br />
+ <a class="el" href="group__grp__mfvsketch.html">MFV (Most Frequent Values)</a> <br />
+ <a class="el" href="group__grp__countmin.html">CountMin (Cormode-Muthukrishnan)</a> </p>
+</div><!-- contents -->
+</div><!-- doc-content -->
+<!-- start footer part -->
+<div id="nav-path" class="navpath"><!-- id is needed for treeview function! -->
+  <ul>
+    <li class="footer">Generated on Wed Dec 27 2017 19:05:57 for MADlib by
+    <a href="http://www.doxygen.org/index.html">
+    <img class="footer" src="doxygen.png" alt="doxygen"/></a> 1.8.13 </li>
+  </ul>
+</div>
+</body>
+</html>

http://git-wip-us.apache.org/repos/asf/madlib-site/blob/6c103d3e/docs/v1.13/group__grp__super.html
----------------------------------------------------------------------
diff --git a/docs/v1.13/group__grp__super.html b/docs/v1.13/group__grp__super.html
new file mode 100644
index 0000000..16517b0
--- /dev/null
+++ b/docs/v1.13/group__grp__super.html
@@ -0,0 +1,149 @@
+<!-- HTML header for doxygen 1.8.4-->
+<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
+<html xmlns="http://www.w3.org/1999/xhtml">
+<head>
+<meta http-equiv="Content-Type" content="text/xhtml;charset=UTF-8"/>
+<meta http-equiv="X-UA-Compatible" content="IE=9"/>
+<meta name="generator" content="Doxygen 1.8.13"/>
+<meta name="keywords" content="madlib,postgres,greenplum,machine learning,data mining,deep learning,ensemble methods,data science,market basket analysis,affinity analysis,pca,lda,regression,elastic net,huber white,proportional hazards,k-means,latent dirichlet allocation,bayes,support vector machines,svm"/>
+<title>MADlib: Supervised Learning</title>
+<link href="tabs.css" rel="stylesheet" type="text/css"/>
+<script type="text/javascript" src="jquery.js"></script>
+<script type="text/javascript" src="dynsections.js"></script>
+<link href="navtree.css" rel="stylesheet" type="text/css"/>
+<script type="text/javascript" src="resize.js"></script>
+<script type="text/javascript" src="navtreedata.js"></script>
+<script type="text/javascript" src="navtree.js"></script>
+<script type="text/javascript">
+  $(document).ready(initResizable);
+</script>
+<link href="search/search.css" rel="stylesheet" type="text/css"/>
+<script type="text/javascript" src="search/searchdata.js"></script>
+<script type="text/javascript" src="search/search.js"></script>
+<script type="text/javascript">
+  $(document).ready(function() { init_search(); });
+</script>
+<script type="text/x-mathjax-config">
+  MathJax.Hub.Config({
+    extensions: ["tex2jax.js", "TeX/AMSmath.js", "TeX/AMSsymbols.js"],
+    jax: ["input/TeX","output/HTML-CSS"],
+});
+</script><script type="text/javascript" src="http://cdn.mathjax.org/mathjax/latest/MathJax.js"></script>
+<!-- hack in the navigation tree -->
+<script type="text/javascript" src="eigen_navtree_hacks.js"></script>
+<link href="doxygen.css" rel="stylesheet" type="text/css" />
+<link href="madlib_extra.css" rel="stylesheet" type="text/css"/>
+<!-- google analytics -->
+<script>
+  (function(i,s,o,g,r,a,m){i['GoogleAnalyticsObject']=r;i[r]=i[r]||function(){
+  (i[r].q=i[r].q||[]).push(arguments)},i[r].l=1*new Date();a=s.createElement(o),
+  m=s.getElementsByTagName(o)[0];a.async=1;a.src=g;m.parentNode.insertBefore(a,m)
+  })(window,document,'script','//www.google-analytics.com/analytics.js','ga');
+  ga('create', 'UA-45382226-1', 'madlib.apache.org');
+  ga('send', 'pageview');
+</script>
+</head>
+<body>
+<div id="top"><!-- do not remove this div, it is closed by doxygen! -->
+<div id="titlearea">
+<table cellspacing="0" cellpadding="0">
+ <tbody>
+ <tr style="height: 56px;">
+  <td id="projectlogo"><a href="http://madlib.apache.org"><img alt="Logo" src="madlib.png" height="50" style="padding-left:0.5em;" border="0"/ ></a></td>
+  <td style="padding-left: 0.5em;">
+   <div id="projectname">
+   <span id="projectnumber">1.13</span>
+   </div>
+   <div id="projectbrief">User Documentation for MADlib</div>
+  </td>
+   <td>        <div id="MSearchBox" class="MSearchBoxInactive">
+        <span class="left">
+          <img id="MSearchSelect" src="search/mag_sel.png"
+               onmouseover="return searchBox.OnSearchSelectShow()"
+               onmouseout="return searchBox.OnSearchSelectHide()"
+               alt=""/>
+          <input type="text" id="MSearchField" value="Search" accesskey="S"
+               onfocus="searchBox.OnSearchFieldFocus(true)" 
+               onblur="searchBox.OnSearchFieldFocus(false)" 
+               onkeyup="searchBox.OnSearchFieldChange(event)"/>
+          </span><span class="right">
+            <a id="MSearchClose" href="javascript:searchBox.CloseResultsWindow()"><img id="MSearchCloseImg" border="0" src="search/close.png" alt=""/></a>
+          </span>
+        </div>
+</td>
+ </tr>
+ </tbody>
+</table>
+</div>
+<!-- end header part -->
+<!-- Generated by Doxygen 1.8.13 -->
+<script type="text/javascript">
+var searchBox = new SearchBox("searchBox", "search",false,'Search');
+</script>
+</div><!-- top -->
+<div id="side-nav" class="ui-resizable side-nav-resizable">
+  <div id="nav-tree">
+    <div id="nav-tree-contents">
+      <div id="nav-sync" class="sync"></div>
+    </div>
+  </div>
+  <div id="splitbar" style="-moz-user-select:none;" 
+       class="ui-resizable-handle">
+  </div>
+</div>
+<script type="text/javascript">
+$(document).ready(function(){initNavTree('group__grp__super.html','');});
+</script>
+<div id="doc-content">
+<!-- window showing the filter options -->
+<div id="MSearchSelectWindow"
+     onmouseover="return searchBox.OnSearchSelectShow()"
+     onmouseout="return searchBox.OnSearchSelectHide()"
+     onkeydown="return searchBox.OnSearchSelectKey(event)">
+</div>
+
+<!-- iframe showing the search results (closed by default) -->
+<div id="MSearchResultsWindow">
+<iframe src="javascript:void(0)" frameborder="0" 
+        name="MSearchResults" id="MSearchResults">
+</iframe>
+</div>
+
+<div class="header">
+  <div class="summary">
+<a href="#groups">Modules</a>  </div>
+  <div class="headertitle">
+<div class="title">Supervised Learning</div>  </div>
+</div><!--header-->
+<div class="contents">
+<a name="details" id="details"></a><h2 class="groupheader">Detailed Description</h2>
+<p>Contains methods that perform supervised learning tasks. </p>
+<table class="memberdecls">
+<tr class="heading"><td colspan="2"><h2 class="groupheader"><a name="groups"></a>
+Modules</h2></td></tr>
+<tr class="memitem:group__grp__crf"><td class="memItemLeft" align="right" valign="top">&#160;</td><td class="memItemRight" valign="bottom"><a class="el" href="group__grp__crf.html">Conditional Random Field</a></td></tr>
+<tr class="memdesc:group__grp__crf"><td class="mdescLeft">&#160;</td><td class="mdescRight">Constructs a Conditional Random Fields (CRF) model for labeling sequential data. <br /></td></tr>
+<tr class="separator:"><td class="memSeparator" colspan="2">&#160;</td></tr>
+<tr class="memitem:group__grp__nn"><td class="memItemLeft" align="right" valign="top">&#160;</td><td class="memItemRight" valign="bottom"><a class="el" href="group__grp__nn.html">Neural Network</a></td></tr>
+<tr class="memdesc:group__grp__nn"><td class="mdescLeft">&#160;</td><td class="mdescRight">Solves classification and regression problems with several fully connected layers and non-linear activation functions. <br /></td></tr>
+<tr class="separator:"><td class="memSeparator" colspan="2">&#160;</td></tr>
+<tr class="memitem:group__grp__regml"><td class="memItemLeft" align="right" valign="top">&#160;</td><td class="memItemRight" valign="bottom"><a class="el" href="group__grp__regml.html">Regression Models</a></td></tr>
+<tr class="separator:"><td class="memSeparator" colspan="2">&#160;</td></tr>
+<tr class="memitem:group__grp__svm"><td class="memItemLeft" align="right" valign="top">&#160;</td><td class="memItemRight" valign="bottom"><a class="el" href="group__grp__svm.html">Support Vector Machines</a></td></tr>
+<tr class="memdesc:group__grp__svm"><td class="mdescLeft">&#160;</td><td class="mdescRight">Solves classification and regression problems by separating data with a hyperplane or other nonlinear decision boundary. <br /></td></tr>
+<tr class="separator:"><td class="memSeparator" colspan="2">&#160;</td></tr>
+<tr class="memitem:group__grp__tree"><td class="memItemLeft" align="right" valign="top">&#160;</td><td class="memItemRight" valign="bottom"><a class="el" href="group__grp__tree.html">Tree Methods</a></td></tr>
+<tr class="separator:"><td class="memSeparator" colspan="2">&#160;</td></tr>
+</table>
+</div><!-- contents -->
+</div><!-- doc-content -->
+<!-- start footer part -->
+<div id="nav-path" class="navpath"><!-- id is needed for treeview function! -->
+  <ul>
+    <li class="footer">Generated on Wed Dec 27 2017 19:05:57 for MADlib by
+    <a href="http://www.doxygen.org/index.html">
+    <img class="footer" src="doxygen.png" alt="doxygen"/></a> 1.8.13 </li>
+  </ul>
+</div>
+</body>
+</html>

http://git-wip-us.apache.org/repos/asf/madlib-site/blob/6c103d3e/docs/v1.13/group__grp__super.js
----------------------------------------------------------------------
diff --git a/docs/v1.13/group__grp__super.js b/docs/v1.13/group__grp__super.js
new file mode 100644
index 0000000..c36abae
--- /dev/null
+++ b/docs/v1.13/group__grp__super.js
@@ -0,0 +1,8 @@
+var group__grp__super =
+[
+    [ "Conditional Random Field", "group__grp__crf.html", null ],
+    [ "Neural Network", "group__grp__nn.html", null ],
+    [ "Regression Models", "group__grp__regml.html", "group__grp__regml" ],
+    [ "Support Vector Machines", "group__grp__svm.html", null ],
+    [ "Tree Methods", "group__grp__tree.html", "group__grp__tree" ]
+];
\ No newline at end of file

http://git-wip-us.apache.org/repos/asf/madlib-site/blob/6c103d3e/docs/v1.13/group__grp__svd.html
----------------------------------------------------------------------
diff --git a/docs/v1.13/group__grp__svd.html b/docs/v1.13/group__grp__svd.html
new file mode 100644
index 0000000..2de377a
--- /dev/null
+++ b/docs/v1.13/group__grp__svd.html
@@ -0,0 +1,417 @@
+<!-- HTML header for doxygen 1.8.4-->
+<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
+<html xmlns="http://www.w3.org/1999/xhtml">
+<head>
+<meta http-equiv="Content-Type" content="text/xhtml;charset=UTF-8"/>
+<meta http-equiv="X-UA-Compatible" content="IE=9"/>
+<meta name="generator" content="Doxygen 1.8.13"/>
+<meta name="keywords" content="madlib,postgres,greenplum,machine learning,data mining,deep learning,ensemble methods,data science,market basket analysis,affinity analysis,pca,lda,regression,elastic net,huber white,proportional hazards,k-means,latent dirichlet allocation,bayes,support vector machines,svm"/>
+<title>MADlib: Singular Value Decomposition</title>
+<link href="tabs.css" rel="stylesheet" type="text/css"/>
+<script type="text/javascript" src="jquery.js"></script>
+<script type="text/javascript" src="dynsections.js"></script>
+<link href="navtree.css" rel="stylesheet" type="text/css"/>
+<script type="text/javascript" src="resize.js"></script>
+<script type="text/javascript" src="navtreedata.js"></script>
+<script type="text/javascript" src="navtree.js"></script>
+<script type="text/javascript">
+  $(document).ready(initResizable);
+</script>
+<link href="search/search.css" rel="stylesheet" type="text/css"/>
+<script type="text/javascript" src="search/searchdata.js"></script>
+<script type="text/javascript" src="search/search.js"></script>
+<script type="text/javascript">
+  $(document).ready(function() { init_search(); });
+</script>
+<script type="text/x-mathjax-config">
+  MathJax.Hub.Config({
+    extensions: ["tex2jax.js", "TeX/AMSmath.js", "TeX/AMSsymbols.js"],
+    jax: ["input/TeX","output/HTML-CSS"],
+});
+</script><script type="text/javascript" src="http://cdn.mathjax.org/mathjax/latest/MathJax.js"></script>
+<!-- hack in the navigation tree -->
+<script type="text/javascript" src="eigen_navtree_hacks.js"></script>
+<link href="doxygen.css" rel="stylesheet" type="text/css" />
+<link href="madlib_extra.css" rel="stylesheet" type="text/css"/>
+<!-- google analytics -->
+<script>
+  (function(i,s,o,g,r,a,m){i['GoogleAnalyticsObject']=r;i[r]=i[r]||function(){
+  (i[r].q=i[r].q||[]).push(arguments)},i[r].l=1*new Date();a=s.createElement(o),
+  m=s.getElementsByTagName(o)[0];a.async=1;a.src=g;m.parentNode.insertBefore(a,m)
+  })(window,document,'script','//www.google-analytics.com/analytics.js','ga');
+  ga('create', 'UA-45382226-1', 'madlib.apache.org');
+  ga('send', 'pageview');
+</script>
+</head>
+<body>
+<div id="top"><!-- do not remove this div, it is closed by doxygen! -->
+<div id="titlearea">
+<table cellspacing="0" cellpadding="0">
+ <tbody>
+ <tr style="height: 56px;">
+  <td id="projectlogo"><a href="http://madlib.apache.org"><img alt="Logo" src="madlib.png" height="50" style="padding-left:0.5em;" border="0"/></a></td>
+  <td style="padding-left: 0.5em;">
+   <div id="projectname">
+   <span id="projectnumber">1.13</span>
+   </div>
+   <div id="projectbrief">User Documentation for MADlib</div>
+  </td>
+   <td>        <div id="MSearchBox" class="MSearchBoxInactive">
+        <span class="left">
+          <img id="MSearchSelect" src="search/mag_sel.png"
+               onmouseover="return searchBox.OnSearchSelectShow()"
+               onmouseout="return searchBox.OnSearchSelectHide()"
+               alt=""/>
+          <input type="text" id="MSearchField" value="Search" accesskey="S"
+               onfocus="searchBox.OnSearchFieldFocus(true)" 
+               onblur="searchBox.OnSearchFieldFocus(false)" 
+               onkeyup="searchBox.OnSearchFieldChange(event)"/>
+          </span><span class="right">
+            <a id="MSearchClose" href="javascript:searchBox.CloseResultsWindow()"><img id="MSearchCloseImg" border="0" src="search/close.png" alt=""/></a>
+          </span>
+        </div>
+</td>
+ </tr>
+ </tbody>
+</table>
+</div>
+<!-- end header part -->
+<!-- Generated by Doxygen 1.8.13 -->
+<script type="text/javascript">
+var searchBox = new SearchBox("searchBox", "search",false,'Search');
+</script>
+</div><!-- top -->
+<div id="side-nav" class="ui-resizable side-nav-resizable">
+  <div id="nav-tree">
+    <div id="nav-tree-contents">
+      <div id="nav-sync" class="sync"></div>
+    </div>
+  </div>
+  <div id="splitbar" style="-moz-user-select:none;" 
+       class="ui-resizable-handle">
+  </div>
+</div>
+<script type="text/javascript">
+$(document).ready(function(){initNavTree('group__grp__svd.html','');});
+</script>
+<div id="doc-content">
+<!-- window showing the filter options -->
+<div id="MSearchSelectWindow"
+     onmouseover="return searchBox.OnSearchSelectShow()"
+     onmouseout="return searchBox.OnSearchSelectHide()"
+     onkeydown="return searchBox.OnSearchSelectKey(event)">
+</div>
+
+<!-- iframe showing the search results (closed by default) -->
+<div id="MSearchResultsWindow">
+<iframe src="javascript:void(0)" frameborder="0" 
+        name="MSearchResults" id="MSearchResults">
+</iframe>
+</div>
+
+<div class="header">
+  <div class="headertitle">
+<div class="title">Singular Value Decomposition<div class="ingroups"><a class="el" href="group__grp__datatrans.html">Data Types and Transformations</a> &raquo; <a class="el" href="group__grp__arraysmatrix.html">Arrays and Matrices</a> &raquo; <a class="el" href="group__grp__matrix__factorization.html">Matrix Factorization</a></div></div>  </div>
+</div><!--header-->
+<div class="contents">
+<div class="toc"><b>Contents</b> <ul>
+<li>
+<a href="#syntax">SVD Functions</a> </li>
+<li>
+<a href="#output">Output Tables</a> </li>
+<li>
+<a href="#examples">Examples</a></li>
+<li>
+<a href="#background">Technical Background</a> </li>
+</ul>
+</div><p>In linear algebra, the singular value decomposition (SVD) is a factorization of a real or complex matrix, with many useful applications in signal processing and statistics.</p>
+<p>Let \(A\) be an \(m \times n\) matrix, where \(m \ge n\). Then \(A\) can be decomposed as follows: </p><p class="formulaDsp">
+\[ A = U \Sigma V^T, \]
+</p>
+<p> where \(U\) is an \(m \times n\) orthonormal matrix, \(\Sigma\) is an \(n \times n\) diagonal matrix, and \(V\) is an \(n \times n\) orthonormal matrix. The diagonal elements of \(\Sigma\) are called the <em>singular values</em>.</p>
+<p><a class="anchor" id="syntax"></a></p><dl class="section user"><dt>SVD Functions</dt><dd></dd></dl>
+<p>SVD factorizations are provided for dense and sparse matrices. In addition, a native implementation is provided for very sparse matrices for improved performance.</p>
+<p><b>SVD Function for Dense Matrices</b></p>
+<pre class="syntax">
+svd( source_table,
+     output_table_prefix,
+     row_id,
+     k,
+     n_iterations,
+     result_summary_table
+);
+</pre><p> <b>Arguments</b> </p><dl class="arglist">
+<dt>source_table </dt>
+<dd><p class="startdd">TEXT. Source table name (dense matrix).</p>
+<p class="enddd">The table contains a <code>row_id</code> column that identifies each row, with numbering starting from 1. The other columns contain the data for the matrix. There are two possible dense formats as illustrated by the 2x2 matrix example below. You can use either of these dense formats:</p><ol type="1">
+<li><pre class="example">
+            row_id     col1     col2
+row1         1           1         0
+row2         2           0         1
+    </pre></li>
+<li><pre class="example">
+        row_id     row_vec
+row1        1       {1, 0}
+row2        2       {0, 1}
+    </pre>  </li>
+</ol>
+</dd>
+<dt>output_table_prefix </dt>
+<dd>TEXT. Prefix for output tables. See <a href="#output">Output Tables</a> below for a description of the convention used. </dd>
+<dt>row_id </dt>
+<dd>TEXT. Name of the column containing the row index. </dd>
+<dt>k </dt>
+<dd>INTEGER. Number of singular values to compute. </dd>
+<dt>n_iterations (optional) </dt>
+<dd>INTEGER. Number of iterations to run. <dl class="section note"><dt>Note</dt><dd>The number of iterations must be in the range [k, column dimension], where k is the number of singular values. </dd></dl>
+</dd>
+<dt>result_summary_table (optional) </dt>
+<dd>TEXT. The name of the table to store the result summary. </dd>
+</dl>
+<hr/>
+<p> <b>SVD Function for Sparse Matrices</b></p>
+<p>Use this function for matrices that are represented in the sparse-matrix format (example below). <b>Note that the input matrix is converted to a dense matrix before the SVD operation, for computational efficiency. </b></p>
+<pre class="syntax">
+svd_sparse( source_table,
+            output_table_prefix,
+            row_id,
+            col_id,
+            value,
+            row_dim,
+            col_dim,
+            k,
+            n_iterations,
+            result_summary_table
+          );
+</pre><p> <b>Arguments</b> </p><dl class="arglist">
+<dt>source_table </dt>
+<dd><p class="startdd">TEXT. Source table name (sparse matrix).</p>
+<p>A sparse matrix is represented using the row and column indices for each non-zero entry of the matrix. This representation is useful for matrices containing many zero elements. Below is an example of a sparse 4x7 matrix with just 6 out of 28 entries being non-zero. The dimensionality of the matrix is inferred using the maximum values in the <em>row_id</em> and <em>col_id</em> columns. Note that the last entry is included (even though it is 0) to provide the dimensionality of the matrix, indicating that the 4th row and 7th column contain all zeros. </p><pre class="example">
+ row_id | col_id | value
+--------+--------+-------
+      1 |      1 |     9
+      1 |      5 |     6
+      1 |      6 |     6
+      2 |      1 |     8
+      3 |      1 |     3
+      3 |      2 |     9
+      4 |      7 |     0
+(7 rows)
+</pre> <p class="enddd"></p>
+</dd>
+<dt>output_table_prefix </dt>
+<dd>TEXT. Prefix for output tables. See <a href="#output">Output Tables</a> below for a description of the convention used.  </dd>
+<dt>row_id </dt>
+<dd>TEXT. Name of the column containing the row index for each entry in sparse matrix. </dd>
+<dt>col_id </dt>
+<dd>TEXT. Name of the column containing the column index for each entry in sparse matrix. </dd>
+<dt>value </dt>
+<dd>TEXT. Name of column containing the non-zero values of the sparse matrix. </dd>
+<dt>row_dim </dt>
+<dd>INTEGER. Number of rows in matrix. </dd>
+<dt>col_dim </dt>
+<dd>INTEGER. Number of columns in matrix. </dd>
+<dt>k </dt>
+<dd>INTEGER. Number of singular values to compute. </dd>
+<dt>n_iterations (optional) </dt>
+<dd>INTEGER. Number of iterations to run. <dl class="section note"><dt>Note</dt><dd>The number of iterations must be in the range [k, column dimension], where k is the number of singular values. </dd></dl>
+</dd>
+<dt>result_summary_table (optional) </dt>
+<dd>TEXT. The name of the table to store the result summary. </dd>
+</dl>
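+<p>To make the sparse-matrix input format concrete, the following is a minimal Python sketch (not MADlib code) that expands the (row_id, col_id, value) triples from the 4x7 example above into a dense matrix. The function name <code>sparse_to_dense</code> is hypothetical and is used only for illustration; it mirrors the dense conversion that <code>svd_sparse</code> is described as performing, without claiming to reproduce its implementation.</p>

```python
def sparse_to_dense(triples, row_dim, col_dim):
    """Build a dense row-major matrix from (row_id, col_id, value) triples.

    Indices are 1-based, matching the row_id/col_id convention of the
    sparse table format. sparse_to_dense is a hypothetical helper.
    """
    dense = [[0.0] * col_dim for _ in range(row_dim)]
    for row_id, col_id, value in triples:
        dense[row_id - 1][col_id - 1] = float(value)
    return dense


# The seven rows of the example table, including the trailing zero entry
# that only marks the 4x7 dimensions.
triples = [
    (1, 1, 9), (1, 5, 6), (1, 6, 6),
    (2, 1, 8),
    (3, 1, 3), (3, 2, 9),
    (4, 7, 0),
]

# Infer the dimensions from the largest indices, as the text describes.
row_dim = max(r for r, _, _ in triples)
col_dim = max(c for _, c, _ in triples)

dense = sparse_to_dense(triples, row_dim, col_dim)
nonzero = sum(1 for row in dense for v in row if v != 0)
```

+<p>Here <code>row_dim</code> and <code>col_dim</code> come out as 4 and 7, which are the values one would pass as the <code>row_dim</code> and <code>col_dim</code> arguments for this table.</p>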
+<hr/>
+<p> <b>Native Implementation for Sparse Matrices</b></p>
+<p>Use this function for matrices that are represented in the sparse-matrix format (see sparse matrix example above). This function uses the native sparse representation while computing the SVD. </p><dl class="section note"><dt>Note</dt><dd>Prefer this function when the matrix is highly sparse, since it works on the sparse representation directly rather than converting the matrix to dense form. </dd></dl>
+<pre class="syntax">
+svd_sparse_native( source_table,
+                   output_table_prefix,
+                   row_id,
+                   col_id,
+                   value,
+                   row_dim,
+                   col_dim,
+                   k,
+                   n_iterations,
+                   result_summary_table
+                 );
+</pre><p> <b>Arguments</b> </p><dl class="arglist">
+<dt>source_table </dt>
+<dd>TEXT. Source table name (sparse matrix - see example above). </dd>
+<dt>output_table_prefix </dt>
+<dd>TEXT. Prefix for output tables. See <a href="#output">Output Tables</a> below for a description of the convention used. </dd>
+<dt>row_id </dt>
+<dd>TEXT. Name of the column containing the row index for each entry in sparse matrix. </dd>
+<dt>col_id </dt>
+<dd>TEXT. Name of the column containing the column index for each entry in sparse matrix. </dd>
+<dt>value </dt>
+<dd>TEXT. Name of column containing the non-zero values of the sparse matrix. </dd>
+<dt>row_dim </dt>
+<dd>INTEGER. Number of rows in matrix. </dd>
+<dt>col_dim </dt>
+<dd>INTEGER. Number of columns in matrix. </dd>
+<dt>k </dt>
+<dd>INTEGER. Number of singular values to compute. </dd>
+<dt>n_iterations (optional) </dt>
+<dd>INTEGER. Number of iterations to run. <dl class="section note"><dt>Note</dt><dd>The number of iterations must be in the range [k, column dimension], where k is the number of singular values. </dd></dl>
+</dd>
+<dt>result_summary_table (optional) </dt>
+<dd>TEXT. Table name to store result summary. </dd>
+</dl>
+<hr/>
+<p><a class="anchor" id="output"></a></p><dl class="section user"><dt>Output Tables</dt><dd></dd></dl>
+<p>Output for singular vectors and singular values is in the following three tables:</p><ul>
+<li>Left singular matrix: Table is named &lt;output_table_prefix&gt;_u (e.g. ‘netflix_u’)</li>
+<li>Right singular matrix: Table is named &lt;output_table_prefix&gt;_v (e.g. ‘netflix_v’)</li>
+<li>Singular values: Table is named &lt;output_table_prefix&gt;_s (e.g. ‘netflix_s’)</li>
+</ul>
+<p>The left and right singular vector tables are of the format: </p><table class="output">
+<tr>
+<th>row_id </th><td>INTEGER. The ID corresponding to each singular value (in decreasing order).  </td></tr>
+<tr>
+<th>row_vec </th><td>FLOAT8[]. Singular vector elements for this row_id. Each array is of size k.  </td></tr>
+</table>
+<p>The singular values table is in sparse table format, since only the diagonal elements of the matrix are non-zero: </p><table class="output">
+<tr>
+<th>row_id </th><td>INTEGER. <em>i</em> for the <em>i</em>th singular value.  </td></tr>
+<tr>
+<th>col_id </th><td>INTEGER. <em>i</em> for the <em>i</em>th singular value (same as row_id).  </td></tr>
+<tr>
+<th>value </th><td>FLOAT8. Singular value.  </td></tr>
+</table>
+<p>All <code>row_id</code> and <code>col_id</code> in the above tables start from 1.</p>
+<p>The result summary table has the following columns: </p><table class="output">
+<tr>
+<th>rows_used </th><td>INTEGER. Number of rows used for SVD calculation.  </td></tr>
+<tr>
+<th>exec_time </th><td>FLOAT8. Total time for executing SVD, in milliseconds.  </td></tr>
+<tr>
+<th>iter </th><td>INTEGER. Total number of iterations run.  </td></tr>
+<tr>
+<th>recon_error </th><td>FLOAT8. Total quality score (i.e. approximation quality) for this orthonormal basis.  </td></tr>
+<tr>
+<th>relative_recon_error </th><td>FLOAT8. Relative quality score.  </td></tr>
+</table>
+<p>In the result summary table, the reconstruction error is computed as \( \sqrt{mean((X - USV^T)_{ij}^2)} \), where the average is over all elements of the matrices. The relative reconstruction error is then computed as the ratio of the reconstruction error to \( \sqrt{mean(X_{ij}^2)} \).</p>
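+<p>The two error formulas above can be checked by hand on a tiny case. The following Python sketch is illustrative only: the helper names (<code>matmul</code>, <code>transpose</code>, <code>rms</code>, <code>recon_errors</code>) are hypothetical and are not part of MADlib. It assumes the factors are given as plain nested lists, with \(S\) as a full diagonal matrix.</p>

```python
import math


def matmul(a, b):
    """Plain nested-list matrix product."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


def transpose(m):
    return [list(col) for col in zip(*m)]


def rms(m):
    """Square root of the mean of squared entries, over all elements."""
    values = [v for row in m for v in row]
    return math.sqrt(sum(v * v for v in values) / len(values))


def recon_errors(x, u, s, v):
    """Return (recon_error, relative_recon_error) for X ~ U S V^T."""
    approx = matmul(matmul(u, s), transpose(v))
    residual = [[x[i][j] - approx[i][j] for j in range(len(x[0]))]
                for i in range(len(x))]
    recon = rms(residual)
    return recon, recon / rms(x)


# X = diag(3, 2); keep only the largest singular value (3) in S.
x = [[3.0, 0.0], [0.0, 2.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]
s_truncated = [[3.0, 0.0], [0.0, 0.0]]

recon, relative = recon_errors(x, identity, s_truncated, identity)
```

+<p>In this case the residual has a single non-zero entry (2), so the reconstruction error is \( \sqrt{4/4} = 1 \) and the relative error is \( 1 / \sqrt{13/4} = 2/\sqrt{13} \).</p>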
+<p><a class="anchor" id="examples"></a></p><dl class="section user"><dt>Examples</dt><dd></dd></dl>
+<ol type="1">
+<li>View online help for the SVD function. <pre class="example">
+SELECT madlib.svd();
+</pre></li>
+<li>Create an input dataset (dense matrix). <pre class="example">
+DROP TABLE IF EXISTS mat, mat_sparse, svd_summary_table, svd_u, svd_v, svd_s;
+CREATE TABLE mat (
+    row_id integer,
+    row_vec double precision[]
+);
+INSERT INTO mat VALUES
+(1,'{396,840,353,446,318,886,15,584,159,383}'),
+(2,'{691,58,899,163,159,533,604,582,269,390}'), 
+(3,'{293,742,298,75,404,857,941,662,846,2}'),
+(4,'{462,532,787,265,982,306,600,608,212,885}'),
+(5,'{304,151,337,387,643,753,603,531,459,652}'),
+(6,'{327,946,368,943,7,516,272,24,591,204}'),
+(7,'{877,59,260,302,891,498,710,286,864,675}'),
+(8,'{458,959,774,376,228,354,300,669,718,565}'),
+(9,'{824,390,818,844,180,943,424,520,65,913}'),
+(10,'{882,761,398,688,761,405,125,484,222,873}'),
+(11,'{528,1,860,18,814,242,314,965,935,809}'),
+(12,'{492,220,576,289,321,261,173,1,44,241}'),
+(13,'{415,701,221,503,67,393,479,218,219,916}'),
+(14,'{350,192,211,633,53,783,30,444,176,932}'),
+(15,'{909,472,871,695,930,455,398,893,693,838}'),
+(16,'{739,651,678,577,273,935,661,47,373,618}');
+</pre></li>
+<li>Run SVD function for a dense matrix. <pre class="example">
+SELECT madlib.svd( 'mat',       -- Input table
+                   'svd',       -- Output table prefix
+                   'row_id',    -- Column name with row index 
+                   10,          -- Number of singular values to compute
+                   NULL,        -- Use default number of iterations
+                   'svd_summary_table'  -- Result summary table
+                 );
+</pre></li>
+<li>Print out the singular values and the summary table. For the singular values: <pre class="example">
+SELECT * FROM svd_s ORDER BY row_id;
+</pre> Result: <pre class="result">
+ row_id | col_id |      value       
+&#160;--------+--------+------------------
+      1 |      1 | 6475.67225281804
+      2 |      2 | 1875.18065580415
+      3 |      3 | 1483.25228429636
+      4 |      4 | 1159.72262897427
+      5 |      5 | 1033.86092570574
+      6 |      6 | 948.437358703966
+      7 |      7 | 795.379572772455
+      8 |      8 | 709.086240684469
+      9 |      9 | 462.473775959371
+     10 |     10 | 365.875217945698
+     10 |     10 |                 
+(11 rows)
+</pre> For the summary table: <pre class="example">
+SELECT * FROM svd_summary_table;
+</pre> Result: <pre class="result">
+ rows_used | exec_time (ms) | iter |    recon_error    | relative_recon_error 
+&#160;-----------+----------------+------+-------------------+----------------------
+        16 |        1332.47 |   10 | 4.36920148766e-13 |    7.63134130332e-16
+(1 row)
+</pre></li>
+<li>Create a sparse matrix by running the <a class="el" href="matrix__ops_8sql__in.html#a390fb7234f49e17c780e961184873759">matrix_sparsify()</a> utility function on the dense matrix. <pre class="example">
+SELECT madlib.matrix_sparsify('mat', 
+                              'row=row_id, val=row_vec',
+                              'mat_sparse',
+                              'row=row_id, col=col_id, val=value');
+</pre></li>
+<li>Run the SVD function for a sparse matrix. <pre class="example">
+SELECT madlib.svd_sparse( 'mat_sparse',   -- Input table
+                          'svd',          -- Output table prefix
+                          'row_id',       -- Column name with row index 
+                          'col_id',       -- Column name with column index 
+                          'value',        -- Matrix cell value
+                          16,             -- Number of rows in matrix
+                          10,             -- Number of columns in matrix    
+                          10              -- Number of singular values to compute
+                          );
+</pre></li>
+<li>Run the SVD function for a very sparse matrix. <pre class="example">
+SELECT madlib.svd_sparse_native( 'mat_sparse',   -- Input table
+                                 'svd',          -- Output table prefix
+                                 'row_id',       -- Column name with row index
+                                 'col_id',       -- Column name with column index
+                                 'value',        -- Matrix cell value
+                                 16,             -- Number of rows in matrix
+                                 10,             -- Number of columns in matrix
+                                 10              -- Number of singular values to compute
+                               );
+</pre> <a class="anchor" id="background"></a><dl class="section user"><dt>Technical Background</dt><dd>In linear algebra, the singular value decomposition (SVD) is a factorization of a real or complex matrix, with many useful applications in signal processing and statistics. Let \(A\) be an \(m \times n\) matrix, where \(m \ge n\). Then \(A\) can be decomposed as follows: <p class="formulaDsp">
+\[ A = U \Sigma V^T, \]
+</p>
+ where \(U\) is an \(m \times n\) orthonormal matrix, \(\Sigma\) is an \(n \times n\) diagonal matrix, and \(V\) is an \(n \times n\) orthonormal matrix. The diagonal elements of \(\Sigma\) are called the <em>singular values</em>. It is possible to formulate the problem of computing the singular triplets \((\sigma_i, u_i, v_i)\) of \(A\) as an eigenvalue problem involving a Hermitian matrix related to \(A\). There are two possible ways of achieving this:</dd></dl>
+</li>
+</ol>
+<ul>
+<li>With the cross product matrix, either \(A^TA\) or \(AA^T\)</li>
+<li>With the cyclic matrix <p class="formulaDsp">
+\[ H(A) = \begin{bmatrix} 0 &amp; A\\ A^* &amp; 0 \end{bmatrix} \]
+</p>
+ The singular values are the nonnegative square roots of the eigenvalues of the cross product matrix. This approach may imply a severe loss of accuracy in the smallest singular values. The cyclic matrix approach is an alternative that avoids this problem, but at the expense of significantly increasing the cost of the computation. Computing the cross product matrix explicitly is not recommended, especially when \(A\) is sparse. Bidiagonalization was proposed by Golub and Kahan [citation?] as a way of tridiagonalizing the cross product matrix without forming it explicitly. Consider the following decomposition: <p class="formulaDsp">
+\[ A = P B Q^T, \]
+</p>
+ where \(P\) and \(Q\) are unitary matrices and \(B\) is an \(m \times n\) upper bidiagonal matrix. Then the tridiagonal matrix \(B^*B\) is unitarily similar to \(A^*A\). Additionally, specific methods exist that compute the singular values of \(B\) without forming \(B^*B\). Therefore, after computing the SVD of \(B\), <p class="formulaDsp">
+\[ B = X\Sigma Y^T, \]
+</p>
+ it only remains to combine the factors to obtain the SVD of the original matrix, with \(U = PX\) and \(V = QY\). </li>
+</ul>
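+<p>The relation between the singular values and the cross product matrix can be illustrated on a real 2x2 matrix, where the eigenvalues of \(A^TA\) have a closed form. The Python sketch below is for illustration only: <code>singular_values_2x2</code> is a hypothetical helper, and it forms \(A^TA\) explicitly, which is exactly what the discussion above advises against for real workloads. It is not the method used by the SVD functions on this page.</p>

```python
import math


def singular_values_2x2(a):
    """Singular values of a real 2x2 matrix via the eigenvalues of A^T A.

    Hypothetical helper for illustration; forming A^T A explicitly can
    lose accuracy in the smallest singular value.
    """
    (a11, a12), (a21, a22) = a
    # Entries of the symmetric cross product matrix C = A^T A.
    c11 = a11 * a11 + a21 * a21
    c12 = a11 * a12 + a21 * a22
    c22 = a12 * a12 + a22 * a22
    # Eigenvalues of a symmetric 2x2 matrix from its trace and determinant.
    trace = c11 + c22
    det = c11 * c22 - c12 * c12
    disc = math.sqrt(max(trace * trace - 4.0 * det, 0.0))
    lam_max = (trace + disc) / 2.0
    lam_min = max((trace - disc) / 2.0, 0.0)
    # Singular values are the nonnegative square roots of the eigenvalues.
    return math.sqrt(lam_max), math.sqrt(lam_min)


sigma1, sigma2 = singular_values_2x2([[3.0, 0.0], [4.0, 5.0]])
```

+<p>For this matrix \(A^TA\) has eigenvalues 45 and 5, so the singular values are \(\sqrt{45}\) and \(\sqrt{5}\); their product equals \(|\det A| = 15\), which is a quick sanity check.</p>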
+</div><!-- contents -->
+</div><!-- doc-content -->
+<!-- start footer part -->
+<div id="nav-path" class="navpath"><!-- id is needed for treeview function! -->
+  <ul>
+    <li class="footer">Generated on Wed Dec 27 2017 19:05:57 for MADlib by
+    <a href="http://www.doxygen.org/index.html">
+    <img class="footer" src="doxygen.png" alt="doxygen"/></a> 1.8.13 </li>
+  </ul>
+</div>
+</body>
+</html>

http://git-wip-us.apache.org/repos/asf/madlib-site/blob/6c103d3e/docs/v1.13/group__grp__svec.html
----------------------------------------------------------------------
diff --git a/docs/v1.13/group__grp__svec.html b/docs/v1.13/group__grp__svec.html
new file mode 100644
index 0000000..f2a8fd3
--- /dev/null
+++ b/docs/v1.13/group__grp__svec.html
@@ -0,0 +1,448 @@
+<!-- HTML header for doxygen 1.8.4-->
+<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
+<html xmlns="http://www.w3.org/1999/xhtml">
+<head>
+<meta http-equiv="Content-Type" content="text/xhtml;charset=UTF-8"/>
+<meta http-equiv="X-UA-Compatible" content="IE=9"/>
+<meta name="generator" content="Doxygen 1.8.13"/>
+<meta name="keywords" content="madlib,postgres,greenplum,machine learning,data mining,deep learning,ensemble methods,data science,market basket analysis,affinity analysis,pca,lda,regression,elastic net,huber white,proportional hazards,k-means,latent dirichlet allocation,bayes,support vector machines,svm"/>
+<title>MADlib: Sparse Vectors</title>
+<link href="tabs.css" rel="stylesheet" type="text/css"/>
+<script type="text/javascript" src="jquery.js"></script>
+<script type="text/javascript" src="dynsections.js"></script>
+<link href="navtree.css" rel="stylesheet" type="text/css"/>
+<script type="text/javascript" src="resize.js"></script>
+<script type="text/javascript" src="navtreedata.js"></script>
+<script type="text/javascript" src="navtree.js"></script>
+<script type="text/javascript">
+  $(document).ready(initResizable);
+</script>
+<link href="search/search.css" rel="stylesheet" type="text/css"/>
+<script type="text/javascript" src="search/searchdata.js"></script>
+<script type="text/javascript" src="search/search.js"></script>
+<script type="text/javascript">
+  $(document).ready(function() { init_search(); });
+</script>
+<script type="text/x-mathjax-config">
+  MathJax.Hub.Config({
+    extensions: ["tex2jax.js", "TeX/AMSmath.js", "TeX/AMSsymbols.js"],
+    jax: ["input/TeX","output/HTML-CSS"],
+});
+</script><script type="text/javascript" src="http://cdn.mathjax.org/mathjax/latest/MathJax.js"></script>
+<!-- hack in the navigation tree -->
+<script type="text/javascript" src="eigen_navtree_hacks.js"></script>
+<link href="doxygen.css" rel="stylesheet" type="text/css" />
+<link href="madlib_extra.css" rel="stylesheet" type="text/css"/>
+<!-- google analytics -->
+<script>
+  (function(i,s,o,g,r,a,m){i['GoogleAnalyticsObject']=r;i[r]=i[r]||function(){
+  (i[r].q=i[r].q||[]).push(arguments)},i[r].l=1*new Date();a=s.createElement(o),
+  m=s.getElementsByTagName(o)[0];a.async=1;a.src=g;m.parentNode.insertBefore(a,m)
+  })(window,document,'script','//www.google-analytics.com/analytics.js','ga');
+  ga('create', 'UA-45382226-1', 'madlib.apache.org');
+  ga('send', 'pageview');
+</script>
+</head>
+<body>
+<div id="top"><!-- do not remove this div, it is closed by doxygen! -->
+<div id="titlearea">
+<table cellspacing="0" cellpadding="0">
+ <tbody>
+ <tr style="height: 56px;">
+  <td id="projectlogo"><a href="http://madlib.apache.org"><img alt="Logo" src="madlib.png" height="50" style="padding-left:0.5em;" border="0"/></a></td>
+  <td style="padding-left: 0.5em;">
+   <div id="projectname">
+   <span id="projectnumber">1.13</span>
+   </div>
+   <div id="projectbrief">User Documentation for MADlib</div>
+  </td>
+   <td>        <div id="MSearchBox" class="MSearchBoxInactive">
+        <span class="left">
+          <img id="MSearchSelect" src="search/mag_sel.png"
+               onmouseover="return searchBox.OnSearchSelectShow()"
+               onmouseout="return searchBox.OnSearchSelectHide()"
+               alt=""/>
+          <input type="text" id="MSearchField" value="Search" accesskey="S"
+               onfocus="searchBox.OnSearchFieldFocus(true)" 
+               onblur="searchBox.OnSearchFieldFocus(false)" 
+               onkeyup="searchBox.OnSearchFieldChange(event)"/>
+          </span><span class="right">
+            <a id="MSearchClose" href="javascript:searchBox.CloseResultsWindow()"><img id="MSearchCloseImg" border="0" src="search/close.png" alt=""/></a>
+          </span>
+        </div>
+</td>
+ </tr>
+ </tbody>
+</table>
+</div>
+<!-- end header part -->
+<!-- Generated by Doxygen 1.8.13 -->
+<script type="text/javascript">
+var searchBox = new SearchBox("searchBox", "search",false,'Search');
+</script>
+</div><!-- top -->
+<div id="side-nav" class="ui-resizable side-nav-resizable">
+  <div id="nav-tree">
+    <div id="nav-tree-contents">
+      <div id="nav-sync" class="sync"></div>
+    </div>
+  </div>
+  <div id="splitbar" style="-moz-user-select:none;" 
+       class="ui-resizable-handle">
+  </div>
+</div>
+<script type="text/javascript">
+$(document).ready(function(){initNavTree('group__grp__svec.html','');});
+</script>
+<div id="doc-content">
+<!-- window showing the filter options -->
+<div id="MSearchSelectWindow"
+     onmouseover="return searchBox.OnSearchSelectShow()"
+     onmouseout="return searchBox.OnSearchSelectHide()"
+     onkeydown="return searchBox.OnSearchSelectKey(event)">
+</div>
+
+<!-- iframe showing the search results (closed by default) -->
+<div id="MSearchResultsWindow">
+<iframe src="javascript:void(0)" frameborder="0" 
+        name="MSearchResults" id="MSearchResults">
+</iframe>
+</div>
+
+<div class="header">
+  <div class="headertitle">
+<div class="title">Sparse Vectors<div class="ingroups"><a class="el" href="group__grp__datatrans.html">Data Types and Transformations</a> &raquo; <a class="el" href="group__grp__arraysmatrix.html">Arrays and Matrices</a></div></div>  </div>
+</div><!--header-->
+<div class="contents">
+<div class="toc"><b>Contents</b> <ul>
+<li>
+<a href="#usage">Using Sparse Vectors</a> </li>
+<li>
+<a href="#vectorization">Document Vectorization into Sparse Vectors</a> </li>
+<li>
+<a href="#examples">Examples</a> </li>
+<li>
+<a href="#related">Related Topics</a> </li>
+</ul>
+</div><p>This module implements a sparse vector data type, named "svec", which provides compressed storage of vectors that have many duplicate elements.</p>
+<p>Arrays of floating point numbers for various calculations sometimes have long runs of zeros (or some other default value). This is common in applications like scientific computing, retail optimization, and text processing. Each floating point number takes 8 bytes of storage in memory and/or disk, so saving those zeros is often worthwhile. There are also many computations that can benefit from skipping over the zeros.</p>
+<p>Consider, for example, the following array of doubles stored as a Postgres/Greenplum "float8[]" data type:</p>
+<pre class="example">
+'{0, 33,...40,000 zeros..., 12, 22 }'::float8[]
+</pre><p>This array would occupy slightly more than 320KB of memory or disk, most of it zeros. Even if we were to exploit the null bitmap and store the zeros as nulls, we would still end up with a 5KB null bitmap, which is still not nearly as memory efficient as we'd like. Also, as we perform various operations on the array, we do work on 40,000 fields that turn out to be unimportant.</p>
+<p>To solve the problems associated with the processing of vectors discussed above, the svec type employs a simple Run Length Encoding (RLE) scheme to represent sparse vectors as pairs of count-value arrays. For example, the array above would be represented as</p>
+<pre class="example">
+'{1,1,40000,1,1}:{0,33,0,12,22}'::madlib.svec
+</pre><p>which says there is 1 occurrence of 0, followed by 1 occurrence of 33, followed by 40,000 occurrences of 0, etc. This uses just 5 integers and 5 floating point numbers to store the array. Further, it is easy to implement vector operations that can take advantage of the RLE representation to make computations faster. The SVEC module provides a library of such functions.</p>
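+<p>The count-value encoding described above can be sketched in a few lines of Python. This is only an illustration of run-length encoding, not the svec implementation; the function names <code>rle_encode</code> and <code>rle_decode</code> are hypothetical.</p>

```python
def rle_encode(values):
    """Run-length encode a sequence into parallel (counts, values) lists."""
    counts, uniques = [], []
    for v in values:
        if uniques and uniques[-1] == v:
            counts[-1] += 1          # extend the current run
        else:
            counts.append(1)         # start a new run
            uniques.append(v)
    return counts, uniques


def rle_decode(counts, uniques):
    """Expand (counts, values) back into the full sequence."""
    out = []
    for c, v in zip(counts, uniques):
        out.extend([v] * c)
    return out


# The array from the text: 0, 33, then 40,000 zeros, then 12, 22.
dense = [0, 33] + [0] * 40000 + [12, 22]
counts, uniques = rle_encode(dense)
```

+<p>Encoding the 40,004-element array yields the same five counts and five values as the svec literal shown above, and decoding restores the original array.</p>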
+<p>The current version only supports sparse vectors of float8 values. Future versions will support other base types.</p>
+<p><a class="anchor" id="usage"></a></p><dl class="section user"><dt>Using Sparse Vectors</dt><dd></dd></dl>
+<p>An SVEC can be constructed directly with a constant expression, as follows: </p><pre class="example">
+SELECT '{n1,n2,...,nk}:{v1,v2,...,vk}'::madlib.svec;
+</pre><p> where <code>n1,n2,...,nk</code> specifies the counts for the values <code>v1,v2,...,vk</code>.</p>
+<p>A float array can be cast to an SVEC: </p><pre class="example">
+SELECT ('{v1,v2,...,vk}'::float[])::madlib.svec;
+</pre><p>An SVEC can be created with an aggregation: </p><pre class="example">
+SELECT madlib.svec_agg(v1) FROM generate_series(1,k);
+</pre><p>An SVEC can be created using the <code>madlib.svec_cast_positions_float8arr()</code> function by supplying an array of positions and an array of values at those positions: </p><pre class="example">
+SELECT madlib.svec_cast_positions_float8arr(
+    array[n1,n2,...nk],    -- positions of values in vector
+    array[v1,v2,...vk],    -- values at each position
+    length,                -- length of vector
+    base)                  -- value at unspecified positions
+</pre><p> For example, the following expression: </p><pre class="example">
+SELECT madlib.svec_cast_positions_float8arr(
+    array[1,3,5],
+    array[2,4,6],
+    10,
+    0.0)
+</pre><p> produces this SVEC: </p><pre class="result">
+ svec_cast_positions_float8arr
+ &#160;------------------------------
+ {1,1,1,1,1,5}:{2,0,4,0,6,0}
+</pre><p>Add madlib to the search_path to use the svec operators defined in the module.</p>
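+<p>A minimal sketch of that setup, assuming MADlib is installed in a schema named <code>madlib</code> (the schema name is an assumption; adjust it to match the installation):</p>
+<pre class="example">
+-- Make the svec type and its operators resolvable without schema qualification.
+SET search_path TO "$user", public, madlib;
+-- The %*% (dot product) operator is now found through the search path.
+SELECT '{0,1,5}'::float8[]::svec %*% '{4,3,2}'::float8[]::svec;
+</pre>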
+<p><a class="anchor" id="vectorization"></a></p><dl class="section user"><dt>Document Vectorization into Sparse Vectors</dt><dd>This module provides an efficient way to vectorize documents, converting text documents into the sparse vector representation (MADlib.svec) required by various machine learning algorithms in MADlib.</dd></dl>
+<p>The function accepts two tables as input, a dictionary table and a documents table, and produces the specified output table containing sparse vectors for the documents in the documents table.</p>
+<pre class="syntax">
+madlib.gen_doc_svecs(output_tbl,
+                     dictionary_tbl,
+                     dict_id_col,
+                     dict_term_col,
+                     documents_tbl,
+                     doc_id_col,
+                     doc_term_col,
+                     doc_term_info_col
+                    )
+</pre><p> <b>Arguments</b> </p><dl class="arglist">
+<dt>output_tbl </dt>
+<dd><p class="startdd">TEXT. Name of the output table to be created containing the sparse vector representation of the documents. It has the following columns: </p><table class="output">
+<tr>
+<th>doc_id </th><td>__TYPE_DOC__. Document id. <br />
+ __TYPE_DOC__: Column type depends on the type of <code>doc_id_col</code> in <code>documents_tbl</code>.   </td></tr>
+<tr>
+<th>sparse_vector </th><td>MADlib.svec. Corresponding sparse vector representation.  </td></tr>
+</table>
+<p class="enddd"></p>
+</dd>
+<dt>dictionary_tbl </dt>
+<dd><p class="startdd">TEXT. Name of the dictionary table containing features. </p><table class="output">
+<tr>
+<th>dict_id_col </th><td>TEXT. Name of the id column in the <code>dictionary_tbl</code>. <br />
+ Expected Type: INTEGER or BIGINT. <br />
+ NOTE: Values must be contiguous, ranging from 0 to (total number of elements in the dictionary - 1).  </td></tr>
+<tr>
+<th>dict_term_col </th><td>TEXT. Name of the column containing the terms (features) in <code>dictionary_tbl</code>.  </td></tr>
+</table>
+<p class="enddd"></p>
+</dd>
+<dt>documents_tbl </dt>
+<dd>TEXT. Name of the documents table representing documents. <table class="output">
+<tr>
+<th>doc_id_col </th><td>TEXT. Name of the id column in the <code>documents_tbl</code>.  </td></tr>
+<tr>
+<th>doc_term_col </th><td>TEXT. Name of the term column in the <code>documents_tbl</code>.  </td></tr>
+<tr>
+<th>doc_term_info_col </th><td>TEXT. Name of the term info column in <code>documents_tbl</code>. The column must be one of the following types: <br />
+ - INTEGER, BIGINT or DOUBLE PRECISION: Values are used directly to populate the vector. <br />
+ - ARRAY: The length of the array is used to populate the vector. <br />
+ See the example below for a use case of each column type.   </td></tr>
+</table>
+</dd>
+</dl>
+<p><b>Example:</b> <br />
+ Consider a corpus consisting of a set of documents, each made up of features (terms), along with doc ids: </p><pre class="example">
+1, {this,is,one,document,in,the,corpus}
+2, {i,am,the,second,document,in,the,corpus}
+3, {being,third,never,really,bothered,me,until,now}
+4, {the,document,before,me,is,the,third,document}
+</pre><ol type="1">
+<li>Prepare the documents table in the appropriate format. <br />
+ The corpus specified above can be represented by either of the following <code>documents_table</code> layouts: <pre class="example">
+SELECT * FROM documents_table ORDER BY id;
+</pre> Result: <pre class="result">
+  id |   term   | count                 id |   term   | positions
+&#160;----+----------+-------               ----+----------+-----------
+   1 | is       |     1                  1 | is       | {1}
+   1 | in       |     1                  1 | in       | {4}
+   1 | one      |     1                  1 | one      | {2}
+   1 | this     |     1                  1 | this     | {0}
+   1 | the      |     1                  1 | the      | {5}
+   1 | document |     1                  1 | document | {3}
+   1 | corpus   |     1                  1 | corpus   | {6}
+   2 | second   |     1                  2 | second   | {3}
+   2 | document |     1                  2 | document | {4}
+   2 | corpus   |     1                  2 | corpus   | {7}
+   . | ...      |    ..                  . | ...      | ...
+   4 | document |     2                  4 | document | {1,7}
+...
+</pre></li>
+<li>Prepare the dictionary table in the appropriate format. <pre class="example">
+SELECT * FROM dictionary_table ORDER BY id;
+</pre> Result: <pre class="result">
+  id |   term
+&#160;----+----------
+   0 | am
+   1 | before
+   2 | being
+   3 | bothered
+   4 | corpus
+   5 | document
+   6 | i
+   7 | in
+   8 | is
+   9 | me
+...
+</pre></li>
+<li>Generate sparse vectors for the documents using dictionary_table and documents_table. <br />
+ <code>doc_term_info_col</code> (count) of type INTEGER: <pre class="example">
+SELECT * FROM madlib.gen_doc_svecs('svec_output', 'dictionary_table', 'id', 'term',
+                            'documents_table', 'id', 'term', 'count');
+</pre> <code>doc_term_info_col</code> (positions) of type ARRAY: <pre class="example">
+SELECT * FROM madlib.gen_doc_svecs('svec_output', 'dictionary_table', 'id', 'term',
+                            'documents_table', 'id', 'term', 'positions');
+</pre> Result: <pre class="result">
+                                 gen_doc_svecs
+&#160;--------------------------------------------------------------------------------------
+ Created table svec_output (doc_id, sparse_vector) containing sparse vectors
+(1 row)
+</pre></li>
+<li>Analyze the sparse vectors created. <pre class="example">
+SELECT * FROM svec_output ORDER BY doc_id;
+</pre> Result: <pre class="result">
+ doc_id |                  sparse_vector
+&#160;--------+-------------------------------------------------
+      1 | {4,2,1,2,3,1,2,1,1,1,1}:{0,1,0,1,0,1,0,1,0,1,0}
+      2 | {1,3,4,6,1,1,3}:{1,0,1,0,1,2,0}
+      3 | {2,2,5,3,1,1,2,1,1,1}:{0,1,0,1,0,1,0,1,0,1}
+      4 | {1,1,3,1,2,2,5,1,1,2}:{0,1,0,2,0,1,0,2,1,0}
+(4 rows)
+</pre></li>
+</ol>
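+<p>The input tables in steps 1 and 2 could be derived from raw documents along the following lines. This is only a sketch: the table <code>raw_documents(id, words)</code> is hypothetical, only two documents are loaded, and only the count form of <code>documents_table</code> is built.</p>
+<pre class="example">
+-- Hypothetical raw input: one row per document, words stored as a text array.
+CREATE TABLE raw_documents (id INTEGER, words TEXT[]);
+INSERT INTO raw_documents VALUES
+    (1, '{this,is,one,document,in,the,corpus}'),
+    (2, '{i,am,the,second,document,in,the,corpus}');
+-- One row per (document, term) with the number of occurrences.
+CREATE TABLE documents_table AS
+    SELECT id, term, count(*)::INTEGER AS count
+    FROM (SELECT id, unnest(words) AS term FROM raw_documents) t
+    GROUP BY id, term;
+-- Dictionary ids must run contiguously from 0, so subtract 1 from row_number().
+CREATE TABLE dictionary_table AS
+    SELECT (row_number() OVER (ORDER BY term) - 1)::INTEGER AS id, term
+    FROM (SELECT DISTINCT term FROM documents_table) d;
+</pre>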
+<p>See the file <a class="el" href="svec_8sql__in.html" title="SQL type definitions and functions for sparse vector data type svec ">svec.sql_in</a> for complete syntax.</p>
+<p><a class="anchor" id="examples"></a></p><dl class="section user"><dt>Examples</dt><dd></dd></dl>
+<p>We can use operations with svec type like &lt;, &gt;, *, **, /, =, +, SUM, etc, and they have meanings associated with typical vector operations. For example, the plus (+) operator adds each of the terms of two vectors having the same dimension together. </p><pre class="example">
+SELECT ('{0,1,5}'::float8[]::madlib.svec + '{4,3,2}'::float8[]::madlib.svec)::float8[];
+</pre><p> Result: </p><pre class="result">
+ float8
+&#160;--------
+ {4,4,7}
+</pre><p>Without the casting into float8[] at the end, we get: </p><pre class="example">
+SELECT '{0,1,5}'::float8[]::madlib.svec + '{4,3,2}'::float8[]::madlib.svec;
+</pre><p> Result: </p><pre class="result">
+ ?column?
+&#160;---------
+{2,1}:{4,7}
+</pre><p>A dot product (%*%) between the two vectors will result in a scalar result of type float8. The dot product should be (0*4 + 1*3 + 5*2) = 13, like this: </p><pre class="example">
+SELECT '{0,1,5}'::float8[]::madlib.svec %*% '{4,3,2}'::float8[]::madlib.svec;
+</pre> <pre class="result">
+ ?column?
+&#160;---------
+    13
+</pre><p>Special vector aggregate functions are also available. SUM is self explanatory. SVEC_COUNT_NONZERO evaluates the count of non-zero terms in each column found in a set of n-dimensional svecs and returns an svec with the counts. For instance, if we have the vectors {0,1,5}, {10,0,3},{0,0,3},{0,1,0}, then executing the SVEC_COUNT_NONZERO() aggregate function would result in {1,2,3}:</p>
+<pre class="example">
+CREATE TABLE list (a madlib.svec);
+INSERT INTO list VALUES ('{0,1,5}'::float8[]), ('{10,0,3}'::float8[]), ('{0,0,3}'::float8[]),('{0,1,0}'::float8[]);
+SELECT madlib.svec_count_nonzero(a)::float8[] FROM list;
+</pre><p> Result: </p><pre class="result">
+svec_count_nonzero
+&#160;----------------
+    {1,2,3}
+</pre><p>We do not use null bitmaps in the svec data type. A null value in an svec is represented explicitly as an NVP (No Value Present) value. For example, we have: </p><pre class="example">
+SELECT '{1,2,3}:{4,null,5}'::madlib.svec;
+</pre><p> Result: </p><pre class="result">
+      svec
+&#160;------------------
+ {1,2,3}:{4,NVP,5}
+</pre><p>Adding svecs with null values results in NVPs in the sum: </p><pre class="example">
+SELECT '{1,2,3}:{4,null,5}'::madlib.svec + '{2,2,2}:{8,9,10}'::madlib.svec;
+</pre><p> Result: </p><pre class="result">
+         ?column?
+ &#160;-------------------------
+  {1,2,1,2}:{12,NVP,14,15}
+</pre><p>An element of an svec can be accessed using the <a class="el" href="svec__util_8sql__in.html#a8787222aec691f94d9808d1369aa401c">svec_proj()</a> function, which takes an svec and the index of the element desired. </p><pre class="example">
+SELECT madlib.svec_proj('{1,2,3}:{4,5,6}'::madlib.svec, 1) + madlib.svec_proj('{4,5,6}:{1,2,3}'::madlib.svec, 15);
+</pre><p> Result: </p><pre class="result"> ?column?
+&#160;---------
+    7
+</pre><p>A subvector of an svec can be accessed using the <a class="el" href="svec__util_8sql__in.html#a5cb3446de5fc117befe88ccb1ebb0e4e">svec_subvec()</a> function, which takes an svec and the start and end index of the subvector desired. </p><pre class="example">
+SELECT madlib.svec_subvec('{2,4,6}:{1,3,5}'::madlib.svec, 2, 11);
+</pre><p> Result: </p><pre class="result">   svec_subvec
+&#160;----------------
+ {1,4,5}:{1,3,5}
+</pre><p>The elements/subvector of an svec can be changed using the function <a class="el" href="svec__util_8sql__in.html#a59407764a1cbf1937da39cf39a2f447c">svec_change()</a>. It takes three arguments: an m-dimensional svec sv1, a start index j, and an n-dimensional svec sv2 such that j + n - 1 &lt;= m, and returns an svec like sv1 but with the subvector sv1[j:j+n-1] replaced by sv2. An example follows: </p><pre class="example">
+SELECT madlib.svec_change('{1,2,3}:{4,5,6}'::madlib.svec,3,'{2}:{3}'::madlib.svec);
+</pre><p> Result: </p><pre class="result">     svec_change
+&#160;--------------------
+ {1,1,2,2}:{4,5,3,6}
+</pre><p>There are also higher-order functions for processing svecs. For example, the following is the corresponding function for lapply() in R. </p><pre class="example">
+SELECT madlib.svec_lapply('sqrt', '{1,2,3}:{4,5,6}'::madlib.svec);
+</pre><p> Result: </p><pre class="result">
+                  svec_lapply
+&#160;----------------------------------------------
+ {1,2,3}:{2,2.23606797749979,2.44948974278318}
+</pre><p>The full list of functions available for operating on svecs is in svec.sql_in.</p>
+<p><b> A More Extensive Example</b></p>
+<p>For a text classification example, let's assume we have a dictionary composed of words in a sorted text array: </p><pre class="example">
+CREATE TABLE features (a text[]);
+INSERT INTO features VALUES
+            ('{am,before,being,bothered,corpus,document,i,in,is,me,
+               never,now,one,really,second,the,third,this,until}');
+</pre><p> We have a set of documents, each represented as an array of words: </p><pre class="example">
+CREATE TABLE documents(a int,b text[]);
+INSERT INTO documents VALUES
+            (1,'{this,is,one,document,in,the,corpus}'),
+            (2,'{i,am,the,second,document,in,the,corpus}'),
+            (3,'{being,third,never,really,bothered,me,until,now}'),
+            (4,'{the,document,before,me,is,the,third,document}');
+</pre><p>Now that we have a dictionary and some documents, we would like to do some document categorization using vector arithmetic on word counts and proportions of dictionary words in each document.</p>
+<p>To start this process, we'll need to find the dictionary words in each document. We'll prepare what is called a Sparse Feature Vector or SFV for each document. An SFV is a vector of dimension N, where N is the number of dictionary words, and each cell of an SFV holds the count of the corresponding dictionary word in the document.</p>
+<p>Inside the sparse vector library, we have a function that will create an SFV from a document, so we can just do the following. (For a more efficient way to convert documents into sparse vectors, especially for larger datasets, refer to <a href="#vectorization">Document Vectorization into Sparse Vectors</a>.)</p>
+<pre class="example">
+SELECT madlib.svec_sfv((SELECT a FROM features LIMIT 1),b)::float8[]
+         FROM documents;
+</pre><p> Result: </p><pre class="result">
+                svec_sfv
+&#160;----------------------------------------
+ {0,0,0,0,1,1,0,1,1,0,0,0,1,0,0,1,0,1,0}
+ {0,0,1,1,0,0,0,0,0,1,1,1,0,1,0,0,1,0,1}
+ {1,0,0,0,1,1,1,1,0,0,0,0,0,0,1,2,0,0,0}
+ {0,1,0,0,0,2,0,0,1,1,0,0,0,0,0,2,1,0,0}
+</pre><p>Note that the output of madlib.svec_sfv() is an svec for each document containing the count of each of the dictionary words in the ordinal positions of the dictionary. This can more easily be understood by lining up the feature vector and text like this:</p>
+<pre class="example">
+SELECT madlib.svec_sfv((SELECT a FROM features LIMIT 1),b)::float8[]
+                , b
+         FROM documents;
+</pre><p> Result: </p><pre class="result">
+                svec_sfv                 |                        b
+&#160;----------------------------------------+--------------------------------------------------
+ {1,0,0,0,1,1,1,1,0,0,0,0,0,0,1,2,0,0,0} | {i,am,the,second,document,in,the,corpus}
+ {0,1,0,0,0,2,0,0,1,1,0,0,0,0,0,2,1,0,0} | {the,document,before,me,is,the,third,document}
+ {0,0,0,0,1,1,0,1,1,0,0,0,1,0,0,1,0,1,0} | {this,is,one,document,in,the,corpus}
+ {0,0,1,1,0,0,0,0,0,1,1,1,0,1,0,0,1,0,1} | {being,third,never,really,bothered,me,until,now}
+</pre> <pre class="example">
+SELECT * FROM features;
+</pre> <pre class="result">
+                                                a
+&#160;-------------------------------------------------------------------------------------------------------
+{am,before,being,bothered,corpus,document,i,in,is,me,never,now,one,really,second,the,third,this,until}
+</pre><p>Now when we look at the document "i am the second document in the corpus", its SFV is {1,3*0,1,1,1,1,6*0,1,2,3*0}. The word "am" is the first entry in the dictionary and there is 1 instance of it in the SFV. The word "before" has no instances in the document, so its value is "0", and so on.</p>
+<p>The function madlib.svec_sfv() can process large numbers of documents into their SFVs in parallel at high speed.</p>
+<p>The rest of the categorization process is all vector math. The actual count is hardly ever used. Instead, it's turned into a weight. The most common weight is called tf/idf for Term Frequency / Inverse Document Frequency. The calculation for a given term in a given document is</p>
+<pre class="example">
+{#Times in document} * log {#Documents / #Documents the term appears in}.
+</pre><p>For instance, the term "document" in document 1 would have weight 1 * log (4/3). In document 4, it would have weight 2 * log (4/3). Terms that appear in every document would have tf/idf weight 0, since log (4/4) = log(1) = 0. (Our example has no term like that.) This weighting usually sends a lot of values to 0.</p>
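+<p>These weights can be checked with plain arithmetic. The sketch below assumes the natural logarithm, which is consistent with the values in the <code>weights</code> table of this example (0.69 is approximately ln 2). Note that PostgreSQL's one-argument <code>log()</code> is base 10, so <code>ln()</code> is used here.</p>
+<pre class="example">
+SELECT 1 * ln(4.0 / 3.0) AS weight_in_doc_1,   -- "document" occurs once in document 1
+       2 * ln(4.0 / 3.0) AS weight_in_doc_4,   -- and twice in document 4
+       ln(4.0 / 4.0)     AS weight_if_in_all;  -- a term found in every document gets 0
+</pre>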
+<p>For this part of the processing, we'll need to have a sparse vector of the dictionary dimension (19) with the values </p><pre class="example">
+log(#documents/#Documents each term appears in).
+</pre><p> There will be one such vector for the whole list of documents (aka the "corpus"). The #documents is just a count of all of the documents, in this case 4, but there is one divisor for each dictionary word, and its value is the number of documents in which that word appears. This single vector for the whole corpus can then be multiplied element-wise with each document SFV to produce the Term Frequency/Inverse Document Frequency weights.</p>
+<p>This can be done as follows: </p><pre class="example">
+CREATE TABLE corpus AS
+            (SELECT a, madlib.svec_sfv((SELECT a FROM features LIMIT 1),b) sfv
+         FROM documents);
+CREATE TABLE weights AS
+          (SELECT a docnum, madlib.svec_mult(sfv, logidf) tf_idf
+           FROM (SELECT madlib.svec_log(madlib.svec_div(count(sfv)::madlib.svec,madlib.svec_count_nonzero(sfv))) logidf
+                FROM corpus) foo, corpus ORDER BY docnum);
+SELECT * FROM weights;
+</pre><p> Result: </p><pre class="result">
+docnum |                tf_idf
+&#160;------+----------------------------------------------------------------------
+     1 | {4,1,1,1,2,3,1,2,1,1,1,1}:{0,0.69,0.28,0,0.69,0,1.38,0,0.28,0,1.38,0}
+     2 | {1,3,1,1,1,1,6,1,1,3}:{1.38,0,0.69,0.28,1.38,0.69,0,1.38,0.57,0}
+     3 | {2,2,5,1,2,1,1,2,1,1,1}:{0,1.38,0,0.69,1.38,0,1.38,0,0.69,0,1.38}
+     4 | {1,1,3,1,2,2,5,1,1,2}:{0,1.38,0,0.57,0,0.69,0,0.57,0.69,0}
+</pre><p>We can now get the "angular distance" between one document and the rest of the documents using the ACOS of the dot product of the document vectors, normalized by their L2 norms. The following calculates the angular distance between the first document and each of the other documents: </p><pre class="example">
+SELECT docnum,
+                180. * ( ACOS( madlib.svec_dmin( 1., madlib.svec_dot(tf_idf, testdoc)
+                    / (madlib.svec_l2norm(tf_idf)*madlib.svec_l2norm(testdoc))))/3.141592654) angular_distance
+         FROM weights,(SELECT tf_idf testdoc FROM weights WHERE docnum = 1 LIMIT 1) foo
+         ORDER BY 1;
+</pre><p> Result: </p><pre class="result">
+docnum | angular_distance
+&#160;-------+------------------
+     1 |                0
+     2 | 78.8235846096986
+     3 | 89.9999999882484
+     4 | 80.0232034288617
+</pre><p>We can see that the angular distance between document 1 and itself is 0 degrees and between document 1 and 3 is 90 degrees because they share no features at all. The angular distance can now be plugged into machine learning algorithms that rely on a distance measure between data points.</p>
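+<p>The same expression extends to every pair of documents. The following sketch reuses the <code>weights</code> table from above and swaps the hard-coded constant for the standard <code>degrees()</code> function; it is one possible formulation rather than a prescribed method:</p>
+<pre class="example">
+-- Angular distance in degrees for each unordered pair of distinct documents.
+SELECT w1.docnum AS doc_a,
+       w2.docnum AS doc_b,
+       degrees(ACOS(madlib.svec_dmin(1., madlib.svec_dot(w1.tf_idf, w2.tf_idf)
+           / (madlib.svec_l2norm(w1.tf_idf) * madlib.svec_l2norm(w2.tf_idf))))) AS angular_distance
+FROM weights w1
+JOIN weights w2 ON w1.docnum &lt; w2.docnum
+ORDER BY 1, 2;
+</pre>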
+<p>SVEC also provides a way to declare an array from an array of positions and an array of values; the positions in between are filled with a base value that the user provides in the same function call. In the example below, the first array (of integers) gives the positions for the values in the second array (of floats). Positions do not need to be sorted. The third argument is the desired maximum size of the array, which ensures that the array has that size even if the last position is smaller. If the maximum size is &lt; 1, it is ignored and the array ends at the last position in the position vector. The final argument is a float giving the base value to use between the declared positions (0 would be a common candidate):</p>
+<pre class="example">
+SELECT madlib.svec_cast_positions_float8arr(ARRAY[1,2,7,5,87],ARRAY[.1,.2,.7,.5,.87],90,0.0);
+</pre><p> Result: </p><pre class="result">
+        svec_cast_positions_float8arr
+&#160;----------------------------------------------------
+{1,1,2,1,1,1,79,1,3}:{0.1,0.2,0,0.5,0,0.7,0,0.87,0}
+(1 row)
+</pre><p><a class="anchor" id="related"></a></p><dl class="section user"><dt>Related Topics</dt><dd></dd></dl>
+<p>Other examples of svecs usage can be found in the k-means module, <a class="el" href="group__grp__kmeans.html">k-Means Clustering</a>.</p>
+<p>File <a class="el" href="svec_8sql__in.html" title="SQL type definitions and functions for sparse vector data type svec ">svec.sql_in</a> documenting the SQL functions.</p>
+</div><!-- contents -->
+</div><!-- doc-content -->
+<!-- start footer part -->
+<div id="nav-path" class="navpath"><!-- id is needed for treeview function! -->
+  <ul>
+    <li class="footer">Generated on Wed Dec 27 2017 19:05:57 for MADlib by
+    <a href="http://www.doxygen.org/index.html">
+    <img class="footer" src="doxygen.png" alt="doxygen"/></a> 1.8.13 </li>
+  </ul>
+</div>
+</body>
+</html>