Posted to commits@phoenix.apache.org by ja...@apache.org on 2014/10/29 06:38:45 UTC

svn commit: r1635051 - in /phoenix: phoenix-docs/src/docsrc/help/ phoenix-docs/src/tools/org/h2/build/doc/ site/publish/ site/publish/presentations/ site/source/src/site/markdown/ site/source/src/site/resources/presentations/

Author: jamestaylor
Date: Wed Oct 29 05:38:43 2014
New Revision: 1635051

URL: http://svn.apache.org/r1635051
Log:
Update docs for 3.2/4.2 and add resources from recent presentations

Added:
    phoenix/site/publish/presentations/HBaseCon2014-16x9.pdf   (with props)
    phoenix/site/publish/presentations/HadoopSummit2014-16x9.pdf   (with props)
    phoenix/site/publish/presentations/OC-HUG-2014-10-4x3.pdf   (with props)
    phoenix/site/source/src/site/resources/presentations/HBaseCon2014-16x9.pdf   (with props)
    phoenix/site/source/src/site/resources/presentations/HadoopSummit2014-16x9.pdf   (with props)
    phoenix/site/source/src/site/resources/presentations/OC-HUG-2014-10-4x3.pdf   (with props)
Modified:
    phoenix/phoenix-docs/src/docsrc/help/phoenix.csv
    phoenix/phoenix-docs/src/tools/org/h2/build/doc/dictionary.txt
    phoenix/site/publish/index.html
    phoenix/site/publish/pig_integration.html
    phoenix/site/publish/resources.html
    phoenix/site/publish/tuning.html
    phoenix/site/publish/update_statistics.html
    phoenix/site/source/src/site/markdown/index.md
    phoenix/site/source/src/site/markdown/resources.md
    phoenix/site/source/src/site/markdown/tuning.md
    phoenix/site/source/src/site/markdown/update_statistics.md

Modified: phoenix/phoenix-docs/src/docsrc/help/phoenix.csv
URL: http://svn.apache.org/viewvc/phoenix/phoenix-docs/src/docsrc/help/phoenix.csv?rev=1635051&r1=1635050&r2=1635051&view=diff
==============================================================================
--- phoenix/phoenix-docs/src/docsrc/help/phoenix.csv (original)
+++ phoenix/phoenix-docs/src/docsrc/help/phoenix.csv Wed Oct 29 05:38:43 2014
@@ -543,13 +543,15 @@ ID=1 AND NAME='Hi'
 operand [ compare { operand }
     | [ NOT ] IN ( { constantOperand [,...] } )
     | [ NOT ] LIKE operand
+    | [ NOT ] ILIKE operand
     | [ NOT ] BETWEEN operand AND operand
     | IS [ NOT ] NULL ]
     | NOT expression
 ","
 Boolean value or condition.
 When comparing with LIKE, the wildcard characters are ""_"" (any one character)
-and ""%"" (any characters). To search for the characters ""%"" and
+and ""%"" (any characters). ILIKE is the same, but the search is case insensitive.
+To search for the characters ""%"" and
 ""_"", the characters need to be escaped. The escape character is "" \ "" (backslash).
 Patterns that end with an escape character are invalid and the expression returns NULL.
 BETWEEN does an inclusive comparison for both operands.
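
A quick illustration of the LIKE/ILIKE distinction (the table and data here are hypothetical):

    -- LIKE is case sensitive: 'Jo%' matches 'John' but not 'JOE'
    SELECT name FROM contacts WHERE name LIKE 'Jo%';
    -- ILIKE matches case insensitively: 'John', 'JOE', and 'joan' all match
    SELECT name FROM contacts WHERE name ILIKE 'jo%';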
@@ -1386,6 +1388,15 @@ string. If the replacement string is not
 REGEXP_REPLACE('abc123ABC', '[0-9]+', '#') evaluates to 'abc#ABC'
 "
 
+"Functions (String)","REGEXP_SPLIT","
+REGEXP_SPLIT( stringTerm, patternString )
+","
+Returns a VARCHAR array by splitting the stringTerm based on the Java compatible regular
+expression patternString.
+","
+REGEXP_SPLIT('one,two,three', ',') evaluates to ['one','two','three']
+"
+
 "Functions (General)","MD5","
 MD5( term )
 ","

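Because the split pattern of REGEXP_SPLIT is a full Java regular expression rather than a single-character delimiter, character classes and quantifiers work as well. Illustrative sketches in the style of the example above:

    REGEXP_SPLIT('one, two,three', ',\s*') evaluates to ['one','two','three']
    REGEXP_SPLIT('2014-10-29', '-') evaluates to ['2014','10','29']
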
Modified: phoenix/phoenix-docs/src/tools/org/h2/build/doc/dictionary.txt
URL: http://svn.apache.org/viewvc/phoenix/phoenix-docs/src/tools/org/h2/build/doc/dictionary.txt?rev=1635051&r1=1635050&r2=1635051&view=diff
==============================================================================
--- phoenix/phoenix-docs/src/tools/org/h2/build/doc/dictionary.txt (original)
+++ phoenix/phoenix-docs/src/tools/org/h2/build/doc/dictionary.txt Wed Oct 29 05:38:43 2014
@@ -726,4 +726,4 @@ coercion coerce coerces bas precise subs
 decisions choosing tiebreaker broadcast substantially unlikely act decision adjacent
 managed declares tenant tenants especially truth determines misspelled salting salted turning adhoc
 rpc doled paranthesis reaching satisfy cocos satisfies pads indian inputting prague
-guideposts collects
+guideposts collects ilike

Modified: phoenix/site/publish/index.html
URL: http://svn.apache.org/viewvc/phoenix/site/publish/index.html?rev=1635051&r1=1635050&r2=1635051&view=diff
==============================================================================
--- phoenix/site/publish/index.html (original)
+++ phoenix/site/publish/index.html Wed Oct 29 05:38:43 2014
@@ -1,7 +1,7 @@
 
 <!DOCTYPE html>
 <!--
- Generated by Apache Maven Doxia at 2014-10-20
+ Generated by Apache Maven Doxia at 2014-10-28
  Rendered using Reflow Maven Skin 1.1.0 (http://andriusvelykis.github.io/reflow-maven-skin)
 -->
 <html  xml:lang="en" lang="en">
@@ -125,14 +125,14 @@
 <div class="page-header">
  <h1>Overview</h1>
 </div> 
-<p>Apache Phoenix is a SQL skin over HBase delivered as a client-embedded JDBC driver targeting low latency queries over HBase data. Apache Phoenix takes your SQL query, compiles it into a series of HBase scans, and orchestrates the running of those scans to produce regular JDBC result sets. The table metadata is stored in an HBase table and versioned, such that snapshot queries over prior versions will automatically use the correct schema. Direct use of the HBase API, along with coprocessors and custom filters, results in <a href="performance.html">performance</a> on the order of milliseconds for small queries, or seconds for tens of millions of rows. </p> 
+<p>Apache Phoenix is a relational database layer over HBase delivered as a client-embedded JDBC driver targeting low latency queries over HBase data. Apache Phoenix takes your SQL query, compiles it into a series of HBase scans, and orchestrates the running of those scans to produce regular JDBC result sets. The table metadata is stored in an HBase table and versioned, such that snapshot queries over prior versions will automatically use the correct schema. Direct use of the HBase API, along with coprocessors and custom filters, results in <a href="performance.html">performance</a> on the order of milliseconds for small queries, or seconds for tens of millions of rows. </p> 
 <div class="section"> 
  <h2 id="Mission">Mission</h2> 
  <p>Become the standard means of accessing HBase data through a well-defined, industry standard API.</p> 
 </div> 
 <div class="section"> 
  <h2 id="Quick_Start">Quick Start</h2> 
- <p>Tired of reading already and just want to get started? Take a look at our <a href="faq.html">FAQs</a>, listen to the Apache Phoenix talks from <a class="externalLink" href="http://www.youtube.com/watch?v=YHsHdQ08trg">Hadoop Summit 2013</a> and <a class="externalLink" href="http://www.cloudera.com/content/cloudera/en/resources/library/hbasecon/hbasecon-2013--how-and-why-phoenix-puts-the-sql-back-into-nosql-video.html">HBaseConn 2013</a>, and jump over to our quick start guide <a href="Phoenix-in-15-minutes-or-less.html">here</a>.</p> 
+ <p>Tired of reading already and just want to get started? Take a look at our <a href="faq.html">FAQs</a>, listen to the Apache Phoenix talks from <a class="externalLink" href="https://www.youtube.com/watch?v=f4Nmh5KM6gI&amp;feature=youtu.be">Hadoop Summit 2014</a>, review the <a class="externalLink" href="http://phoenix.apache.org/presentations/OC-HUG-2014-10-4x3.pdf">overview presentation</a>, and jump over to our quick start guide <a href="Phoenix-in-15-minutes-or-less.html">here</a>.</p> 
 </div> 
 <div class="section"> 
  <h2 id="SQL_Support">SQL Support</h2> 

Modified: phoenix/site/publish/pig_integration.html
URL: http://svn.apache.org/viewvc/phoenix/site/publish/pig_integration.html?rev=1635051&r1=1635050&r2=1635051&view=diff
==============================================================================
--- phoenix/site/publish/pig_integration.html (original)
+++ phoenix/site/publish/pig_integration.html Wed Oct 29 05:38:43 2014
@@ -1,7 +1,7 @@
 
 <!DOCTYPE html>
 <!--
- Generated by Apache Maven Doxia at 2014-10-20
+ Generated by Apache Maven Doxia at 2014-10-28
  Rendered using Reflow Maven Skin 1.1.0 (http://andriusvelykis.github.io/reflow-maven-skin)
 -->
 <html  xml:lang="en" lang="en">

Added: phoenix/site/publish/presentations/HBaseCon2014-16x9.pdf
URL: http://svn.apache.org/viewvc/phoenix/site/publish/presentations/HBaseCon2014-16x9.pdf?rev=1635051&view=auto
==============================================================================
Binary file - no diff available.

Propchange: phoenix/site/publish/presentations/HBaseCon2014-16x9.pdf
------------------------------------------------------------------------------
    svn:mime-type = application/octet-stream

Added: phoenix/site/publish/presentations/HadoopSummit2014-16x9.pdf
URL: http://svn.apache.org/viewvc/phoenix/site/publish/presentations/HadoopSummit2014-16x9.pdf?rev=1635051&view=auto
==============================================================================
Binary file - no diff available.

Propchange: phoenix/site/publish/presentations/HadoopSummit2014-16x9.pdf
------------------------------------------------------------------------------
    svn:mime-type = application/octet-stream

Added: phoenix/site/publish/presentations/OC-HUG-2014-10-4x3.pdf
URL: http://svn.apache.org/viewvc/phoenix/site/publish/presentations/OC-HUG-2014-10-4x3.pdf?rev=1635051&view=auto
==============================================================================
Binary file - no diff available.

Propchange: phoenix/site/publish/presentations/OC-HUG-2014-10-4x3.pdf
------------------------------------------------------------------------------
    svn:mime-type = application/octet-stream

Modified: phoenix/site/publish/resources.html
URL: http://svn.apache.org/viewvc/phoenix/site/publish/resources.html?rev=1635051&r1=1635050&r2=1635051&view=diff
==============================================================================
--- phoenix/site/publish/resources.html (original)
+++ phoenix/site/publish/resources.html Wed Oct 29 05:38:43 2014
@@ -1,7 +1,7 @@
 
 <!DOCTYPE html>
 <!--
- Generated by Apache Maven Doxia at 2014-10-20
+ Generated by Apache Maven Doxia at 2014-10-28
  Rendered using Reflow Maven Skin 1.1.0 (http://andriusvelykis.github.io/reflow-maven-skin)
 -->
 <html  xml:lang="en" lang="en">
@@ -136,16 +136,31 @@
  </thead> 
  <tbody> 
   <tr class="b"> 
+   <td>OC Hadoop User Group 2014 </td> 
+   <td>Apache Phoenix: Transforming HBase into a Relational Database </td> 
+   <td><a class="externalLink" href="http://phoenix.apache.org/presentations/OC-HUG-2014-10-4x3.pdf">OC-HUG-2014-10-4x3.pdf</a> </td> 
+  </tr> 
+  <tr class="a"> 
+   <td>Hadoop Summit 2014 </td> 
+   <td><a class="externalLink" href="https://www.youtube.com/watch?v=f4Nmh5KM6gI&amp;feature=youtu.be">Apache Phoenix: Transforming HBase into a SQL database</a> </td> 
+   <td><a class="externalLink" href="http://phoenix.apache.org/presentations/HadoopSummit2014-16x9.pdf">HadoopSummit2014-16x9.pdf</a> </td> 
+  </tr> 
+  <tr class="b"> 
+   <td>HBaseCon 2014 </td> 
+   <td><a class="externalLink" href="http://vimeo.com/98485780">Taming HBase with Apache Phoenix and SQL</a> </td> 
+   <td><a class="externalLink" href="http://phoenix.apache.org/presentations/HBaseCon2014-16x9.pdf">HBaseCon2014-16x9.pdf</a> </td> 
+  </tr> 
+  <tr class="a"> 
    <td>ApacheCon 2014 </td> 
    <td><a class="externalLink" href="https://www.youtube.com/watch?v=9qfBnFyKZwM">How Apache Phoenix enables interactive, low latency applications over your HBase data</a> </td> 
    <td><a class="externalLink" href="http://phoenix.apache.org/presentations/ApacheCon2014-16x9.pdf">ApacheCon2014-16x9.pdf</a> </td> 
   </tr> 
-  <tr class="a"> 
+  <tr class="b"> 
    <td>Hadoop Summit 2013 </td> 
    <td><a class="externalLink" href="http://www.youtube.com/watch?v=YHsHdQ08trg">How (and why) Phoenix puts the SQL back into NoSQL</a> </td> 
    <td><a class="externalLink" href="http://phoenix.apache.org/presentations/HadoopSummit2013-16x9.pdf">HadoopSummit2013-16x9.pdf</a> </td> 
   </tr> 
-  <tr class="b"> 
+  <tr class="a"> 
    <td>HBaseCon 2013 </td> 
    <td><a class="externalLink" href="http://www.cloudera.com/content/cloudera/en/resources/library/hbasecon/hbasecon-2013--how-and-why-phoenix-puts-the-sql-back-into-nosql-video.html">How (and why) Phoenix puts the SQL back into NoSQL</a> </td> 
    <td><a class="externalLink" href="http://phoenix.apache.org/presentations/HBaseCon2013-4x3.pdf">HBaseCon2013-4x3.pdf</a> </td> 

Modified: phoenix/site/publish/tuning.html
URL: http://svn.apache.org/viewvc/phoenix/site/publish/tuning.html?rev=1635051&r1=1635050&r2=1635051&view=diff
==============================================================================
--- phoenix/site/publish/tuning.html (original)
+++ phoenix/site/publish/tuning.html Wed Oct 29 05:38:43 2014
@@ -1,7 +1,7 @@
 
 <!DOCTYPE html>
 <!--
- Generated by Apache Maven Doxia at 2014-10-20
+ Generated by Apache Maven Doxia at 2014-10-28
  Rendered using Reflow Maven Skin 1.1.0 (http://andriusvelykis.github.io/reflow-maven-skin)
 -->
 <html  xml:lang="en" lang="en">
@@ -154,165 +154,155 @@
    <td>500</td> 
   </tr> 
   <tr class="b"> 
+   <td><small>phoenix.stats.guidepost.width</small></td> 
+   <td> A server-side parameter that specifies the number of bytes between guideposts. A smaller width increases parallelization, but also increases the number of chunks which must be merged on the client side. The default value is 100 MB. </td> 
+   <td>104857600</td> 
+  </tr> 
+  <tr class="a"> 
+   <td><small>phoenix.stats.guidepost.per.region</small></td> 
+   <td> A server-side parameter that specifies the number of guideposts per region. If set to a value greater than zero, then the guidepost width is determined by the MAX_FILE_SIZE of the table divided by this value. Otherwise, the <tt>phoenix.stats.guidepost.width</tt> parameter is used. No default value. </td> 
+   <td>None</td> 
+  </tr> 
+  <tr class="b"> 
+   <td><small>phoenix.stats.updateFrequency</small></td> 
+   <td> A server-side parameter that determines the frequency, in milliseconds, at which statistics will be refreshed from the statistics table and subsequently used by the client. The default value is 15 min. </td> 
+   <td>900000</td> 
+  </tr> 
+  <tr class="a"> 
+   <td><small>phoenix.stats.minUpdateFrequency</small></td> 
+   <td> A client-side parameter that determines the minimum amount of time in milliseconds that must pass before statistics may again be manually collected through another <tt>UPDATE STATISTICS</tt> call. The default value is <tt>phoenix.stats.updateFrequency</tt>/2. </td> 
+   <td>450000</td> 
+  </tr> 
+  <tr class="b"> 
+   <td><small>phoenix.stats.useCurrentTime</small></td> 
+   <td> An advanced server-side parameter that, if true, causes the current time on the server to be used as the timestamp of rows in the statistics table when background tasks such as compactions or splits occur. If false, the max timestamp found while traversing the table over which statistics are being collected is used as the timestamp. Unless your client is controlling the timestamps while reading and writing data, this parameter should be left alone. The default value is true. </td> 
+   <td>true</td> 
+  </tr> 
+  <tr class="a"> 
    <td><small>phoenix.query.spoolThresholdBytes</small></td> 
   <td style="text-align: left;">Threshold size in bytes after which the results of parallelly executed queries are spooled to disk. Default is 20 mb.</td> 
    <td>20971520</td> 
   </tr> 
-  <tr class="a"> 
+  <tr class="b"> 
    <td><small>phoenix.query.maxSpoolToDiskBytes</small></td> 
   <td style="text-align: left;">Maximum size in bytes to which the results of parallelly executed queries may be spooled to disk, above which the query will fail. Default is 1 gb.</td> 
    <td>1024000000</td> 
   </tr> 
-  <tr class="b"> 
+  <tr class="a"> 
    <td><small>phoenix.query.maxGlobalMemoryPercentage</small></td> 
   <td style="text-align: left;">Percentage of total heap memory (i.e. Runtime.getRuntime().maxMemory()) that all threads may use. Only coarse-grained memory usage is tracked, mainly accounting for memory usage in the intermediate map built during group by aggregation. When this limit is reached the clients block attempting to get more memory, essentially throttling memory usage. Defaults to 15%</td> 
    <td>15</td> 
   </tr> 
-  <tr class="a"> 
+  <tr class="b"> 
    <td><small>phoenix.query.maxGlobalMemorySize</small></td> 
    <td style="text-align: left;">Max size in bytes of total tracked memory usage. By default not specified, however, if present, the lower of this parameter and the phoenix.query.maxGlobalMemoryPercentage will be used </td> 
    <td>&nbsp;</td> 
   </tr> 
-  <tr class="b"> 
+  <tr class="a"> 
    <td><small>phoenix.query.maxGlobalMemoryWaitMs</small></td> 
    <td style="text-align: left;">Maximum amount of time that a client will block while waiting for more memory to become available. After this amount of time, an <tt>InsufficientMemoryException</tt> is thrown. Default is 10 sec.</td> 
    <td>10000</td> 
   </tr> 
-  <tr class="a"> 
+  <tr class="b"> 
    <td><small>phoenix.query.maxTenantMemoryPercentage</small></td> 
    <td style="text-align: left;">Maximum percentage of <tt>phoenix.query.maxGlobalMemoryPercentage</tt> that any one tenant is allowed to consume. After this percentage, an <tt>InsufficientMemoryException</tt> is thrown. Default is 100%</td> 
    <td>100</td> 
   </tr> 
-  <tr class="b"> 
-   <td><small>phoenix.query.targetConcurrency</small></td> 
-   <td style="text-align: left;">Target concurrent threads to use for a query. It serves as a soft limit on the number of scans into which a query may be split. The value should not exceed the hard limit imposed by<tt> phoenix.query.maxConcurrency</tt>.</td> 
-   <td>32</td> 
-  </tr> 
   <tr class="a"> 
-   <td><small>phoenix.query.maxConcurrency</small></td> 
-   <td style="text-align: left;">Maximum concurrent threads to use for a query. It servers as a hard limit on the number of scans into which a query may be split. A soft limit is imposed by <tt>phoenix.query.targetConcurrency</tt>.</td> 
-   <td>64</td> 
-  </tr> 
-  <tr class="b"> 
    <td><small>phoenix.query.dateFormat</small></td> 
    <td style="text-align: left;">Default pattern to use for conversion of a date to/from a string, whether through the <tt>TO_CHAR(&lt;date&gt;)</tt> or <tt>TO_DATE(&lt;date-string&gt;)</tt> functions, or through <tt>resultSet.getString(&lt;date-column&gt;)</tt>. Default is yyyy-MM-dd HH:mm:ss</td> 
    <td>yyyy-MM-dd HH:mm:ss</td> 
   </tr> 
-  <tr class="a"> 
+  <tr class="b"> 
    <td><small>phoenix.query.numberFormat</small></td> 
    <td style="text-align: left;">Default pattern to use for conversion of a decimal number to/from a string, whether through the <tt>TO_CHAR(&lt;decimal-number&gt;)</tt> or <tt>TO_NUMBER(&lt;decimal-string&gt;)</tt> functions, or through <tt>resultSet.getString(&lt;decimal-column&gt;)</tt>. Default is #,##0.###</td> 
    <td>#,##0.###</td> 
   </tr> 
-  <tr class="b"> 
-   <td><small>phoenix.query.statsUpdateFrequency</small></td> 
-   <td style="text-align: left;">The frequency in milliseconds at which the stats for each table will be updated. Default is 15 min.</td> 
-   <td>900000</td> 
-  </tr> 
   <tr class="a"> 
-   <td><small>phoenix.query.maxStatsAge</small></td> 
-   <td>The maximum age of stats in milliseconds after which they will no longer be used (i.e. the stats were not able to be updated in this amount of time and thus are considered too old). Default is 1 day.</td> 
-   <td>1</td> 
-  </tr> 
-  <tr class="b"> 
    <td><small>phoenix.mutate.maxSize</small></td> 
    <td style="text-align: left;">The maximum number of rows that may be batched on the client before a commit or rollback must be called.</td> 
    <td>500000</td> 
   </tr> 
-  <tr class="a"> 
+  <tr class="b"> 
    <td><small>phoenix.mutate.batchSize</small></td> 
    <td style="text-align: left;">The number of rows that are batched together and automatically committed during the execution of an <tt>UPSERT SELECT</tt> or <tt>DELETE</tt> statement. This property may be overridden at connection time by specifying the <tt>UpsertBatchSize</tt> property value. Note that the connection property value does not affect the batch size used by the coprocessor when these statements are executed completely on the server side.</td> 
    <td>1000</td> 
   </tr> 
-  <tr class="b"> 
-   <td><small>phoenix.query.maxIntraRegionParallelization</small></td> 
-   <td style="text-align: left;">The maximum number of threads that will be spawned to process data within a single region during query execution</td> 
-   <td>64</td> 
-  </tr> 
   <tr class="a"> 
-   <td><small>phoenix.query.rowKeyOrderSaltedTable</small></td> 
-   <td style="text-align: left;">Whether or not a non aggregate query returns rows in row key order for salted tables. If this option is turned on, split points may not be specified at table create time, but instead the default splits on each salt bucket must be used. Default is true</td> 
-   <td>true</td> 
-  </tr> 
-  <tr class="b"> 
    <td><small>phoenix.query.maxServerCacheBytes</small></td> 
    <td style="text-align: left;">Maximum size (in bytes) of a single sub-query result (usually the filtered result of a table) before compression and conversion to a hash map. Attempting to hash an intermediate sub-query result of a size bigger than this setting will result in a MaxServerCacheSizeExceededException. Default 100MB.</td> 
    <td>104857600</td> 
   </tr> 
-  <tr class="a"> 
+  <tr class="b"> 
    <td><small>phoenix.coprocessor.maxServerCacheTimeToLiveMs</small></td> 
    <td style="text-align: left;">Maximum living time (in milliseconds) of server caches. A cache entry expires after this amount of time has passed since last access. Consider adjusting this parameter when a server-side IOException(“Could not find hash cache for joinId”) happens. Getting warnings like “Earlier hash cache(s) might have expired on servers” might also be a sign that this number should be increased.</td> 
    <td>30000</td> 
   </tr> 
-  <tr class="b"> 
+  <tr class="a"> 
    <td><small>phoenix.query.useIndexes</small></td> 
    <td style="text-align: left;">Determines whether or not indexes are considered by the optimizer to satisfy a query. Default is true </td> 
    <td>true</td> 
   </tr> 
-  <tr class="a"> 
+  <tr class="b"> 
    <td><small>phoenix.index.mutableBatchSizeThreshold</small></td> 
    <td style="text-align: left;">Number of mutations in a batch beyond which index metadata will be sent as a separate RPC to each region server as opposed to included inline with each mutation. Defaults to 5. </td> 
    <td>5</td> 
   </tr> 
-  <tr class="b"> 
+  <tr class="a"> 
    <td><small>phoenix.schema.dropMetaData</small></td> 
    <td style="text-align: left;">Determines whether or not an HBase table is dropped when the Phoenix table is dropped. Default is true </td> 
    <td>true</td> 
   </tr> 
-  <tr class="a"> 
+  <tr class="b"> 
    <td><small>phoenix.groupby.spillable</small></td> 
    <td style="text-align: left;">Determines whether or not a GROUP BY over a large number of distinct values is allowed to spill to disk on the region server. If false, an InsufficientMemoryException will be thrown instead. Default is true </td> 
    <td>true</td> 
   </tr> 
-  <tr class="b"> 
+  <tr class="a"> 
    <td><small>phoenix.groupby.spillFiles</small></td> 
    <td style="text-align: left;">Number of memory mapped spill files to be used when spilling GROUP BY distinct values to disk. Default is 2 </td> 
    <td>2</td> 
   </tr> 
-  <tr class="a"> 
+  <tr class="b"> 
    <td><small>phoenix.groupby.maxCacheSize</small></td> 
    <td style="text-align: left;">Size in bytes of pages cached during GROUP BY spilling. Default is 100Mb </td> 
    <td>102400000</td> 
   </tr> 
-  <tr class="b"> 
+  <tr class="a"> 
    <td><small>phoenix.groupby.estimatedDistinctValues</small></td> 
    <td style="text-align: left;">Number of estimated distinct values when a GROUP BY is performed. Used to perform initial sizing with growth of 1.5x each time reallocation is required. Default is 1000 </td> 
    <td>1000</td> 
   </tr> 
-  <tr class="a"> 
+  <tr class="b"> 
    <td><small>phoenix.distinct.value.compress.threshold</small></td> 
    <td style="text-align: left;">Size in bytes beyond which aggregate operations which require tracking distinct value counts (such as COUNT DISTINCT) will use Snappy compression. Default is 1Mb </td> 
    <td>1024000</td> 
   </tr> 
-  <tr class="b"> 
+  <tr class="a"> 
    <td><small>phoenix.index.maxDataFileSizePerc</small></td> 
   <td style="text-align: left;">Percentage used to determine the MAX_FILESIZE for the shared index table for views relative to the data table MAX_FILESIZE. The percentage should be estimated based on the anticipated average size of a view index row versus the data row. Default is 50%. </td> 
    <td>50</td> 
   </tr> 
-  <tr class="a"> 
+  <tr class="b"> 
    <td><small>phoenix.coprocessor.maxMetaDataCacheTimeToLiveMs</small></td> 
    <td style="text-align: left;">Time in milliseconds after which the server-side metadata cache for a tenant will expire if not accessed. Default is 30mins </td> 
    <td>180000</td> 
   </tr> 
-  <tr class="b"> 
+  <tr class="a"> 
    <td><small>phoenix.coprocessor.maxMetaDataCacheSize</small></td> 
    <td style="text-align: left;">Max size in bytes of total server-side metadata cache after which evictions will begin to occur based on least recent access time. Default is 20Mb </td> 
    <td>20480000</td> 
   </tr> 
-  <tr class="a"> 
+  <tr class="b"> 
    <td><small>phoenix.client.maxMetaDataCacheSize</small></td> 
    <td style="text-align: left;">Max size in bytes of total client-side metadata cache after which evictions will begin to occur based on least recent access time. Default is 10Mb </td> 
    <td>10240000</td> 
   </tr> 
-  <tr class="b"> 
+  <tr class="a"> 
    <td><small>phoenix.sequence.cacheSize</small></td> 
    <td style="text-align: left;">Number of sequence values to reserve from the server and cache on the client when the next sequence value is allocated. Only used if not defined by the sequence itself. Default is 100 </td> 
    <td>100</td> 
   </tr> 
-  <tr class="a"> 
-   <td><small>phoenix.client.autoUpgradeWhiteList</small></td> 
-   <td style="text-align: left;">Comma separated list of case sensitive full table names to automatically upgrade from 2.2.x format to 3.0/4.0 format. Use * to upgrade all tables. Only applies on the first connection to a 3.0/4.0 cluster. Not specified by default. For more information, see <a class="externalLink" href="http://phoenix.apache.org/upgrade_from_2_2.html">here</a> </td> 
-   <td>&nbsp;</td> 
-  </tr> 
   <tr class="b"> 
    <td><small>phoenix.clock.skew.interval</small></td> 
    <td style="text-align: left;">Delay interval(in milliseconds) when opening SYSTEM.CATALOG to compensate possible time clock skew when SYSTEM.CATALOG moves among region servers. </td> 
@@ -333,6 +323,51 @@
    <td style="text-align: left;">Index rebuild job builds an index from when it failed - the time interval(in milliseconds) in order to create a time overlap to prevent missing updates when there exists time clock skew. </td> 
    <td>300000</td> 
   </tr> 
+  <tr class="b"> 
+   <td><small> 
+     <s>
+       phoenix.query.targetConcurrency 
+     </s></small><br />Obsolete as of 3.2/4.2</td> 
+   <td style="text-align: left;">Target concurrent threads to use for a query. It serves as a soft limit on the number of scans into which a query may be split. The value should not exceed the hard limit imposed by <tt>phoenix.query.maxConcurrency</tt>.</td> 
+   <td>32</td> 
+  </tr> 
+  <tr class="a"> 
+   <td><small> 
+     <s>
+       phoenix.query.maxConcurrency 
+     </s></small><br />Obsolete as of 3.2/4.2</td> 
+   <td style="text-align: left;">Maximum concurrent threads to use for a query. It serves as a hard limit on the number of scans into which a query may be split. A soft limit is imposed by <tt>phoenix.query.targetConcurrency</tt>.</td> 
+   <td>64</td> 
+  </tr> 
+  <tr class="b"> 
+   <td><small> 
+     <s>
+       phoenix.query.maxStatsAge 
+     </s></small><br />Obsolete as of 3.2/4.2</td> 
+   <td>The maximum age of stats in milliseconds after which they will no longer be used (i.e. the stats were not able to be updated in this amount of time and thus are considered too old). Default is 1 day.</td> 
+   <td>86400000</td> 
+  </tr> 
+  <tr class="a"> 
+   <td><small> 
+     <s>
+       phoenix.query.statsUpdateFrequency 
+     </s></small><br />Obsolete as of 3.2/4.2</td> 
+   <td style="text-align: left;">The frequency in milliseconds at which the stats for each table will be updated. Default is 15 min.</td> 
+   <td>900000</td> 
+  </tr> 
+  <tr class="b"> 
+   <td><small> 
+     <s>
+       phoenix.query.maxIntraRegionParallelization 
+     </s></small><br />Obsolete as of 3.2/4.2</td> 
+   <td style="text-align: left;">The maximum number of threads that will be spawned to process data within a single region during query execution</td> 
+   <td>64</td> 
+  </tr> 
+  <tr class="a"> 
+   <td><small>phoenix.query.rowKeyOrderSaltedTable</small></td> 
+   <td style="text-align: left;">Whether or not a non aggregate query returns rows in row key order for salted tables. If this option is turned on, split points may not be specified at table create time, but instead the default splits on each salt bucket must be used. Default is true</td> 
+   <td>true</td> 
+  </tr> 
  </tbody> 
 </table> 
 <br /> 
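
To make the guidepost parameters concrete, here is a rough back-of-the-envelope sketch (the region size is hypothetical):

    guideposts per region ~= region data size / phoenix.stats.guidepost.width
                          ~= 1 GB / 100 MB
                          ~= 10

so a full scan over such a region is broken into roughly ten parallel chunks. Conversely, setting phoenix.stats.guidepost.per.region to 20 on a table whose MAX_FILE_SIZE is 1 GB would derive a guidepost width of about 50 MB.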
@@ -340,88 +375,10 @@
  <div class="section"> 
   <div class="section"> 
    <h4 id="Parallelization"> Parallelization</h4> Phoenix breaks up aggregate queries into multiple scans and runs them in parallel through custom aggregating coprocessors to improve performance.&nbsp;Hari Kumar, from Ericsson Labs, did a good job of explaining the performance benefits of parallelization and coprocessors 
-   <a class="externalLink" href="http://labs.ericsson.com/blog/hbase-performance-tuners" target="_blank">here</a>. One of the most important factors in getting good query performance with Phoenix is to ensure that table splits are well balanced. This includes having regions of equal size as well as an even distribution across region servers. There are open source tools such as&nbsp; 
-   <a class="externalLink" href="http://www.sentric.ch/blog/hbase-split-visualisation-introducing-hannibal" target="_blank">Hannibal</a>&nbsp;that can help you monitor this. By having an even distribution of data, every thread spawned by the Phoenix client will have an equal amount of work to process, thus reducing the time it takes to get the results back. 
-   <br /> 
-   <br /> The 
-   <tt>phoenix.query.targetConcurrency</tt> and 
-   <tt>phoenix.query.maxConcurrency</tt> control how a query is broken up into multiple scans on the client side. The idea for parallelization of queries is to align the scan boundaries with region boundaries. If rows are not evenly distributed across regions, using this scheme compensates for regions that have more rows than others, by applying tighter splits and therefore spawning off more scans over the overloaded regions. 
-   <br /> 
-   <br /> The split points for parallelization are computed as follows. Let’s suppose: 
-   <br /> 
-   <ul> 
-    <li><tt>t</tt> is the target concurrency</li> 
-    <li><tt>m</tt> is the max concurrency</li> 
-    <li><tt>r</tt> is the number of regions we need to scan</li> 
-   </ul> 
-   <tt>if r &gt;= t</tt> 
-   <br /> &nbsp;&nbsp; scan using regional boundaries 
-   <br /> 
-   <tt>else if r/2 &gt; t</tt> 
-   <br /> &nbsp;&nbsp; split each region in s splits such that: 
-   <tt>s = max(x) where s * x &lt; m</tt> 
-   <br /> 
-   <tt>else</tt> 
-   <br /> &nbsp;&nbsp; split each region in s splits such that:&nbsp; 
-   <tt>s = max(x) where s * x &lt; t</tt> 
-   <br /> 
-   <br /> Depending on the number of cores in your client machine and the size of your cluster, the 
-   <tt>phoenix.query.threadPoolSize</tt>, 
-   <tt>phoenix.query.queueSize</tt>, 
-   <tt> phoenix.query.maxConcurrency</tt>, and 
-   <tt>phoenix.query.targetConcurrency</tt> may all be increased to allow more threads to process a query in parallel. This will allow Phoenix to divide up a query into more scans that may then be executed in parallel, thus reducing latency. 
-   <br /> 
-   <br /> This approach is not without its limitations. The primary issue is that Phoenix does not have sufficient information to divide up a region into equal data sizes. If the query results span many regions of data, this is not a problem, since regions are more or less of equal size. However, if a query accesses only a few regions, this can be an issue. The best Phoenix can do is to divide up the key space between the start and end key evenly. If there’s any skew in the data, then some scans are bound to bear the brunt of the work. You can adjust 
-   <tt>phoenix.query.maxIntraRegionParallelization</tt> to a smaller number to decrease the number of threads spawned per region if you find that throughput is suffering. 
-   <br /> 
-   <br /> For example, let’s say a row key is comprised of a five digit zip code in California, declared as a CHAR(5). Phoenix only knows that the column has 5 characters. In theory, the byte array could vary from five 0x01 bytes to five 0xff bytes (or what ever is the largest valid UTF-8 encoded single byte character). While in actuality, the range is from&nbsp;90001 to 96162. Since Phoenix doesn’t know this, it’ll divide up the region based on the theoretical range and all of the work will end up being done by the single thread that has the range encompassing the actual data. The same thing will occur with a DATE column, since the theoretical range is from 1970 to&nbsp;2038, while in actuality the date is probably +/- a year from the current date. Even if Phoenix uses better defaults for the start and end range rather than the theoretical min and max, it would not usually help - there’s just too much variability across domains. 
-   <br /> 
-   <br /> One solution to this problem is to maintain statistics for a table to feed into the parallelization process to ensure an even data distribution. This is the solution we’re working on, as described in more detail in this 
-   <a class="externalLink" href="https://issues.apache.org/jira/browse/PHOENIX-180" target="_blank">issue</a>. 
-   <br /> 
-  </div> 
-  <div class="section"> 
-   <h4 id="Batching"> Batching</h4> An important HBase configuration property 
-   <tt>hbase.client.scanner.caching</tt> controls scanner caching, that is how many rows are returned from the server in a single round trip when a scan is performed. Although this is less important for aggregate queries, since the Phoenix coprocessors are performing the aggregation instead of returning all the data back to the client, it is important for non aggregate queries. If unset, Phoenix defaults this property to 1000. 
-   <br /> 
-   <br /> On the DML side of the fence, performance may improve by turning the connection auto commit to on for multi-row mutations such as those that can occur with 
-   <tt>DELETE</tt> and 
-   <tt>UPSERT SELECT</tt>. In this case, if possible, the mutation will be performed completely on the server side without returning data back to the client. However, when performing single row mutations, such as 
-   <tt>UPSERT VALUES</tt>, the opposite is true: auto commit should be off and a reasonable number of rows should be batched together for a single commit to reduce RPC traffic. 
-   <br /> 
+   <a class="externalLink" href="http://labs.ericsson.com/blog/hbase-performance-tuners" target="_blank">here</a>. 
+   <p>As of 3.2/4.2, parallelization in Phoenix is driven by the guideposts as determined by the configuration parameters for <a class="externalLink" href="http://phoenix.apache.org/update_statistics.html">statistics collection</a>. Each chunk of data between guideposts will be run in parallel in a separate scan to improve query performance. Note that at a minimum, separate scans will be run for each table region. Beyond the statistics collection configuration parameters, the client-side <tt>phoenix.query.threadPoolSize</tt> and <tt>phoenix.query.queueSize</tt> parameters and the server-side <tt>hbase.regionserver.handler.count</tt> parameter have an impact on performance.</p> 
   </div> 
  </div> 
- <div class="section"> 
-  <h3 id="Measuring_Performance"> Measuring Performance</h3> One way to get a feeling for how to configure these properties is to use the performance.py shell script provided in the bin directory of the installation tar. 
-  <br /> 
-  <br /> 
-  <b>Usage: </b> 
-  <tt>performance.py &lt;zookeeper&gt; &lt;row count&gt;</tt> 
-  <br /> 
-  <b>Example: </b> 
-  <tt>performance.py localhost 1000000</tt> 
-  <br /> 
-  <br /> This will create a new table named 
-  <tt>performance_1000000</tt> and upsert 1000000 rows. The schema and data generated is similar to 
-  <tt>examples/web_stat.sql</tt> and 
-  <tt>examples/web_stat.csv</tt>. On the console it will measure the time it takes to: 
-  <br /> 
-  <ul> 
-   <li>upsert these rows</li> 
-   <li>run queries that perform <tt>COUNT</tt>, <tt>GROUP BY</tt>, and <tt>WHERE</tt> clause filters</li> 
-  </ul> For convenience, an 
-  <tt>hbase-site.xml</tt> file is included in the bin directory and pre-configured to already be on the classpath during script execution. 
-  <br /> 
-  <br /> Here is a screenshot of the performance.py script in action: 
-  <br /> 
-  <a class="externalLink" href="http://1.bp.blogspot.com/-VhinivNOJmI/URWBGLYTiHI/AAAAAAAAAQU/Dp9lbH2CxYE/s1600/performance_script.png" style="margin-left: 1em; margin-right: 1em;" rel="lightbox[page]"><img src="http://1.bp.blogspot.com/-VhinivNOJmI/URWBGLYTiHI/AAAAAAAAAQU/Dp9lbH2CxYE/s640/performance_script.png" border="0" height="640" width="497" alt="" /></a> 
- </div> 
- <div class="section"> 
-  <h3 id="Conclusion"> &nbsp;Conclusion</h3> Phoenix has many knobs and dials to tailor the system to your use case. From controlling the level of parallelization, to the size of batches, to the consumption of resource, 
-  <i>there’s a knob for that</i>. &nbsp;These controls are not without there limitations, however. There’s still more work to be done and we’d love to hear your ideas on what you’d like to see made more configurable. 
-  <br /> 
-  <br /> 
- </div> 
 </div>
 			</div>
 		</div>
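
One way to observe the effect of statistics on parallelization is through the query plan. A hedged sketch (the exact plan text varies by version and data, and the chunk count assumes statistics have been collected):

    EXPLAIN SELECT count(*) FROM my_table;
    -- e.g. CLIENT 12-CHUNK PARALLEL 12-WAY FULL SCAN OVER MY_TABLE
    --          SERVER AGGREGATE INTO SINGLE ROW

A smaller phoenix.stats.guidepost.width yields more chunks, and therefore more concurrent scans, at the cost of more merging work on the client.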

Modified: phoenix/site/publish/update_statistics.html
URL: http://svn.apache.org/viewvc/phoenix/site/publish/update_statistics.html?rev=1635051&r1=1635050&r2=1635051&view=diff
==============================================================================
--- phoenix/site/publish/update_statistics.html (original)
+++ phoenix/site/publish/update_statistics.html Wed Oct 29 05:38:43 2014
@@ -1,7 +1,7 @@
 
 <!DOCTYPE html>
 <!--
- Generated by Apache Maven Doxia at 2014-10-20
+ Generated by Apache Maven Doxia at 2014-10-28
  Rendered using Reflow Maven Skin 1.1.0 (http://andriusvelykis.github.io/reflow-maven-skin)
 -->
 <html  xml:lang="en" lang="en">
@@ -126,7 +126,7 @@
  <h1>Statistics Collection</h1>
 </div> 
<p>The UPDATE STATISTICS command updates the statistics collected on a table to improve query performance. This command collects a set of keys per region per column family that are an equal byte distance from each other. These collected keys are called <i>guideposts</i> and they act as <i>hints/guides</i> to improve the parallelization of queries on a given target region.</p> 
-<p>The statistics are also collected during major compaction and when ever a region split happens so manually running the command may not be necessary.</p> 
+<p>Statistics are also automatically collected during major compactions and region splits, so manually running this command may not be necessary.</p> 
 <div class="section"> 
  <h2 id="Examples">Examples</h2> 
  <p>For a given table <tt>my_table</tt>:</p> 
@@ -153,24 +153,32 @@
 </div> 
 <div class="section"> 
  <h2 id="Configurations">Configurations</h2> 
- <p>Some of the configurations associated with the UPDATE STATISTICS command are</p> 
+ <p>The configuration parameters controlling statistics collection include:</p> 
  <ol style="list-style-type: decimal"> 
   <li><tt>phoenix.stats.guidepost.width</tt> 
    <ul> 
-    <li>A server-side configuration that specifies the absolute byte value that determines the number of bytes between the collected guideposts.</li> 
+    <li>A server-side parameter that specifies the number of bytes between guideposts. A smaller width increases parallelization, but also increases the number of chunks which must be merged on the client side.</li> 
+    <li>The default value is 104857600 (100 MB).</li> 
    </ul></li> 
   <li><tt>phoenix.stats.guidepost.per.region</tt> 
    <ul> 
-    <li>Determines the number of guideposts per region. If the configuration ‘phoenix.stats.guidepost.width’ is not set then the MAX_FILE_SIZE associated with the table (for which statistics collection is needed) divided by the values set for this configuration will be used for ‘phoenix.stats.guidepost.width’.</li> 
-    <li>The default value is 20</li> 
+    <li>A server-side parameter that specifies the number of guideposts per region. If set to a value greater than zero, then the guidepost width is determined by the MAX_FILE_SIZE of the table divided by this value. Otherwise, the <tt>phoenix.stats.guidepost.width</tt> parameter is used.</li> 
+    <li>No default value.</li> 
+   </ul></li> 
+  <li><tt>phoenix.stats.updateFrequency</tt> 
+   <ul> 
+    <li>A server-side parameter that determines the frequency, in milliseconds, at which statistics will be refreshed from the statistics table and subsequently used by the client.</li> 
+    <li>The default value is 900000 (15 mins).</li> 
    </ul></li> 
   <li><tt>phoenix.stats.minUpdateFrequency</tt> 
    <ul> 
-    <li>Minimum time in milliseconds that must be passed before another UPDATE STATISTICS call be issued once again to collect the statistics again.</li> 
+    <li>A client-side parameter that determines the minimum amount of time in milliseconds that must pass before statistics may again be manually collected through another <tt>UPDATE STATISTICS</tt> call.</li> 
+    <li>The default value is <tt>phoenix.stats.updateFrequency</tt> divided by two (7.5 mins).</li> 
    </ul></li> 
-  <li><tt>phoenix.stats.updateFrequency</tt> 
+  <li><tt>phoenix.stats.useCurrentTime</tt> 
    <ul> 
-    <li>Minimum frequency in milliseconds that new statistics will be checked for when pulling over new metadata from the client to the server.</li> 
+    <li>An advanced server-side parameter that, if true, causes the current time on the server to be used as the timestamp of rows in the statistics table when background tasks such as compactions or splits occur. If false, the max timestamp found while traversing the table over which statistics are being collected is used as the timestamp. Unless your client is controlling the timestamps while reading and writing data, this parameter should be left alone.</li> 
+    <li>The default value is true.</li> 
    </ul></li> 
  </ol> 
 </div>
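
A short sketch of how these parameters interact with manual collection (my_table is the example table from above):

    UPDATE STATISTICS my_table;
    -- A second call within phoenix.stats.minUpdateFrequency (7.5 mins by default)
    -- will not recollect statistics; clients pick up refreshed guideposts after
    -- at most phoenix.stats.updateFrequency (15 mins by default).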

Modified: phoenix/site/source/src/site/markdown/index.md
URL: http://svn.apache.org/viewvc/phoenix/site/source/src/site/markdown/index.md?rev=1635051&r1=1635050&r2=1635051&view=diff
==============================================================================
--- phoenix/site/source/src/site/markdown/index.md (original)
+++ phoenix/site/source/src/site/markdown/index.md Wed Oct 29 05:38:43 2014
@@ -1,12 +1,12 @@
 # Overview
 
-Apache Phoenix is a SQL skin over HBase delivered as a client-embedded JDBC driver targeting low latency queries over HBase data. Apache Phoenix takes your SQL query, compiles it into a series of HBase scans, and orchestrates the running of those scans to produce regular JDBC result sets. The table metadata is stored in an HBase table and versioned, such that snapshot queries over prior versions will automatically use the correct schema. Direct use of the HBase API, along with coprocessors and custom filters, results in [performance](performance.html) on the order of milliseconds for small queries, or seconds for tens of millions of rows. 
+Apache Phoenix is a relational database layer over HBase delivered as a client-embedded JDBC driver targeting low latency queries over HBase data. Apache Phoenix takes your SQL query, compiles it into a series of HBase scans, and orchestrates the running of those scans to produce regular JDBC result sets. The table metadata is stored in an HBase table and versioned, such that snapshot queries over prior versions will automatically use the correct schema. Direct use of the HBase API, along with coprocessors and custom filters, results in [performance](performance.html) on the order of milliseconds for small queries, or seconds for tens of millions of rows. 
 
 ## Mission
 Become the standard means of accessing HBase data through a well-defined, industry standard API.
 
 ## Quick Start
-Tired of reading already and just want to get started? Take a look at our [FAQs](faq.html), listen to the Apache Phoenix talks from [Hadoop Summit 2013](http://www.youtube.com/watch?v=YHsHdQ08trg) and [HBaseConn 2013](http://www.cloudera.com/content/cloudera/en/resources/library/hbasecon/hbasecon-2013--how-and-why-phoenix-puts-the-sql-back-into-nosql-video.html), and jump over to our quick start guide [here](Phoenix-in-15-minutes-or-less.html).
+Tired of reading already and just want to get started? Take a look at our [FAQs](faq.html), listen to the Apache Phoenix talks from [Hadoop Summit 2014](https://www.youtube.com/watch?v=f4Nmh5KM6gI&feature=youtu.be), review the [overview presentation](http://phoenix.apache.org/presentations/OC-HUG-2014-10-4x3.pdf), and jump over to our quick start guide [here](Phoenix-in-15-minutes-or-less.html).
 
 ##SQL Support##
 To see what's supported, go to our [language reference](language/index.html). It includes all typical SQL query statement clauses, including `SELECT`, `FROM`, `WHERE`, `GROUP BY`, `HAVING`, `ORDER BY`, etc. It also supports a full set of DML commands as well as table creation and versioned incremental alterations through our DDL commands. We try to follow the SQL standards wherever possible.

Modified: phoenix/site/source/src/site/markdown/resources.md
URL: http://svn.apache.org/viewvc/phoenix/site/source/src/site/markdown/resources.md?rev=1635051&r1=1635050&r2=1635051&view=diff
==============================================================================
--- phoenix/site/source/src/site/markdown/resources.md (original)
+++ phoenix/site/source/src/site/markdown/resources.md Wed Oct 29 05:38:43 2014
@@ -3,6 +3,9 @@ Below are some prior presentations that 
 
 | Conference | Video | Presentation |
 |------------|-------|--------------|
+| OC Hadoop User Group 2014 | Apache Phoenix: Transforming HBase into a Relational Database | [OC-HUG-2014-10-4x3.pdf](http://phoenix.apache.org/presentations/OC-HUG-2014-10-4x3.pdf) |
+| Hadoop Summit 2014 | [Apache Phoenix: Transforming HBase into a SQL database](https://www.youtube.com/watch?v=f4Nmh5KM6gI&feature=youtu.be) | [HadoopSummit2014-16x9.pdf](http://phoenix.apache.org/presentations/HadoopSummit2014-16x9.pdf) |
+| HBaseCon 2014 | [Taming HBase with Apache Phoenix and SQL](http://vimeo.com/98485780) | [HBaseCon2014-16x9.pdf](http://phoenix.apache.org/presentations/HBaseCon2014-16x9.pdf) |
 | ApacheCon 2014 | [How Apache Phoenix enables interactive, low latency applications over your HBase data](https://www.youtube.com/watch?v=9qfBnFyKZwM) | [ApacheCon2014-16x9.pdf](http://phoenix.apache.org/presentations/ApacheCon2014-16x9.pdf) |
 | Hadoop Summit 2013  | [How (and why) Phoenix puts the SQL back into NoSQL](http://www.youtube.com/watch?v=YHsHdQ08trg) | [HadoopSummit2013-16x9.pdf](http://phoenix.apache.org/presentations/HadoopSummit2013-16x9.pdf) |
 | HBaseCon 2013  | [How (and why) Phoenix puts the SQL back into NoSQL](http://www.cloudera.com/content/cloudera/en/resources/library/hbasecon/hbasecon-2013--how-and-why-phoenix-puts-the-sql-back-into-nosql-video.html) | [HBaseCon2013-4x3.pdf](http://phoenix.apache.org/presentations/HBaseCon2013-4x3.pdf) |

Modified: phoenix/site/source/src/site/markdown/tuning.md
URL: http://svn.apache.org/viewvc/phoenix/site/source/src/site/markdown/tuning.md?rev=1635051&r1=1635050&r2=1635051&view=diff
==============================================================================
--- phoenix/site/source/src/site/markdown/tuning.md (original)
+++ phoenix/site/source/src/site/markdown/tuning.md Wed Oct 29 05:38:43 2014
@@ -26,6 +26,36 @@ of the
       beyond which an attempt to queue additional work is
       rejected by throwing an exception. If zero, a SynchronousQueue is used
       instead of the bounded round robin queue.</td><td>500</td></tr>
+<tr><td><small>phoenix.stats.guidepost.width</small></td><td>
+A server-side parameter that specifies the number of bytes between guideposts.
+      A smaller width increases parallelization, but also increases the number of
+      chunks which must be merged on the client side. The default value is 100 MB.
+</td><td>104857600</td></tr>
+<tr><td><small>phoenix.stats.guidepost.per.region</small></td><td>
+A server-side parameter that specifies the number of guideposts per region.
+      If set to a value greater than zero, then the guidepost width is determined by
+      the MAX_FILE_SIZE of the table divided by this value. Otherwise, the
+      <code>phoenix.stats.guidepost.width</code> parameter is used. No
+default value.
+</td><td>None</td></tr>
+<tr><td><small>phoenix.stats.updateFrequency</small></td><td>
+A server-side parameter that determines the frequency, in milliseconds, at which statistics
+will be refreshed from the statistics table and subsequently used by the client. The
+default value is 15 min.
+</td><td>900000</td></tr>
+<tr><td><small>phoenix.stats.minUpdateFrequency</small></td><td>
+A client-side parameter that determines the minimum amount of time in milliseconds that
+      must pass before statistics may again be manually collected through another <code>UPDATE
+      STATISTICS</code> call. The default value is <code>phoenix.stats.updateFrequency</code>/2. 
+</td><td>450000</td></tr>
+<tr><td><small>phoenix.stats.useCurrentTime</small></td><td>
+An advanced server-side parameter that, if true, causes the current time on the server
+      to be used as the timestamp of rows in the statistics table when background tasks such as
+      compactions or splits occur. If false, then the max timestamp found while traversing the
+      table over which statistics are being collected is used as the timestamp. Unless your
+      client is controlling the timestamps while reading and writing data, this parameter
+      should be left alone. The default value is true.
+</td><td>true</td></tr>
 <tr><td><small>phoenix.query.spoolThresholdBytes</small></td><td style="text-align: left;">Threshold
      size in bytes after which the results of parallelly executed
      queries are spooled to disk. Default is 20 mb.</td><td>20971520</td></tr>
@@ -45,13 +75,6 @@ of the
 any one tenant is allowed to consume. After this percentage, an
 <code>InsufficientMemoryException</code> is
       thrown. Default is 100%</td><td>100</td></tr>
-<tr><td><small>phoenix.query.targetConcurrency</small></td><td style="text-align: left;">Target concurrent
-      threads to use for a query. It serves as a soft limit on the number of
-      scans into which a query may be split. The value should not exceed the hard limit imposed by<code> phoenix.query.maxConcurrency</code>.</td><td>32</td></tr>
-<tr><td><small>phoenix.query.maxConcurrency</small></td><td style="text-align: left;">Maximum concurrent
-      threads to use for a query. It servers as a hard limit on the number
-      of scans into which a query may be split. A soft limit is imposed by
-<code>phoenix.query.targetConcurrency</code>.</td><td>64</td></tr>
 <tr><td><small>phoenix.query.dateFormat</small></td><td style="text-align: left;">Default pattern to use
       for conversion of a date to/from a string, whether through the
       <code>TO_CHAR(&lt;date&gt;)</code> or
@@ -62,11 +85,6 @@ any one tenant is allowed to consume. Af
       <code>TO_CHAR(&lt;decimal-number&gt;)</code> or
 <code>TO_NUMBER(&lt;decimal-string&gt;)</code> functions, or through
 <code>resultSet.getString(&lt;decimal-column&gt;)</code>. Default is #,##0.###</td><td>#,##0.###</td></tr>
-<tr><td><small>phoenix.query.statsUpdateFrequency</small></td><td style="text-align: left;">The frequency
-      in milliseconds at which the stats for each table will be
-updated. Default is 15 min.</td><td>900000</td></tr>
-<tr><td><small>phoenix.query.maxStatsAge</small></td><td>The maximum age of
-      stats in milliseconds after which they will no longer be used (i.e. the stats were not able to be updated in this amount of time and thus are considered too old). Default is 1 day.</td><td>1</td></tr>
 <tr><td><small>phoenix.mutate.maxSize</small></td><td style="text-align: left;">The maximum number of rows
       that may be batched on the client
       before a commit or rollback must be called.</td><td>500000</td></tr>
@@ -75,8 +93,6 @@ updated. Default is 15 min.</td><td>9000
 overridden at connection
       time by specifying the <code>UpsertBatchSize</code>
       property value. Note that the connection property value does not affect the batch size used by the coprocessor when these statements are executed completely on the server side.</td><td>1000</td></tr>
-<tr><td><small>phoenix.query.maxIntraRegionParallelization</small></td><td style="text-align: left;">The maximum number of threads that will be spawned to process data within a single region during query execution</td><td>64</td></tr>
-<tr><td><small>phoenix.query.rowKeyOrderSaltedTable</small></td><td style="text-align: left;">Whether or not a non aggregate query returns rows in row key order for salted tables. If this option is turned on, split points may not be specified at table create time, but instead the default splits on each salt bucket must be used. Default is true</td><td>true</td></tr>
 <tr><td><small>phoenix.query.maxServerCacheBytes</small></td><td style="text-align: left;">Maximum size (in bytes) of a single sub-query result (usually the filtered result of a table) before compression and conversion to a hash map. Attempting to hash an intermediate sub-query result of a size bigger than this setting will result in a MaxServerCacheSizeExceededException. Default 100MB.</td><td>104857600</td></tr>
 <tr><td><small>phoenix.coprocessor.maxServerCacheTimeToLiveMs</small></td><td style="text-align: left;">Maximum living time (in milliseconds) of server caches. A cache entry expires after this amount of time has passed since last access. Consider adjusting this parameter when a server-side IOException("Could not find hash cache for joinId") happens. Getting warnings like "Earlier hash cache(s) might have expired on servers" might also be a sign that this number should be increased.</td><td>30000</td></tr>
 <tr><td><small>phoenix.query.useIndexes</small></td><td style="text-align: left;">Determines whether or not indexes are considered by the optimizer to satisfy a query. Default is true
@@ -105,8 +121,6 @@ overridden at connection
 </td><td>10240000</td></tr>
 <tr><td><small>phoenix.sequence.cacheSize</small></td><td style="text-align: left;">Number of sequence values to reserve from the server and cache on the client when the next sequence value is allocated. Only used if not defined by the sequence itself. Default is 100
 </td><td>100</td></tr>
-<tr><td><small>phoenix.client.autoUpgradeWhiteList</small></td><td style="text-align: left;">Comma separated list of case sensitive full table names to automatically upgrade from 2.2.x format to 3.0/4.0 format. Use * to upgrade all tables. Only applies on the first connection to a 3.0/4.0 cluster. Not specified by default. For more information, see [here](http://phoenix.apache.org/upgrade_from_2_2.html)
-</td><td>&nbsp;</td></tr>
 <tr><td><small>phoenix.clock.skew.interval</small></td><td style="text-align: left;">Delay interval(in milliseconds) when opening SYSTEM.CATALOG to compensate possible time clock skew when SYSTEM.CATALOG moves among region servers. 
 </td><td>2000</td></tr>
<tr><td><small>phoenix.index.failure.handling.rebuild</small></td><td style="text-align: left;">Boolean flag which turns on/off automatic rebuilding of a failed index from the point at which updates failed to be written to the index.
@@ -115,57 +129,30 @@ overridden at connection
 </td><td>10000</td></tr>
 <tr><td><small>phoenix.index.failure.handling.rebuild.overlap.time</small></td><td style="text-align: left;">Index rebuild job builds an index from when it failed - the time interval(in milliseconds) in order to create a time overlap to prevent missing updates when there exists time clock skew.
 </td><td>300000</td></tr>
+<tr><td><strike><small>phoenix.query.targetConcurrency</small></strike><br/>Obsolete as of 3.2/4.2</td><td style="text-align: left;">The target number of concurrent
+      threads to use for a query. It serves as a soft limit on the number of
+      scans into which a query may be split. The value should not exceed the hard limit imposed by <code>phoenix.query.maxConcurrency</code>.</td><td>32</td></tr>
+<tr><td><strike><small>phoenix.query.maxConcurrency</small></strike><br/>Obsolete as of 3.2/4.2</td><td style="text-align: left;">The maximum number of concurrent
+      threads to use for a query. It serves as a hard limit on the number
+      of scans into which a query may be split. A soft limit is imposed by
+<code>phoenix.query.targetConcurrency</code>.</td><td>64</td></tr>
+<tr><td><strike><small>phoenix.query.maxStatsAge</small></strike><br/>Obsolete as of 3.2/4.2</td><td style="text-align: left;">The maximum age of
+      stats in milliseconds after which they will no longer be used (i.e. the stats could not be updated within this window and are thus considered too old). Default is 1 day.</td><td>86400000</td></tr>
+<tr><td><strike><small>phoenix.query.statsUpdateFrequency</small></strike><br/>Obsolete as of 3.2/4.2</td><td style="text-align: left;">The frequency
+      in milliseconds at which the stats for each table will be
+updated. Default is 15 min.</td><td>900000</td></tr>
+<tr><td><strike><small>phoenix.query.maxIntraRegionParallelization</small></strike><br/>Obsolete as of 3.2/4.2</td><td style="text-align: left;">The maximum number of threads that will be spawned to process data within a single region during query execution</td><td>64</td></tr>
+<tr><td><small>phoenix.query.rowKeyOrderSaltedTable</small></td><td style="text-align: left;">Whether or not a non aggregate query returns rows in row key order for salted tables. If this option is turned on, split points may not be specified at table create time, but instead the default splits on each salt bucket must be used. Default is true</td><td>true</td></tr>
 </tbody></table>
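+
+As a minimal sketch (the values below are illustrative examples, not recommendations), a couple of the client-side parameters above could be overridden in the client's <code>hbase-site.xml</code>:
+
+    <configuration>
+      <!-- Example: allow larger mutation batches before a commit or rollback is required -->
+      <property>
+        <name>phoenix.mutate.maxSize</name>
+        <value>1000000</value>
+      </property>
+      <!-- Example: give up row key order on salted tables in exchange for faster scans -->
+      <property>
+        <name>phoenix.query.rowKeyOrderSaltedTable</name>
+        <value>false</value>
+      </property>
+    </configuration>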
 <br />
 <h4>
 Parallelization</h4>
-Phoenix breaks up aggregate queries into multiple scans and runs them in parallel through custom aggregating coprocessors to improve performance.&nbsp;Hari Kumar, from Ericsson Labs, did a good job of explaining the performance benefits of parallelization and coprocessors <a href="http://labs.ericsson.com/blog/hbase-performance-tuners" target="_blank">here</a>. One of the most important factors in getting good query performance with Phoenix is to ensure that table splits are well balanced. This includes having regions of equal size as well as an even distribution across region servers. There are open source tools such as&nbsp;<a href="http://www.sentric.ch/blog/hbase-split-visualisation-introducing-hannibal" target="_blank">Hannibal</a>&nbsp;that can help you monitor this. By having an even distribution of data, every thread spawned by the Phoenix client will have an equal amount of work to process, thus reducing the time it takes to get the results back. <br />
-<br />
-The <code>phoenix.query.targetConcurrency</code> and <code>phoenix.query.maxConcurrency</code> control how a query is broken up into multiple scans on the client side. The idea for parallelization of queries is to align the scan boundaries with region boundaries. If rows are not evenly distributed across regions, using this scheme compensates for regions that have more rows than others, by applying tighter splits and therefore spawning off more scans over the overloaded regions.<br />
-<br />
-The split points for parallelization are computed as follows. Let's suppose:<br />
-<ul>
-<li><code>t</code> is the target concurrency</li>
-<li><code>m</code> is the max concurrency</li>
-<li><code>r</code> is the number of regions we need to scan</li>
-</ul>
-<code>if r &gt;= t</code><br />
-&nbsp;&nbsp; scan using regional boundaries<br />
-<code>else if r/2 &gt; t</code><br />
-&nbsp;&nbsp; split each region in s splits such that: <code>s = max(x) where s * x &lt; m</code><br />
-<code>else</code><br />
-&nbsp;&nbsp; split each region in s splits such that:&nbsp; <code>s = max(x) where s * x &lt; t</code><br />
-<br />
-Depending on the number of cores in your client machine and the size of your cluster, the <code>phoenix.query.threadPoolSize</code>, <code>phoenix.query.queueSize</code>,<code> phoenix.query.maxConcurrency</code>, and <code>phoenix.query.targetConcurrency</code> may all be increased to allow more threads to process a query in parallel. This will allow Phoenix to divide up a query into more scans that may then be executed in parallel, thus reducing latency.<br />
-<br />
-This approach is not without its limitations. The primary issue is that Phoenix does not have sufficient information to divide up a region into equal data sizes. If the query results span many regions of data, this is not a problem, since regions are more or less of equal size. However, if a query accesses only a few regions, this can be an issue. The best Phoenix can do is to divide up the key space between the start and end key evenly. If there's any skew in the data, then some scans are bound to bear the brunt of the work. You can adjust <code>phoenix.query.maxIntraRegionParallelization</code> to a smaller number to decrease the number of threads spawned per region if you find that throughput is suffering.<br />
-<br />
-For example, let's say a row key is comprised of a five digit zip code in California, declared as a CHAR(5). Phoenix only knows that the column has 5 characters. In theory, the byte array could vary from five 0x01 bytes to five 0xff bytes (or what ever is the largest valid UTF-8 encoded single byte character). While in actuality, the range is from&nbsp;90001 to 96162. Since Phoenix doesn't know this, it'll divide up the region based on the theoretical range and all of the work will end up being done by the single thread that has the range encompassing the actual data. The same thing will occur with a DATE column, since the theoretical range is from 1970 to&nbsp;2038, while in actuality the date is probably +/- a year from the current date. Even if Phoenix uses better defaults for the start and end range rather than the theoretical min and max, it would not usually help - there's just too much variability across domains.<br />
-<br />
-One solution to this problem is to maintain statistics for a table to feed into the parallelization process to ensure an even data distribution. This is the solution we're working on, as described in more detail in this <a href="https://issues.apache.org/jira/browse/PHOENIX-180" target="_blank">issue</a>.<br />
-<h4>
-Batching</h4>
-An important HBase configuration property <code>hbase.client.scanner.caching</code> controls scanner caching, that is how many rows are returned from the server in a single round trip when a scan is performed. Although this is less important for aggregate queries, since the Phoenix coprocessors are performing the aggregation instead of returning all the data back to the client, it is important for non aggregate queries. If unset, Phoenix defaults this property to 1000.<br />
-<br />
-On the DML side of the fence, performance may improve by turning the connection auto commit to on for multi-row mutations such as those that can occur with <code>DELETE</code> and <code>UPSERT SELECT</code>. In this case, if possible, the mutation will be performed completely on the server side without returning data back to the client. However, when performing single row mutations, such as <code>UPSERT VALUES</code>, the opposite is true: auto commit should be off and a reasonable number of rows should be batched together for a single commit to reduce RPC traffic.<br />
-<h3>
-Measuring Performance</h3>
-One way to get a feeling for how to configure these properties is to use the performance.py shell script provided in the bin directory of the installation tar.<br />
-<br />
-<b>Usage: </b><code>performance.py &lt;zookeeper&gt; &lt;row count&gt;</code><br />
-<b>Example: </b><code>performance.py localhost 1000000</code><br />
-<br />
-This will create a new table named <code>performance_1000000</code> and upsert 1000000 rows. The schema and data generated is similar to <code>examples/web_stat.sql</code> and <code>examples/web_stat.csv</code>. On the console it will measure the time it takes to:<br />
-<ul>
-<li>upsert these rows</li>
-<li>run queries that perform <code>COUNT</code>, <code>GROUP BY</code>, and <code>WHERE</code> clause filters</li>
-</ul>
-For convenience, an <code>hbase-site.xml</code> file is included in the bin directory and pre-configured to already be on the classpath during script execution.<br />
-<br />
-Here is a screenshot of the performance.py script in action:<br />
-<div class="separator" style="clear: both; text-align: center;">
-<a href="http://1.bp.blogspot.com/-VhinivNOJmI/URWBGLYTiHI/AAAAAAAAAQU/Dp9lbH2CxYE/s1600/performance_script.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="640" src="http://1.bp.blogspot.com/-VhinivNOJmI/URWBGLYTiHI/AAAAAAAAAQU/Dp9lbH2CxYE/s640/performance_script.png" width="497" /></a></div>
-<h3>
-&nbsp;Conclusion</h3>
-Phoenix has many knobs and dials to tailor the system to your use case. From controlling the level of parallelization, to the size of batches, to the consumption of resource, <i>there's a knob for that</i>. &nbsp;These controls are not without there limitations, however. There's still more work to be done and we'd love to hear your ideas on what you'd like to see made more configurable.<br />
-<br />
+Phoenix breaks up aggregate queries into multiple scans and runs them in parallel through custom aggregating coprocessors to improve performance.&nbsp;Hari Kumar, from Ericsson Labs, did a good job of explaining the performance benefits of parallelization and coprocessors <a href="http://labs.ericsson.com/blog/hbase-performance-tuners" target="_blank">here</a>.
+
+As of 3.2/4.2, parallelization in Phoenix is driven by guideposts, whose density is determined by the configuration parameters for
+[statistics collection](http://phoenix.apache.org/update_statistics.html). Each chunk of data between guideposts
+is scanned in parallel with a separate scan to improve query performance. Note that, at a minimum, a separate scan is
+run for each table region. Beyond the statistics collection parameters, the client-side
+<code>phoenix.query.threadPoolSize</code> and <code>phoenix.query.queueSize</code> parameters and the server-side
+<code>hbase.regionserver.handler.count</code> parameter also bound how many of these parallel scans may run at once.
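+
+For illustration only, a client that runs many concurrent queries might raise the client-side limits in its <code>hbase-site.xml</code> (the values below are examples, not recommendations; appropriate settings depend on client cores and cluster size):
+
+    <configuration>
+      <!-- Client side: threads available to execute the parallel scans of a query -->
+      <property>
+        <name>phoenix.query.threadPoolSize</name>
+        <value>256</value>
+      </property>
+      <!-- Client side: depth of the work queue feeding that thread pool -->
+      <property>
+        <name>phoenix.query.queueSize</name>
+        <value>10000</value>
+      </property>
+    </configuration>
+
+On the server side, <code>hbase.regionserver.handler.count</code> (set in the region servers' <code>hbase-site.xml</code>) caps how many RPCs, including these scans, each region server will service concurrently.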
+

Modified: phoenix/site/source/src/site/markdown/update_statistics.md
URL: http://svn.apache.org/viewvc/phoenix/site/source/src/site/markdown/update_statistics.md?rev=1635051&r1=1635050&r2=1635051&view=diff
==============================================================================
--- phoenix/site/source/src/site/markdown/update_statistics.md (original)
+++ phoenix/site/source/src/site/markdown/update_statistics.md Wed Oct 29 05:38:43 2014
@@ -2,11 +2,11 @@
 
 The UPDATE STATISTICS command updates the statistics collected on a table, to improve query performance.
 This command collects a set of keys per region per column family that are an equal byte distance apart.
-These collected keys are called *guideposts* and they act as *hints/guides* to improve the parallelization of queries on a given target 
-region.
+These collected keys are called *guideposts* and they act as *hints/guides* to improve the parallelization of
+queries on a given target region.
 
-The statistics are also collected during major compaction and when ever a region split happens so manually running the command
-may not be necessary.
+Statistics are also automatically collected during major compactions and region splits, so manually running this
+command may not be necessary.
 
 ## Examples
 
@@ -14,8 +14,8 @@ For a given table <code>my_table</code>:
 
     UPDATE STATISTICS <code>my_table</code>
 
-The above syntax would collect the statistics for the table my_table and all the index tables, views and view index tables associated
-with the table my_table.
+The above syntax collects statistics for the table my_table and for all index tables, views, and
+view index tables associated with my_table.
 
 The equivalent of the above syntax is
 
@@ -31,19 +31,33 @@ To collect the statistics on the table a
 
 ## Configurations
 
-Some of the configurations associated with the UPDATE STATISTICS command are
+The configuration parameters controlling statistics collection are described below (a sample <code>hbase-site.xml</code> sketch follows the list):
 
 1.  <code>phoenix.stats.guidepost.width</code>
-    * A server-side configuration that specifies the absolute byte value that determines the number of bytes between the collected
-      guideposts.
+    * A server-side parameter that specifies the number of bytes between guideposts.
+      A smaller value increases parallelization, but also increases the number of
+      chunks that must be merged on the client side.
+    * The default value is 104857600 (100 MB).
 2.  <code>phoenix.stats.guidepost.per.region</code>
-    * Determines the number of guideposts per region.  If the configuration 'phoenix.stats.guidepost.width' is not set then the
-     MAX_FILE_SIZE associated with the table (for which statistics collection is needed) divided by the values set for this 
-     configuration will be used for 'phoenix.stats.guidepost.width'.
-    * The default value is 20
-3.  <code>phoenix.stats.minUpdateFrequency</code>
-    * Minimum time in milliseconds that must be passed before another UPDATE STATISTICS call be issued once again to collect the
-     statistics again.
-4.  <code>phoenix.stats.updateFrequency</code>
-    * Minimum frequency in milliseconds that new statistics will be checked for when pulling over new metadata from the client to the
-     server.
+    * A server-side parameter that specifies the number of guideposts per region.
+      If set to a value greater than zero, the guidepost width is determined by
+      the MAX_FILE_SIZE of the table divided by this value. Otherwise,
+      the <code>phoenix.stats.guidepost.width</code> parameter is used.
+    * No default value.
+3.  <code>phoenix.stats.updateFrequency</code>
+    * A server-side parameter that determines the frequency in milliseconds at which statistics
+      will be refreshed from the statistics table and subsequently used by the client.
+    * The default value is 900000 (15 mins).
+4.  <code>phoenix.stats.minUpdateFrequency</code>
+    * A client-side parameter that determines the minimum amount of time in milliseconds that
+      must pass before statistics may again be manually collected through another <code>UPDATE
+      STATISTICS</code> call.
+    * The default value is <code>phoenix.stats.updateFrequency</code> divided by two (7.5 mins).
+5.  <code>phoenix.stats.useCurrentTime</code>
+    * An advanced server-side parameter that, if true, causes the current server time
+      to be used as the timestamp of rows in the statistics table when background tasks such as
+      compactions or splits occur. If false, the max timestamp found while traversing the
+      table over which statistics are being collected is used as the timestamp. Unless your
+      client is controlling the timestamps while reading and writing data, this parameter
+      should be left alone.
+    * The default value is true.
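+
+As a sample <code>hbase-site.xml</code> sketch (the values are illustrative only; a smaller guidepost
+width increases parallelism at the cost of more client-side merging):
+
+    <configuration>
+      <!-- Server side: collect a guidepost every 50 MB rather than the default 100 MB -->
+      <property>
+        <name>phoenix.stats.guidepost.width</name>
+        <value>52428800</value>
+      </property>
+      <!-- Server side: refresh statistics from the statistics table every 5 minutes -->
+      <property>
+        <name>phoenix.stats.updateFrequency</name>
+        <value>300000</value>
+      </property>
+    </configuration>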

Added: phoenix/site/source/src/site/resources/presentations/HBaseCon2014-16x9.pdf
URL: http://svn.apache.org/viewvc/phoenix/site/source/src/site/resources/presentations/HBaseCon2014-16x9.pdf?rev=1635051&view=auto
==============================================================================
Binary file - no diff available.

Propchange: phoenix/site/source/src/site/resources/presentations/HBaseCon2014-16x9.pdf
------------------------------------------------------------------------------
    svn:mime-type = application/octet-stream

Added: phoenix/site/source/src/site/resources/presentations/HadoopSummit2014-16x9.pdf
URL: http://svn.apache.org/viewvc/phoenix/site/source/src/site/resources/presentations/HadoopSummit2014-16x9.pdf?rev=1635051&view=auto
==============================================================================
Binary file - no diff available.

Propchange: phoenix/site/source/src/site/resources/presentations/HadoopSummit2014-16x9.pdf
------------------------------------------------------------------------------
    svn:mime-type = application/octet-stream

Added: phoenix/site/source/src/site/resources/presentations/OC-HUG-2014-10-4x3.pdf
URL: http://svn.apache.org/viewvc/phoenix/site/source/src/site/resources/presentations/OC-HUG-2014-10-4x3.pdf?rev=1635051&view=auto
==============================================================================
Binary file - no diff available.

Propchange: phoenix/site/source/src/site/resources/presentations/OC-HUG-2014-10-4x3.pdf
------------------------------------------------------------------------------
    svn:mime-type = application/octet-stream