You are viewing a plain text version of this content. The canonical link for it is here.
Posted to common-user@hadoop.apache.org by Tarandeep Singh <ta...@gmail.com> on 2009/03/03 19:54:34 UTC

Time Series Analysis using CloudBase

Hi,

<http://cloudbase.sourceforge.net/>[ CloudBase is a data warehouse system
built on top of Hadoop's Map-Reduce architecture. It uses ANSI SQL as its
query language and comes with a JDBC driver. It is developed by Business.com
and is released to open source community under GNU GPL license. One can
download CloudBase from sourceforge- http://cloudbase.sourceforge.net ]

CloudBase supports an aggregate function called BCAT. It is similar to other
aggregate functions (e.g. SUM, COUNT etc) and it is used to concatenate
column values. For example if we have a table-

C1        C2
-----------------
A          10
B          20
C          30

then the query-

SELECT BCAT( C1) from tablename

would give ABC as result.

Though this seems very rudimentary, BCAT function makes it easier to write
simple SQL queries that can be used to do time series analysis. This is
explained with the help of the following examples-

Example 1:
--------------------

Let sessionID denote the unique id for a session, nodeID denote the unique
page id, timeID denote the time at which the user visited a particular page,
and revenue the revenue generated by that user on that page. The following
ANSI SQL query together with BCAT, aggregates sessions and total revenue for
sessions that have the same navigation path on the website:


SELECT path, COUNT( path ), SUM( pathRevenue )
FROM
(
   SELECT sessionID, BCAT( nodeID ) AS path, SUM( revenue ) AS pathRevenue
   FROM
   (
      SELECT sessionID, nodeID FROM <weblog table> ORDER BY timeID
    )
   GROUP BY sessionID
)
GROUP BY path


Example 2:
--------------------

Let us say we have a purchase table with the following columns- (userid,
item, date) and we want to find out all the users who purchased product X
(say ipod), then X accessory and then product Y (say iphone). The following
query gives us all such users-

SELECT userid, BCAT( item ) AS purchase_str
FROM
(
   SELECT userid, item, date FROM purchase ORDER BY date
) AS tmp
GROUP BY userid HAVING purchase_str LIKE ‘ipodaccessoryiphone'


We can modify the HAVING clause to achieve other results-


1) purchase=ipod <any number of things (0 to infinity)> iphone

SELECT userid, BCAT( item) AS purchase_str
FROM
(
  SELECT userid, item, date FROM purchase ORDER BY date
) AS tmp
GROUP BY userid HAVING purchase_str LIKE ‘ipod%iphone'


2) purchase = 'ipod accessory iphone' OR 'ipod iphone accessory'

SELECT userid, BCAT( item) AS purchase_str
FROM
(
   SELECT userid, item, date FROM purchase ORDER BY date
) AS tmp
GROUP BY userid HAVING purchase_str LIKE ‘ipodaccessoryiphone' OR
purchase_str LIKE ‘ipodiphoneaccessory'


3) Show all permutations of purchases
SELECT username, BCAT( item), count(*)
FROM
(
   SELECT username, item, date FROM purchase ORDER BY date
) AS tmp
GROUP BY username


One can also use expressions, string functions inside BCAT function. For
example to include spaces in the purchase string one can do-

SELECT BCAT( CONCAT( item, ‘ ‘ ) ) FROM purchase


Thanks,
Taran