You are viewing a plain text version of this content. The canonical link for it is here.
Posted to common-commits@hadoop.apache.org by Apache Wiki <wi...@apache.org> on 2007/09/13 06:27:01 UTC
[Lucene-hadoop Wiki] Update of "Hbase/HbaseShell/Altools" by udanax

Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Lucene-hadoop Wiki" for change notification.

The following page has been changed by udanax:
http://wiki.apache.org/lucene-hadoop/Hbase/HbaseShell/Altools

New page:
[[TableOfContents(5)]]
----
 ''-- This project is currently in the planning stage.  [https://issues.apache.org/jira/browse/HADOOP-1608 HADOOP-1608] to add "Relational Algrebra Operators" is currently in process.[[BR]]-- If you have constructive ideas, Please advise me. [[MailTo(webmaster AT SPAMFREE udanax DOT org)]]''

= Hbase Shell Altools Plan =

Hbase altools is an Hbase Shell sub 'interpreter' (or 'shell)' program to provide scalable data processing capabilities like  aggregation, algebraic calculation(groups and sets, commutative rings, algebraic geometry, and linear algebra) on Hadoop + Hbase based parallel machines. especially, it will focus on storing and manipulating numeric, sparse matrices on Hbase.

Altools operations will show or explain how Google search's LSI, Google Earth's algebraic topology, Google News' recommendation system are related to Bigtable.

== Background ==
I expect Hadoop + Hbase to handle sparsity and data explosion very well in near future. Moreover, i believe the design of the multi-dimensional map structure and the 3d space model of the data are optimized for rapid ad-hoc information retrieval in any orientation, as well as for fast, flexible calculation and transformation of raw data based on formulaic relationships. It is advantageous with respect to Analysis Processing as it allows users to easily formulate complex queries, and filter or slice data into meaningful subsets, among other things.

=== Parallelization ===

Altools provides automatic parallelization of the most time-consuming relational/matrix/vector operations, and will ensure that the iterative solvers are scalable.

 * Survey List
  * Parallel Processing of Relational Data
  * Parallel Algorithms of Multi-Dimensional Matrix Operations
  * Parallel Gaussian Elimination Algorithm

----

= Suggested Hbase altools Syntax =
'''Note''' that Data should be located by their row, column, and timestamp.

== Commands ==
||<bgcolor="#E5E5E5">'''Command''' ||<bgcolor="#E5E5E5">'''Explanation''' ||
||Table ||<99%>'''Table''' command loads specified table. [[BR]][[BR]]~-''Table('table_name');''-~ ||
||Matrix ||<99%>'''Matrix''' command constructs the configuration of the logic matrix.[[BR]]'''Options''' : features not yet. [[BR]][[BR]]~-''Matrix(table_name, columnfamily_name[, option]);''-~ ||
||Substitute ||<99%>'''Substitute''' specific algebraic expression to [A~Z][[BR]][[BR]]~-''A = Projection(column-list);''-~ ||
||IF...ELSE ||<99%>'''IF...ELSE''', Imposes conditions on the execution. [[BR]][[BR]]~-''IF ( boolean_expression )[[BR]]{{{    }}}B = command_statements;[[BR]]ELSE[[BR]]{{{    }}}B = command_statements;''-~||
||Store ||<99%>'''Store''' command will store results to specified table or file. [[BR]][[BR]]~-''A = Table('table_name'); [[BR]]B = A.Selection(condition_expression); [[BR]]Store B TO table(result_table)[or file('result_file_name')];''-~ ||

'''Type''' 'help;' for Hbase altools usage.

{{{
HBase > altools;

Hbase altools, 0.0.1 version
Type 'help;' for Hbase altools usage.

Hbase.altools > help;
}}}

== Relational Operators ==
||<bgcolor="#E5E5E5">'''Operator''' ||<bgcolor="#E5E5E5">'''Explanation''' ||
||Projection ||<99%>'''Projection''' of a relation ~+R+~, It makes a new relation as the set that is obtained when all tuples(rows) in ~+R+~ are restricted to the set {columnfamily,,1,,,...,columnfamily,,n,,}.[[BR]][[BR]]~-''A = Table('table_name');[[BR]]B = A.Projection(column-list);{{{    }}}'''//π,,column-list,,(A)''' ''-~ ||
||Selection ||<99%>'''Selection''' of a relation ~+R+~, It makes a new relation as the set of specified tuples(rows) of the relation ~+R+~.[[BR]]'''Set Operations''' : ~-''OR, AND, NOT''-~[[BR]][[BR]]~-''A = Table('table_name');[[BR]]B = A.Selection(condition_expression);{{{    }}}'''//σ,,condition,,(A)''' ''-~ ||
||JOINs ||<99%>Table '''JOIN''' operations, linking and extracting data from two different internal source.[[BR]]'''Operations''' : ~-''naturalJoin(), thetaJoin(), cartesianProduct() ''-~ [[BR]][[BR]]~-''R = Table('table_name1');[[BR]]S = Table('table_name2');[[BR]]C = R.naturalJoin(S);{{{    }}}'''//C = R▷◁S''' ''-~ ||
||Group ||<99%>'''Group''' tuples by value of an attribute and apply aggregate function independently to each group of tuples.[[BR]]'''Aggregate Functions''' : ~-''AVG(attribute), SUM(attribute), COUNT(attribute), MIN(attribute), MAX(attribute)''-~[[BR]][[BR]]~-''A = Table('table_name');[[BR]]B = A.Group(column-list);{{{    }}}'''//γ,,column-list,,(A)''' ''-~ ||
||Sort ||<99%>'''Sort''' of tuples(rows) of R, ordered according to columnfamilies on columnfamily-list.[[BR]][[BR]]~-''A = Table('table_name');[[BR]]B = Sort A by (column-list);{{{    }}}'''//τ,,column-list,,(A)''' ''-~ ||

'''(ex. 1)''' Search the subject and the year of the movies which were produced by 'Fox' company and where running time is more than 100 minutes.
[[BR]]~-'''''π ,,title.year,, (σ ,,length > 100,, (movieLog_table) ∩ σ ,,studioName = 'Fox',, (movieLog_table))'''''-~

{{{
Hbase.altools > A = Table('movieLog_table'); 
Hbase.altools > B = A.Selection(length > 100 AND studioName = 'Fox'); 
Hbase.altools > C = B.Projection('year'); 

Hbase.altools > store C to table('result_table'); 
}}}

'''(ex. 2)''' Theta Join : ▷◁,,C,,
[[BR]]~-'''''movieStars_table▷◁,,actor < year,,movieLog_table = σ,,actor < year,,(movieStars_table X movieLog_table)'''''-~

{{{
Hbase.altools > A = Table('movieStars_table'); 
Hbase.altools > B = Table('movieLog_table');
Hbase.altools > C = A.thetaJoin(B);

Hbase.altools > store C to table('result_table'); 
}}}

'''(ex. 3)''' Find the year of the earliest movie for each actor.
[[BR]]~-'''''γ ,,starName.MIN(year) → minYear,, (movieStars_table)'''''-~

{{{
Hbase.altools > A = Table('movieStars_table');
Hbase.altools > B = A.Group('starName',MIN('year'));

Hbase.altools > store B to table('result_table');
}}}

== Matrix Operators ==
'''Note''' that matrix operations are the core of many linear systems.
=== Arithmetic Operators ===
||<bgcolor="#E5E5E5">'''Operator''' ||<bgcolor="#E5E5E5">'''Explanation''' ||
||Addition ||<99%>'''Adding''' entries with the same indices. [[BR]][[BR]]~-''A = Matrix('table_name1','columnfamily_name1');[[BR]]B = Matrix('table_name2','columnfamily_name2');[[BR]]C = A + B;{{{    }}}'''// c,,ij,, = a,,ij,, + b,,ij,, (i : row key, j : column key)''' ''-~ ||
||Subtraction ||<99%>'''Subtracting''' entries with the same indices.[[BR]][[BR]]~-''A = Matrix('table_name1','columnfamily_name1');[[BR]]B = Matrix('table_name2','columnfamily_name2');[[BR]]C = A - B;{{{    }}}'''// c,,ij,, = a,,ij,, - b,,ij,, (i : row key, j : column key)''' ''-~ ||
||Multiplication ||<99%>'''Multiplication''' of two matrices, Product C of two matrices A and B.[[BR]][[BR]]~-''A = Matrix('table_name1','columnfamily_name1');[[BR]]B = Matrix('table_name2','columnfamily_name2');[[BR]]C = A * B;{{{    }}}'''//C = A · B''' ''-~ ||
||Division ||<99%>'''Division''' is solving the matrix equation AX = B for X.[[BR]][[BR]]~-''A = Matrix('table_name1','columnfamily_name1');[[BR]]B = Matrix('table_name2','columnfamily_name2');[[BR]]C = A /[or \] B;{{{    }}}'''// C = A / B''' ''-~||
||Transpose ||<99%>'''Transpose''' of a Matrix, A matrix which is formed by turning all the rows of a given matrix into columns and vice-versa.[[BR]][[BR]]~-''A = Matrix('table_name1','columnfamily_name1');[[BR]]B = Transpose(A);{{{    }}}'''// B = A'''' ''-~||


'''(ex. 1)''' Matrix Addition
[[BR]]~-'''''C = A + B = (a,,ij,, + b,,ij,,)'''''-~

{{{
//Set up the matrix A, B from mapped matrix in hbase.

Hbase.altools > A = Matrix('m_table','cf_1');
Hbase.altools > B = Matrix('m_table','cf_2');
Hbase.altools > C = A + B;
}}}


'''(ex. 2)''' The product C of two matrices A and B
[[BR]]~-'''''C,,ij,, = ΣA,,ik,,B,,kj,, (1 ≤ i ≤ m , 1 ≤ j ≤n)'''''-~

{{{
//Set up the matrix A, B from mapped matrix in hbase.

Hbase.altools > A = Matrix('m_table','cf_1');
Hbase.altools > B = Matrix('m_table','cf_2');
Hbase.altools > C = A * B;
}}}

=== Factorization and Decomposition Operators ===

||<bgcolor="#E5E5E5">'''Function''' ||<bgcolor="#E5E5E5">'''Explanation''' ||
||LU ||<99%>'''LU Decomposition'''[[BR]]A procedure for decomposing an N by N matrix A into a product of a lower triangular matrix L and an upper triangular matrix U, LU = A.[[BR]]'''Functions''' : ~-''getL(), getU(), isSingular(), getPivot()''-~ [[BR]][[BR]]~-''A = Matrix('table_name','columnfamily_name');[[BR]]B = LUDecomposition(A);[[BR]]C = B.getU();[[BR]]D = B.getL();''-~||
||QR ||<99%>'''QR Decomposition'''[[BR]]For an m-by-n matrix A with m >= n, the QR decomposition is an m-by-n orthogonal matrix Q and an n-by-n upper triangular matrix R so that A = Q*R.[[BR]]'''Functions''' : ~-''getH(), getQ(), getR()''-~[[BR]][[BR]]~-''A = Matrix('table_name','columnfamily_name');[[BR]]B = QRDecomposition(A);[[BR]]C = B.getH();''-~||
||Cholesky ||<99%>'''Cholesky Decomposition'''[[BR]]It is a special case of LU decomposition applicable only if matrix to be decomposed is symmetric positive definite.[[BR]]'''Functions''' : ~-''getL(), getU(), isSPD()''-~ [[BR]][[BR]]~-''A = Matrix('table_name','columnfamily_name');[[BR]]B = CholeskyDecomposition(A);[[BR]]C = B.getL();''-~||
||SVD ||<99%>'''SVD(Singular Value Decomposition)'''[[BR]]For an m-by-n matrix A with m >= n, the singular value decomposition is an m-by-n orthogonal matrix U, an n-by-n diagonal matrix S, and an n-by-n orthogonal matrix V so that A = U*S*V'.[[BR]]'''Functions''' : ~-''getS(), getU(), getV(), norm2(), rank()''-~ [[BR]][[BR]]~-''A = Matrix('table_name','columnfamily_name');[[BR]]B = SVDdecomposition(A);[[BR]]C = B.getU();''-~||

{{{
//Set up the matrix M from mapped matrix in hbase.
Hbase.altools > M = Matrix('m_table','cf_1'); 

M ([1, 2],
   [3, 4])
}}}

'''(ex. 1)''' To find the Singular Value decomposition in Altools, do the following:
[[BR]]~-'''''M = UΣV*'''''-~

{{{
Hbase.altools > A = M.SVDdecomposition();
Hbase.altools > U = A.getU();
Hbase.altools > S = A.getS();
Hbase.altools > V = A.getV();

U ([[-0.40455358, -0.9145143 ],
    [-0.9145143 ,  0.40455358]])

S ([ 5.4649857 ,  0.36596619])

V ([[-0.57604844, -0.81741556],
    [ 0.81741556, -0.57604844]])
}}}

'''(ex. 2)''' To find the QR decomposition in Altools, do the following:
[[BR]]~-'''''M = QR'''''-~

{{{
Hbase.altools > A = M.QRDecomposition();
Hbase.altools > U = A.getQ();
Hbase.altools > U = A.getR();

Q ([[-0.31622777, -0.9486833 ],
    [-0.9486833 ,  0.31622777]])

R ([[-3.16227766, -4.42718872],
    [ 0.        , -0.63245553]])
}}}
----
= Example Of Altools Use =
== Latent Semantic Analysis by SVD ==
Latent semantic analysis (LSA) is a technique in natural language processing, in particular in vectorial semantics, of analyzing relationships between a set of documents and the terms they contain by producing a set of concepts related to the documents and terms. LSA was patented in 1988 by Scott Deerwester, Susan Dumais, George Furnas, Richard Harshman, Thomas Landauer, Karen Lochbaum and Lynn Streeter. In the context of its application to information retrieval, it is sometimes called latent semantic indexing (LSI). -- wikipedia

This LSA example used SVD decomposition with k=3. 

The SVD is typically computed using large matrix methods (for example, Lanczos methods) but may also be computed incrementally and with greatly reduced resources via a neural network-like approach which does not require the large, full-rank matrix to be held in memory -- wikipedia

 * ~-'''NOTATION'''-~
  * ~-''T,,0,,'' : orthogonal, unit-length columns-~
  * ~-''D,,0,,'' : orthogonal, unit-length columns-~
  * ~-''S,,0,,'' : eigenvalues of a diagonal matrix-~
  * ~-''t'' : Row number of matrix X-~
  * ~-''d'' : Column number of matrix X-~
  * ~-''m'' : Rank of matrix X ( ≤ min(t,d) )-~

I made a new relation "t_table" using "relational operators" from raw data(web_table) as described below :
[[BR]]~-''TODO: do explain "theoretical way of manipulating table(web_table) using relational operators"''-~

||Row Key ||<-6>Column Families ||
||<rowbgcolor="#ececec">term ||<-2> corpus: ||<-2>link: ||<-2>.. ||
||t01 ||corpus:c01 ||0 ||link:cnn.com ||1 ||.. ||.. ||
|| ||corpus:c03 ||0 ||  ||  ||.. ||.. ||
||t02 ||corpus:c01 ||1 || || ||.. ||.. ||
|| ||corpus:c02 ||0 ||link:google.com  ||0  || || ||
||.. ||.. ||.. ||.. ||.. ||.. ||.. ||

{{{
//Set up the matrix X
Hbase.altools > X = Matrix('t_table','corpus'); 
}}}

This example used tf*idf for more efficient and economical calculations.

 * ~-''w,,ij,, = tf,,ij,, '''·''' log( N/df,,j,, )''-~
  * ~-''w'' : weighted value-~
  * ~-''tf'' : the number of times the word appears in a i-th document-~
  * ~-''N'' : the number of documents of an entire corpus-~
  * ~-''df'' : the number of documents containing the j-th term-~

||<rowbgcolor="#ececec"> ||c01 ||c02 ||c03 ||.. ||
||<bgcolor="#ececec">t01 ||0.00 ||0.00 ||0.15 ||.. ||
||<bgcolor="#ececec">t02 ||0.08 ||0.00 ||0.00 ||.. ||
||<bgcolor="#ececec">t02 ||0.00 ||0.26 ||0.00 ||.. ||
||<bgcolor="#ececec">.. ||.. ||.. ||.. ||.. ||

Weighted Matrix W

{{{
//Diagonal eigenvalue S

Hbase.altools > M = W.SVDdecomposition(); 
Hbase.altools > S = M.getS();
}}}

 * ~-''X = T,,0,,S,,0,,D,,0,,''-~

{{{
//Othonormal eigenvector T, D

Hbase.altools > T = M.getU();
Hbase.altools > D = M.getV();
}}}

SVD allows a simple strategy for optimal approximate fit using smaller matrices. If the singular
values in S0 are ordered by size, the first k largest values may be kept and the remaining smaller ones
set to zero. The product of the resulting matrices is a matrix X′ which is only approximately equal to X,
and is of rank k. Since zeros were introduced into S0, the representation can be simplified by deleting
the zero rows and columns of S0 to obtain a new diagonal matrix S, and deleting the corresponding
columns of T0 and D0 to obtain T and D respectively. The result is a reduced model :

 * ~-''X′ = TSDT ≈ X''-~

which is the rank-k model with the best possible least square fit to X.

||<rowbgcolor="#ececec"> ||c01 ||c02 ||c03 ||.. ||
||<bgcolor="#ececec">t01 ||0.18 ||0.03 ||0.12 ||.. ||
||<bgcolor="#ececec">t02 ||0.08 ||0.00 ||0.30 ||.. ||
||<bgcolor="#ececec">t02 ||0.00 ||0.26 ||0.07 ||.. ||
||<bgcolor="#ececec">.. ||.. ||.. ||.. ||.. ||

Reduced Matrix R

----
= Papers =
 * [http://www.uib.no/People/nmabh/art/hpj.pdf High performance numerical libraries in Java]
 * ''[http://labs.google.com/papers/bigtable.html Bigtable] : A Distributed Storage System for Structured Data''
 * ''Interpreting the Data: Parallel Analysis with [http://labs.google.com/papers/sawzall.html Sawzall]''
 * ''Y!'s Research Project : [http://research.yahoo.com/project/pig Pig] Document''
 * ''[http://portal.acm.org/citation.cfm?doid=1247480.1247602 Map-Reduce-Merge] : Simplified Relational Data Processing on Large Clusters''
 * ''[http://www.pytables.org/ PyTables] : Hierarchical Datasets in Python''
 * ''[http://numpy.scipy.org/ Numpy] : Scientific Tools for Python''
 * ''[http://db.lcs.mit.edu/projects/cstore/ C-Store] : A Column Oriented DBMS''