Posted to common-commits@hadoop.apache.org by Apache Wiki <wi...@apache.org> on 2007/09/13 11:29:25 UTC

[Lucene-hadoop Wiki] Trivial Update of "Hbase/RDF" by udanax

Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Lucene-hadoop Wiki" for change notification.

The following page has been changed by udanax:
http://wiki.apache.org/lucene-hadoop/Hbase/RDF

The comment on the change is:
Sync from hadoop.co.kr

------------------------------------------------------------------------------
  [[TableOfContents(4)]]
  ----
- == HbaseRDF, an Hbase Subsystem for RDF ==
- 
   -- ''Volunteers and any comments on HbaseRDF are welcome.''
+ == HbaseRDF, a Planet-Scale RDF Data Store ==
  
  We have started to think about storing and querying RDF data in Hbase. But we'll jump into its implementation only after a prudent investigation. 
  
- We call for the introduction of an Hbase subsystem for RDF, called HbaseRDF, which uses Hbase + MapReduce to store RDF data and execute queries (e.g., SPARQL) on them.
+ We introduce an Hbase subsystem for RDF, called HbaseRDF, which uses Hbase + MapReduce to store RDF data and execute queries (e.g., SPARQL) on them.
  We can store very sparse RDF data in a single table in Hbase, with as many columns as 
  needed. For example, we might make a row for each RDF subject in a table and store all of its properties and their values as columns in the table. 
  This reduces the costly self-joins needed to answer queries about the same subject, resulting in more efficient query processing, although we still need self-joins to answer RDF path queries.
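The row-per-subject layout above can be sketched with plain Python dicts (illustrative only — no real Hbase API; all URIs and property names are made up):

```python
# One row per RDF subject; one column per property; multiple values per column.
table = {
    "http://example.org/alice": {
        "rdf:type": ["foaf:Person"],
        "foaf:name": ["Alice"],
        "foaf:knows": ["http://example.org/bob"],
    },
    "http://example.org/bob": {
        "rdf:type": ["foaf:Person"],
        "foaf:name": ["Bob"],
    },
}

def properties_of(subject, props):
    """All requested properties come from a single row lookup -- no self-join."""
    row = table.get(subject, {})
    return {p: row.get(p, []) for p in props}

print(properties_of("http://example.org/alice", ["foaf:name", "foaf:knows"]))
```

A query asking several properties of one subject touches exactly one row, which is the point of the wide-table design.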
@@ -23, +22 @@

   
  === Initial Contributors ===
  
-  * [:udanax:Edward Yoon] [[MailTo(webmaster AT SPAMFREE udanax DOT org)]] (Research and Development center, NHN corp.)
-  * [:InchulSong: Inchul Song] [[MailTo(icsong AT SPAMFREE gmail DOT com)]] (Database Lab, KAIST) 
- 
+  * [:udanax:Edward Yoon] (R&D center, NHN corp.)
+  * [:InchulSong: Inchul Song] (Database Lab, KAIST) 
+ ----
  == Some Ideas ==
  When we store RDF data in a single Hbase table and process queries over it, an important issue to consider is how to efficiently perform the costly self-joins needed to process RDF path queries. 
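To see why path queries force self-joins, here is a toy sketch (plain Python, hypothetical data): a pattern such as `?x foaf:knows ?y . ?y foaf:name ?n` resolves its first step from one row, then must probe the same table again for each intermediate subject.

```python
table = {
    "alice": {"foaf:knows": ["bob"], "foaf:name": ["Alice"]},
    "bob": {"foaf:name": ["Bob"]},
}

def path_query(start, p1, p2):
    """Follow a 2-step property path; the second lookup is a self-join."""
    results = []
    for mid in table.get(start, {}).get(p1, []):      # first step: one row
        for val in table.get(mid, {}).get(p2, []):    # second step: re-probe the table
            results.append((mid, val))
    return results

print(path_query("alice", "foaf:knows", "foaf:name"))  # [('bob', 'Bob')]
```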
  
@@ -54, +53 @@

  we are ready to do massively parallel query processing on a tremendous amount of RDF data.
  Currently, C-Store shows the best query performance on RDF data.
  However, we, armed with Hbase and MapReduceMerge, can do even better.
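The merge step that Map-Reduce-Merge adds on top of MapReduce can be illustrated with a toy sorted-merge join (this is only a sketch of the idea, not the SIGMOD 2007 framework; the two inputs stand in for the sorted outputs of two map/reduce pipelines):

```python
def merge_join(left, right):
    """Merge two (key, value) lists sorted by key, emitting joined triples.
    Assumes keys are unique within each input, as in this toy example."""
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        lk, lv = left[i]
        rk, rv = right[j]
        if lk == rk:
            out.append((lk, lv, rv))
            i += 1
            j += 1
        elif lk < rk:
            i += 1
        else:
            j += 1
    return out

# e.g. joining two per-property outputs on subject
knows = [("alice", "bob"), ("carol", "dave")]
name = [("alice", "Alice"), ("bob", "Bob"), ("carol", "Carol")]
print(merge_join(knows, name))  # [('alice', 'bob', 'Alice'), ('carol', 'dave', 'Carol')]
```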
- 
+ ----
  == Resources ==
   * http://www.w3.org/TR/rdf-sparql-query/ - The SPARQL RDF Query Language, a candidate recommendation of W3C as of 14 June 2007.
   * A test suite for SPARQL can be found at http://www.w3.org/2001/sw/DataAccess/tests/r2. The web page provides test RDF data, SPARQL queries, and expected results.
@@ -77, +76 @@

  
  Query processing steps are as follows:
  
-  * Parsing, in which a parse tree, representing the SPARQL query is constructed.
-  * Query rewrite, in which the parse tree is converted to an initial query plan, which is, in turn, transformed into an equivalent plan that is expected to require less time to execute. We have to choose which algorithm to use for each operation in the selected plan. Among them are parallel versions of algorithms, such as parallel joins with MapReduceMerge.
-  * Execute the plan
-  
+ {{{
+ SPARQL query -> Parse tree -> Logical operator tree 
+ -> Physical operator tree -> Execution
+ }}}
+ 
+ Implementation of each step may proceed as an individual issue. 
+ 
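The pipeline above could be stubbed out as follows (every function name is hypothetical, and each body is a trivial stand-in for the real stage — e.g. the physical planner would actually choose among algorithms such as parallel joins):

```python
def parse(query):                      # SPARQL query -> parse tree
    subject, prop = query.split()[:2]  # toy "parser" for a 1-pattern query
    return {"subject": subject, "property": prop}

def logical_plan(tree):                # parse tree -> logical operator tree
    return ("project", ("select", tree["subject"], tree["property"]))

def physical_plan(plan):               # logical -> physical operator tree
    return ("row_lookup",) + plan[1][1:]  # here: always pick a row lookup

def execute(plan, table):              # run against the row-per-subject table
    _, subject, prop = plan
    return table.get(subject, {}).get(prop, [])

table = {"alice": {"foaf:name": ["Alice"]}}
plan = physical_plan(logical_plan(parse("alice foaf:name")))
print(execute(plan, table))  # ['Alice']
```

Keeping the stages behind separate functions is what lets each step be implemented as an individual issue.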
  === HbaseRDF Data Materializer ===
  HbaseRDF Data Materializer (HDM) pre-computes RDF path queries and stores the results
  into an Hbase table. Later, HQP uses those materialized data for efficient processing of 
@@ -106, +108 @@

  
  Hbase > 
  }}}
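The materialization idea behind HDM can be sketched in plain Python (illustrative only, no real Hbase API): pre-compute a 2-step path once, store the result as a new column, and let later queries read it back with a single lookup instead of a self-join.

```python
table = {
    "alice": {"foaf:knows": ["bob"]},
    "bob": {"foaf:name": ["Bob"]},
}

def materialize_path(p1, p2, out_col):
    """Pre-compute the path p1 -> p2 for every subject and store it as out_col."""
    for subject, row in table.items():
        values = [v
                  for mid in row.get(p1, [])
                  for v in table.get(mid, {}).get(p2, [])]
        if values:
            row[out_col] = values  # stored result of the path query

materialize_path("foaf:knows", "foaf:name", "knows.name")
print(table["alice"]["knows.name"])  # ['Bob']
```

After materialization, the path query costs one row read, at the price of keeping the stored column up to date when the base data changes.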
- 
+ ----
  == Alternatives ==
   * A triples table stores RDF triples in a single table with three attributes: subject, property, and object.
  
   * A property table puts properties that are frequently queried together into a single table to reduce costly self-joins. Used in Jena and Oracle. 
  
   * A decomposed storage model (DSM): one table for each property, sorted by the subject. Used in C-Store.
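The three alternative layouts can be sketched side by side with plain Python structures (illustrative data only):

```python
# 1. Triples table: one relation with (subject, property, object) rows.
triples = [
    ("alice", "foaf:knows", "bob"),
    ("alice", "foaf:name", "Alice"),
    ("bob", "foaf:name", "Bob"),
]

# 2. Property table: frequently co-queried properties as columns of one row.
property_table = {
    "alice": {"foaf:name": "Alice", "foaf:knows": "bob"},
    "bob": {"foaf:name": "Bob"},
}

# 3. Decomposed storage model: one (subject, value) table per property,
#    each sorted by subject, as in C-Store.
dsm = {
    "foaf:name": [("alice", "Alice"), ("bob", "Bob")],
    "foaf:knows": [("alice", "bob")],
}

# The same fact is reachable in all three layouts:
assert ("alice", "foaf:name", "Alice") in triples
assert property_table["alice"]["foaf:name"] == "Alice"
assert ("alice", "Alice") in dsm["foaf:name"]
```

The trade-off: the triples table needs a self-join per query step, the property table avoids joins for co-queried properties, and DSM trades per-property locality against joins across properties.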
+ ----
-   * ''Actually, the discomposed storage model is almost the same as the storage model in Hbase.''
- 
  == Papers ==
  
-  * ''OSDI 2004, MapReduce: Simplified Data Processing on Large Clusters"[[BR]]- proposes a very simple, but powerfull, and highly parallelized data processing technique.''
-  * ''CIDR 2007, [http://db.lcs.mit.edu/projects/cstore/abadicidr07.pdf Column-Stores For Wide and Sparse Data][[BR]]- discusses the benefits of using C-Store to store RDF and XML data.''
-  * ''VLDB 2007, [http://db.lcs.mit.edu/projects/cstore/abadirdf.pdf Scalable Semantic Web Data Management Using Vertical Partitoning][[BR]]- proposes an efficient method to store RDF data in table projections (i.e., columns) and executes queries on them.''
-  * ''SIGMOD 2007, Map-Reduce-Merge: Simplified Relational Data Processing on Large Clusters[[BR]]- MapReduce implementation of several relational operators.''
+  * OSDI 2004, ''MapReduce: Simplified Data Processing on Large Clusters''
+   * proposes a very simple but powerful, highly parallelized data processing technique.
+  * CIDR 2007, ''[http://db.lcs.mit.edu/projects/cstore/abadicidr07.pdf Column-Stores For Wide and Sparse Data]''
+   * discusses the benefits of using C-Store to store RDF and XML data.
+  * VLDB 2007, ''[http://db.lcs.mit.edu/projects/cstore/abadirdf.pdf Scalable Semantic Web Data Management Using Vertical Partitioning]''
+   * proposes an efficient method to store RDF data in table projections (i.e., columns) and execute queries on them.
+  * SIGMOD 2007, ''Map-Reduce-Merge: Simplified Relational Data Processing on Large Clusters''
+   * MapReduce implementation of several relational operators.