
Posted to dev@nutch.apache.org by "Enis Soztutar (JIRA)" <ji...@apache.org> on 2010/04/02 14:02:27 UTC

[jira] Created: (NUTCH-808) Evaluate ORM Frameworks which support non-relational column-oriented datastores and RDBMs

Evaluate ORM Frameworks which support non-relational column-oriented datastores and RDBMs 
------------------------------------------------------------------------------------------

                 Key: NUTCH-808
                 URL: https://issues.apache.org/jira/browse/NUTCH-808
             Project: Nutch
          Issue Type: Task
            Reporter: Enis Soztutar
            Assignee: Enis Soztutar


We have an ORM layer in the NutchBase branch, which uses the Avro Specific Compiler to compile class definitions given in JSON. Before moving on with this, we might benefit from evaluating other frameworks to see whether they suit our needs. 

We want at least the following capabilities:
- Using POJOs 
- Able to persist objects to at least HBase, Cassandra, and RDBMs 
- Able to efficiently serialize objects as task outputs from Hadoop jobs
- Allow native queries, along with standard queries 




Any comments or suggestions for other frameworks are welcome.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Closed: (NUTCH-808) Evaluate ORM Frameworks which support non-relational column-oriented datastores and RDBMs

Posted by "Enis Soztutar (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/NUTCH-808?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Enis Soztutar closed NUTCH-808.
-------------------------------

    Resolution: Fixed

We have decided to go on with implementing an ORM layer as per the discussion on NUTCH-811. Closing this issue. 


[jira] Commented: (NUTCH-808) Evaluate ORM Frameworks which support non-relational column-oriented datastores and RDBMs

Posted by "Enis Soztutar (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/NUTCH-808?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12852840#action_12852840 ] 

Enis Soztutar commented on NUTCH-808:
-------------------------------------

A candidate framework is DataNucleus. It has the following benefits. 

- Apache 2 license. 
- JDO support 
- HBase, RDBMS, XML persistence. 

I will further investigate whether we can integrate Hadoop Writables/Avro serialization so that objects can be passed between MapReduce tasks. 



[jira] Commented: (NUTCH-808) Evaluate ORM Frameworks which support non-relational column-oriented datastores and RDBMs

Posted by "Enis Soztutar (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/NUTCH-808?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12856360#action_12856360 ] 

Enis Soztutar commented on NUTCH-808:
-------------------------------------

> What do you mean by current implementation? NutchBase?
Indeed. The package o.a.n.storage deals with ORM (though not all of its classes do).

> I know that Cascading has various Tap/Sink implementations including JDBC, HBase but also SimpleDB. Maybe it would be worth having a look at how they do it?
The way Cascading does this is to convert Tuples (the Cascading data structure) to HBase/JDBC records. The schema for HBase/JDBC is given as metadata. Since they deal only with the tuple -> table row mapping, it is not that difficult. But again, Cascading does not allow for mapping lists to columns, etc. 
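
A minimal sketch of the flat tuple -> table row idea described above, assuming the schema is just an ordered list of column names. The class TupleRowMapper and its methods are hypothetical names for illustration; this is not Cascading's actual API.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical flat mapper: one tuple position becomes one column. Not Cascading's API. */
final class TupleRowMapper {
    private final List<String> columnNames; // the schema, supplied as metadata

    TupleRowMapper(List<String> columnNames) {
        this.columnNames = new ArrayList<>(columnNames);
    }

    /** Pairs each tuple value with the column name at the same position. */
    Map<String, Object> toRow(List<?> tuple) {
        if (tuple.size() != columnNames.size()) {
            throw new IllegalArgumentException(
                "tuple has " + tuple.size() + " values, schema has " + columnNames.size() + " columns");
        }
        Map<String, Object> row = new LinkedHashMap<>();
        for (int i = 0; i < columnNames.size(); i++) {
            // A List or Map value would land in a single column as-is;
            // nothing here spreads it over several columns.
            row.put(columnNames.get(i), tuple.get(i));
        }
        return row;
    }

    public static void main(String[] args) {
        TupleRowMapper mapper = new TupleRowMapper(Arrays.asList("url", "status"));
        System.out.println(mapper.toRow(Arrays.asList("http://example.org/", 200)));
    }
}
```

Because every value goes to exactly one column, a mapper of this shape stays simple, which is also why it cannot express the list-to-columns mapping mentioned above.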

> My gut feeling would be to write a custom framework instead of relying on DataNucleus and use Avro if possible. I really think that HBase support is urgently needed but am less convinced that we need MySQL in the very short term. 
Yeah, the more I think about it, the more I come to terms with a custom implementation. However, I think we might benefit a lot from the ideas in JDO in the long term. Also, a JDBC implementation may not be relevant for large-scale deployments, but it will be a very nice side effect of the ORM layer: it will allow easy deployment, which in turn will hopefully bring more users. 


-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: https://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] Commented: (NUTCH-808) Evaluate ORM Frameworks which support non-relational column-oriented datastores and RDBMs

Posted by "Julien Nioche (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/NUTCH-808?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12856349#action_12856349 ] 

Julien Nioche commented on NUTCH-808:
-------------------------------------

Hi Enis,

> On the other hand, current implementation is ...

What do you mean by current implementation? NutchBase?

My gut feeling would be to write a custom framework instead of relying on DataNucleus and use Avro if possible. I really think that HBase support is urgently needed but am less convinced that we need MySQL in the very short term. 

I know that Cascading has various Tap/Sink implementations including JDBC, HBase but also SimpleDB. Maybe it would be worth having a look at how they do it?


[jira] Updated: (NUTCH-808) Evaluate ORM Frameworks which support non-relational column-oriented datastores and RDBMs

Posted by "Julien Nioche (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/NUTCH-808?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Julien Nioche updated NUTCH-808:
--------------------------------

    Fix Version/s: 2.0


[jira] Commented: (NUTCH-808) Evaluate ORM Frameworks which support non-relational column-oriented datastores and RDBMs

Posted by "Enis Soztutar (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/NUTCH-808?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12856124#action_12856124 ] 

Enis Soztutar commented on NUTCH-808:
-------------------------------------

So, these are the results so far: 

DataNucleus was previously known as JPOX, and it was the reference implementation for Java Data Objects (JDO). JDO is a Java standard for persistence. A similar specification, JPA, is also a persistence standard, which was forked from EJB 3. However, JPA is designed for RDBMSs only, so it will not be useful for us (http://www.datanucleus.org/products/accessplatform/persistence_api.html). 

In JDO, the first step is to define the domain objects as POJOs. Then, the persistence metadata is specified using annotations, XML, or both. Then a bytecode enhancer uses instrumentation to add the required methods to the classes marked as @PersistenceCapable. The database tables can be created by hand, automatically by DataNucleus, or by using a tool (SchemaTool). 
The persistence layer uses standard JDO syntax, which is similar to JDBC. The objects can be queried using JPQL. 

I have run a small test to persist objects of the WebTableRow class (from the NutchBase branch) to both MySQL and HBase. Although it took me a fair bit of time to set up, I was able to persist objects to both. 

However, although it is possible to map complex fields (like lists, maps, arrays, etc.) to RDBMSs using different strategies (such as serializing directly, using joins, or using foreign keys), I was not able to find a way to leverage the HBase data model. For example, we want to be able to map lists and maps to columns in column families. Without such functionality, using column-oriented stores does not bring any advantage. 
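
To make the wanted mapping concrete, here is a minimal sketch of spreading a map-valued field over one column family, with one column qualifier per map entry. It uses only the Java standard library; ColumnFamilyMapper, the "family:qualifier" key format, and the "ol" family name are assumptions for illustration, not code from NutchBase, DataNucleus, or the HBase client.

```java
import java.nio.charset.StandardCharsets;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical helper: one map entry becomes one cell in a column family. */
final class ColumnFamilyMapper {

    /** Returns cells keyed by "family:qualifier", sorted by key. */
    static Map<String, byte[]> mapToCells(String family, Map<String, String> field) {
        Map<String, byte[]> cells = new TreeMap<>();
        for (Map.Entry<String, String> e : field.entrySet()) {
            cells.put(family + ":" + e.getKey(), e.getValue().getBytes(StandardCharsets.UTF_8));
        }
        return cells;
    }

    /** Inverse mapping: collects every cell of the given family back into a map. */
    static Map<String, String> cellsToMap(String family, Map<String, byte[]> cells) {
        String prefix = family + ":";
        Map<String, String> field = new TreeMap<>();
        for (Map.Entry<String, byte[]> e : cells.entrySet()) {
            if (e.getKey().startsWith(prefix)) {
                String qualifier = e.getKey().substring(prefix.length());
                field.put(qualifier, new String(e.getValue(), StandardCharsets.UTF_8));
            }
        }
        return field;
    }

    public static void main(String[] args) {
        Map<String, String> outlinks = new TreeMap<>();
        outlinks.put("http://a.example/", "Site A");
        outlinks.put("http://b.example/", "Site B");
        Map<String, byte[]> cells = mapToCells("ol", outlinks);
        System.out.println(cells.keySet());
        System.out.println(cellsToMap("ol", cells));
    }
}
```

With this layout, a single entry can be read or updated without deserializing the whole collection, which is the kind of advantage a serialized-blob mapping gives up.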

For the byte[] serialization for MapReduce, we can either implement a new datastore for DataNucleus, which also implements Hadoop's Serialization, or use Avro to generate Java classes to be fed into the JPOX enhancer, or else manually implement Writable. 
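
As a rough illustration of the last option, the sketch below shows the shape of a manual Writable implementation. To stay self-contained it declares a local interface, WritableLike, with the same two methods as org.apache.hadoop.io.Writable; in a real job the class would implement Hadoop's interface instead. PageRecord and its fields are hypothetical and are not the fields of WebTableRow.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

/** Local stand-in mirroring the two methods of org.apache.hadoop.io.Writable. */
interface WritableLike {
    void write(DataOutput out) throws IOException;
    void readFields(DataInput in) throws IOException;
}

/** Hypothetical record with illustrative fields. */
final class PageRecord implements WritableLike {
    String url = "";
    int status;
    long fetchTime;

    @Override
    public void write(DataOutput out) throws IOException {
        out.writeUTF(url);
        out.writeInt(status);
        out.writeLong(fetchTime);
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        // Fields must be read in exactly the order they were written.
        url = in.readUTF();
        status = in.readInt();
        fetchTime = in.readLong();
    }

    /** Serializes this record to a byte array. */
    byte[] toBytes() {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buf)) {
            write(out);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buf.toByteArray();
    }

    /** Rebuilds a record from bytes produced by toBytes(). */
    static PageRecord fromBytes(byte[] bytes) {
        PageRecord record = new PageRecord();
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(bytes))) {
            record.readFields(in);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return record;
    }

    public static void main(String[] args) {
        PageRecord original = new PageRecord();
        original.url = "http://example.org/";
        original.status = 200;
        original.fetchTime = 1270216947000L;
        PageRecord copy = fromBytes(original.toBytes());
        System.out.println(copy.url + " " + copy.status + " " + copy.fetchTime);
    }
}
```

The cost of this route is that every field has to be kept in sync by hand in both methods, which is what generating the classes (for example from an Avro schema) would avoid.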

To sum up, DataNucleus brings the following advantages:
- out-of-the-box RDBMS support 
- XML or annotation metadata
- JDO is a Java standard 
- standard query interface
- JSON support

The disadvantages of using DataNucleus would be:
- JDO is rather complex; implementing a datastore is not trivial
- We need to write patches to DataNucleus to flexibly map complex fields to leverage HBase's data model
- We have no control over the source code
- no native HBase support (for example using filters, etc.)

On the other hand, the current implementation 
- is tested in production, 
- can leverage the HBase data model, 
- can be modified to work with Avro serialization directly, 
- could gain Cassandra support with little effort,
- can support multiple languages (in the future)

I believe that having SQLite, MySQL and HBase support is critical for Nutch 2.0, for out-of-the-box use, ease of deployment and real-scale computing respectively. But obviously we cannot use DataNucleus out of the box either. 


ORM is inherently a hard problem. I propose we go ahead and make the changes to DataNucleus to see if it is feasible, and continue with it if it suits our needs. Of course, having a custom framework will also be great, so any feedback would be more than welcome. 
