Posted to dev@jena.apache.org by Rob Vesse <rv...@dotnetrdf.org> on 2014/05/29 16:16:21 UTC

Hadoop RDF Tools - What's Next?

Hey All

As people have probably seen, the IP Clearance for the Hadoop RDF Tools
donation is now complete, so we can officially adopt the code and start
developing it further.  This email is to get community input on what should
be a priority and on any final clean up tasks needed.

Firstly in terms of clean up I think there are two things to do:
* Remove @author tags in Javadoc (this is a no-brainer and I'll do this
later today)
* Rename the libraries
Re: renaming - I am concerned that by calling these libraries Hadoop RDF
Tools we are falling foul of the ASF trademarks policy.  In order to
properly comply with the policy it would be best to rename the libraries as
"Apache Jena RDF Tools for Apache Hadoop" thus making it clear that it is
the Apache Jena project responsible for them and that they target Apache
Hadoop.  Does this naming make sense?

Alongside this I think we should also add the jena- prefix to all the Maven
artifact IDs, e.g. hadoop-rdf-common -> jena-hadoop-rdf-common, again making
it clear that these artifacts are from the Jena project.  Though the
org.apache.jena group ID mostly serves that purpose, it won't really hurt to
be thorough in this regard.

Secondly, there is the issue of what's next development-wise.  Those who have
read the presentation attached to the associated JIRA (JENA-666) will know
that it enumerated a bunch of future work based on Cray's thinking about the
project.  Now that this is part of Apache Jena, the future direction should
be driven by community needs.  The main directions for future work that I
personally am considering right now are as follows:
1. Clean up and fixes to existing code base
2. Native node and tuple containers i.e. Paul Houle's suggestion about
storing things as native types wherever possible and lazily translating to
Node/Triple/Quad etc only when necessary
3. Using binary comparisons wherever possible to avoid deserialisation costs
and boost performance
4. Improving configuration e.g. specifying namespaces to use for output
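
As a rough illustration of how items 2 and 3 could fit together, here is a
minimal sketch.  It is not code from the donated libraries: LazyTerm and all
of its methods are hypothetical names, and it uses a plain UTF-8 string as a
stand-in for a real Node encoding.  The idea is to keep the serialized bytes
around, compare on those bytes directly, and only decode when a caller
actually asks for the value.  Note that raw byte order is consistent but is
not necessarily the same as any semantic ordering of RDF terms, which is
usually fine for sorting and grouping in a shuffle.

```java
import java.nio.charset.StandardCharsets;

/** Hypothetical container: holds a term in serialized form and decodes it lazily. */
final class LazyTerm implements Comparable<LazyTerm> {
    private final byte[] raw;  // serialized form (assumed here to be UTF-8 text)
    private String decoded;    // cached decoded value; null until first requested

    LazyTerm(byte[] raw) {
        this.raw = raw.clone();
    }

    static LazyTerm of(String lexicalForm) {
        return new LazyTerm(lexicalForm.getBytes(StandardCharsets.UTF_8));
    }

    /** Unsigned lexicographic comparison of the raw bytes; nothing is decoded. */
    static int compareRaw(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int x = a[i] & 0xff;
            int y = b[i] & 0xff;
            if (x != y) {
                return x - y;
            }
        }
        // Equal on the shared prefix: the shorter array sorts first
        return a.length - b.length;
    }

    @Override
    public int compareTo(LazyTerm other) {
        return compareRaw(this.raw, other.raw);
    }

    /** Decodes on first use only and caches the result. */
    String value() {
        if (decoded == null) {
            decoded = new String(raw, StandardCharsets.UTF_8);
        }
        return decoded;
    }

    /** True once value() has been called at least once. */
    boolean isDecoded() {
        return decoded != null;
    }
}
```

In a real Hadoop job the byte-level comparison would more likely be plugged
in through Hadoop's RawComparator interface, so that keys can be sorted
without being deserialized at all; the sketch above only shows the principle.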
I'd like to know what people in the community who are considering using
these libraries would like to see.  Are there obvious things I've not thought
of?  Would you prioritise the above items differently?  Any other thoughts?

Like all parts of Jena this should be owned and maintained by the community
as much as possible.  If people have things they'd like to have a go at
implementing themselves then please go ahead.  Feel free to drop emails to
this list with questions, thoughts, requests for help, etc.

Cheers,

Rob