Posted to dev@rya.apache.org by Brian McBride <br...@epimorphics.com> on 2016/03/23 12:09:37 UTC
Spark RDD
I'm looking for options for processing RDF data in a Hadoop-style cluster.
My understanding from my reading so far is that using Rya, I could have
a quad store with the data partitioned across my cluster. I could send
SPARQL queries to Rya's SPARQL endpoint.
I would like to do bulk, Hadoop-style processing of the RDF data in my
quad store.
I am wondering whether there is a supported interface (i.e. one that
does not involve delving under the hood) that would allow me to
distribute my application across the cluster and have the processing
occur close to the machines on which the data is held.
Something that would return, for example, a Spark RDD of quads matching
some filter.
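To make the idea concrete, here is a rough sketch of the kind of interface I mean. It is plain Scala modelling only the filter semantics; `RdfQuad` and `quadsMatching` are invented names, not an existing Rya API, and a real version would return a Spark RDD backed by the Accumulo tablets rather than a local Seq:

```scala
// Hypothetical sketch, not Rya code. RdfQuad and quadsMatching are
// invented names; a real implementation would return an
// org.apache.spark.rdd.RDD partitioned to match the underlying tablets.
case class RdfQuad(subject: String, predicate: String,
                   obj: String, context: String) // "obj" since "object" is reserved

object QuadStoreSketch {
  // Stand-in for the partitioned quad store; in Spark this would be an RDD.
  val quads: Seq[RdfQuad] = Seq(
    RdfQuad("ex:alice", "foaf:knows", "ex:bob",     "ex:g1"),
    RdfQuad("ex:bob",   "foaf:knows", "ex:carol",   "ex:g1"),
    RdfQuad("ex:alice", "foaf:name",  "\"Alice\"",  "ex:g2")
  )

  // Return all quads matching a quad pattern; None acts as a wildcard,
  // mirroring a pattern filter pushed down to each partition.
  def quadsMatching(s: Option[String], p: Option[String],
                    o: Option[String], c: Option[String]): Seq[RdfQuad] =
    quads.filter { q =>
      s.forall(_ == q.subject) && p.forall(_ == q.predicate) &&
      o.forall(_ == q.obj) && c.forall(_ == q.context)
    }
}
```

With Spark, the same filter would run on executors co-located with the tablet servers, which is the data locality I am after.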
Brian
--
Epimorphics Ltd, http://www.epimorphics.com
Registered address: Court Lodge, 105 High Street, Portishead, Bristol BS20 6PT
Epimorphics Ltd. is a limited company registered in England (number 7016688)
Re: Spark RDD
Posted by "Aaron D. Mihalik" <aa...@gmail.com>.
You should take a look at the Pig script processing in Rya [1][2] and the
Fluo work that was just checked in [3].
The accumulo.pig transforms a SPARQL query into a Pig script which later
becomes a series of map-reduce jobs.
The rya.pcj.fluo is very similar, but uses a mini-batch approach to
incrementally update results.
We'd greatly appreciate a Spark/RDD implementation, if you care to work on
one :)
--Aaron
[1]
https://github.com/apache/incubator-rya/blob/develop/extras/rya.manual/src/site/markdown/loadPrecomputedJoin.md
[2] https://github.com/apache/incubator-rya/tree/develop/pig/accumulo.pig
[3] https://github.com/apache/incubator-rya/tree/develop/extras/rya.pcj.fluo
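To give a feel for why a SPARQL basic graph pattern becomes a series of joins (and hence a series of map-reduce jobs): each variable shared between triple patterns is a join key. A toy, self-contained Scala illustration, not Rya's actual translation code:

```scala
// Sketch: the basic graph pattern { ?a foaf:knows ?b . ?b foaf:knows ?c }
// joins two scans of the foaf:knows predicate on the shared variable ?b.
// Each such join is roughly what one map-reduce stage computes.
object BgpJoinSketch {
  // Toy data: (subject, object) pairs for the foaf:knows predicate.
  val knows: Seq[(String, String)] = Seq(
    ("alice", "bob"), ("bob", "carol"), ("bob", "dave"))

  // Hash join on ?b, analogous to a map-reduce join keyed on the
  // shared variable: group one side by its join key, then probe it.
  def friendsOfFriends: Seq[(String, String, String)] = {
    val bySubject = knows.groupBy(_._1)
    for {
      (a, b) <- knows
      (_, c) <- bySubject.getOrElse(b, Seq.empty)
    } yield (a, b, c)
  }
}
```

In the generated Pig script each such join becomes, roughly, one shuffle stage keyed on the shared variable, which is how one query turns into several jobs.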
On Wed, Mar 23, 2016 at 8:43 AM Brian McBride <br...@epimorphics.com> wrote:
> I'm looking for options for processing RDF data in a hadoop style cluster.
> [...]
Re: Spark RDD
Posted by Brian McBride <br...@epimorphics.com>.
Aaron replied (I can see his reply in the mail archive, but it has not
reached my in-tray yet):
[[
You should take a look at the Pig script processing in Rya [1][2] and the
Fluo work that was just checked in [3].
The accumulo.pig transforms a SPARQL query into a Pig script which later
becomes a series of map-reduce jobs.
The rya.pcj.fluo is very similar, but uses a mini-batch approach to
incrementally update results.
We'd greatly appreciate a Spark/RDD implementation, if you care to work on
one :)
]]
Thanks for the pointers, Aaron. I will look at those.
Re the Spark/RDD implementation: it's not impossible that I might have a
go at that, but I wouldn't hold your breath. My current task is short
term, and it's not clear what happens next or at what pace.
Brian
On 23/03/16 11:09, Brian McBride wrote:
> I'm looking for options for processing RDF data in a hadoop style
> cluster.
> [...]