Posted to dev@rya.apache.org by Brian McBride <br...@epimorphics.com> on 2016/03/23 12:09:37 UTC

Spark RDD

I'm looking for options for processing RDF data in a Hadoop-style cluster.

My understanding from my reading so far is that, using Rya, I could have 
a quad store with the data partitioned across my cluster.  I could send 
SPARQL queries to Rya's SPARQL endpoint.

I would like to do bulk, Hadoop-style processing of the RDF data in my 
quad store.

I am wondering whether there is a supported interface (i.e. not delving 
under the hood) that would allow me to distribute my application across 
my cluster, with the processing happening close to the machines on which 
the data is held.

Something that would return, for example, a Spark RDD of quads matching 
some filter.
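
To make that concrete, here is a purely hypothetical sketch (in Scala) of 
the kind of interface I have in mind. None of these names exist in Rya; 
they are all invented for illustration.

import org.apache.spark.rdd.RDD
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical minimal quad representation.
case class Quad(subject: String, predicate: String, obj: String, graph: String)

object QuadRddSketch {
  // Hypothetical entry point: an RDD whose partitions line up with the
  // tablets that hold the data, so the filter runs close to it.
  def quads(sc: SparkContext): RDD[Quad] =
    ??? // would wrap the store's Hadoop InputFormat

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("quad-filter"))
    val matches = quads(sc).filter(_.predicate == "http://example.org/worksFor")
    println(matches.count())
  }
}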

Brian

-- 
Epimorphics Ltd, http://www.epimorphics.com
Registered address: Court Lodge, 105 High Street, Portishead, Bristol BS20 6PT
Epimorphics Ltd. is a limited company registered in England (number 7016688)


Re: Spark RDD

Posted by "Aaron D. Mihalik" <aa...@gmail.com>.
You should take a look at the Pig script processing in Rya [1][2] and the
Fluo work that was just checked in [3].

The accumulo.pig module transforms a SPARQL query into a Pig script, which
later becomes a series of MapReduce jobs.

The rya.pcj.fluo module is very similar, but uses a mini-batch approach to
incrementally update results.

We'd greatly appreciate a Spark/RDD implementation, if you care to work on
one :)
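
If you do, one plausible starting point (not a supported Rya API) is
Spark's generic Hadoop-RDD support over Accumulo's AccumuloInputFormat.
A minimal sketch in Scala; the instance name, ZooKeeper host, credentials,
and the "rya_spo" table name are placeholders (the table name assumes
Rya's default table prefix).

import org.apache.accumulo.core.client.ClientConfiguration
import org.apache.accumulo.core.client.mapreduce.AccumuloInputFormat
import org.apache.accumulo.core.client.security.tokens.PasswordToken
import org.apache.accumulo.core.data.{Key, Value}
import org.apache.accumulo.core.security.Authorizations
import org.apache.hadoop.mapreduce.Job
import org.apache.spark.{SparkConf, SparkContext}

object RyaSpoRdd {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("rya-spo-rdd"))

    // Configure a scan over Rya's SPO index table (placeholder settings).
    val job = Job.getInstance()
    AccumuloInputFormat.setConnectorInfo(job, "root", new PasswordToken("secret"))
    AccumuloInputFormat.setZooKeeperInstance(job,
      ClientConfiguration.loadDefault()
        .withInstance("accumulo")
        .withZkHosts("zoo1:2181"))
    AccumuloInputFormat.setInputTableName(job, "rya_spo")
    AccumuloInputFormat.setScanAuthorizations(job, new Authorizations())

    // One Spark partition per Accumulo tablet, so tasks can be scheduled
    // close to the tablet servers that hold the data.
    val rdd = sc.newAPIHadoopRDD(job.getConfiguration,
      classOf[AccumuloInputFormat], classOf[Key], classOf[Value])

    // Filtering is then an ordinary RDD transformation over the raw
    // key/value pairs; deserializing them into RDF statements is
    // Rya-specific and omitted here.
    println(rdd.count())
  }
}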

--Aaron

[1]
https://github.com/apache/incubator-rya/blob/develop/extras/rya.manual/src/site/markdown/loadPrecomputedJoin.md
[2] https://github.com/apache/incubator-rya/tree/develop/pig/accumulo.pig
[3] https://github.com/apache/incubator-rya/tree/develop/extras/rya.pcj.fluo

On Wed, Mar 23, 2016 at 8:43 AM Brian McBride <br...@epimorphics.com> wrote:

> I'm looking for options for processing RDF data in a Hadoop-style cluster.
>
> My understanding from my reading so far is that, using Rya, I could have
> a quad store with the data partitioned across my cluster.  I could send
> SPARQL queries to Rya's SPARQL endpoint.
>
> I would like to do bulk, Hadoop-style processing of the RDF data in my
> quad store.
>
> I am wondering whether there is a supported interface (i.e. not delving
> under the hood) that would allow me to distribute my application across
> my cluster, with the processing happening close to the machines on which
> the data is held.
>
> Something that would return, for example, a Spark RDD of quads matching
> some filter.
>
> Brian
>
> --
> Epimorphics Ltd, http://www.epimorphics.com
> Registered address: Court Lodge, 105 High Street, Portishead, Bristol BS20
> 6PT
> Epimorphics Ltd. is a limited company registered in England (number
> 7016688)
>
>

Re: Spark RDD

Posted by Brian McBride <br...@epimorphics.com>.
Aaron replied (I can see his reply in the mail archive, but it has not 
reached my in-tray yet):

[[

You should take a look at the Pig script processing in Rya [1][2] and the
Fluo work that was just checked in [3].

The accumulo.pig module transforms a SPARQL query into a Pig script, which
later becomes a series of MapReduce jobs.

The rya.pcj.fluo module is very similar, but uses a mini-batch approach to
incrementally update results.

We'd greatly appreciate a Spark/RDD implementation, if you care to work on
one :)
]]

Thanks for the pointers, Aaron.  I will look at those.

Re the Spark/RDD implementation: it's not impossible that I might have a 
go at that, but I wouldn't hold your breath.  My current task is 
short-term, and it's not clear what happens next or at what pace.

Brian


On 23/03/16 11:09, Brian McBride wrote:
> I'm looking for options for processing RDF data in a Hadoop-style 
> cluster.
>
> My understanding from my reading so far is that, using Rya, I could 
> have a quad store with the data partitioned across my cluster.  I 
> could send SPARQL queries to Rya's SPARQL endpoint.
>
> I would like to do bulk, Hadoop-style processing of the RDF data in 
> my quad store.
>
> I am wondering whether there is a supported interface (i.e. not 
> delving under the hood) that would allow me to distribute my 
> application across my cluster, with the processing happening close 
> to the machines on which the data is held.
>
> Something that would return, for example, a Spark RDD of quads 
> matching some filter.
>
> Brian
>

-- 
Epimorphics Ltd, http://www.epimorphics.com
Registered address: Court Lodge, 105 High Street, Portishead, Bristol BS20 6PT
Epimorphics Ltd. is a limited company registered in England (number 7016688)