Posted to user@flink.apache.org by Ritesh Kumar Singh <ri...@gmail.com> on 2016/04/06 00:51:07 UTC

Accessing RDF triples using Flink

Hi,

I need some suggestions regarding accessing RDF triples from Flink. I'm
trying to integrate Flink into a pipeline where the input for Flink comes
from a SPARQL query on a Jena model. After modifying the triples with
Flink, I will perform a SPARQL update using Jena to save my changes.

   - Is there a recommended input format for loading the triples into
   Flink?
   - Would this use case be classified as a Flink streaming job or a batch
   processing job?
   - How will loading of the dataset vary with the input size?
   - Are there any recommended packages/projects for this type of
   project?

Any suggestions would be of great help.

Regards,
Ritesh
https://riteshtoday.wordpress.com/

Re: Accessing RDF triples using Flink

Posted by Flavio Pompermaier <po...@okkam.it>.
Hi Ritesh,
Jena provides NQuadsInputFormat, which is a Hadoop InputFormat, so you can
read the data efficiently with Flink. Unfortunately, I remember having
some problems using it, so I just export my Jena model as N-Quads and then
parse it efficiently with Flink as a text file. However, parsing with
Sesame 4 is more efficient in terms of speed and garbage collection.

What I do is convert every quad into a Tuple5, group the triples/quads by
subject, and then apply some logic. The quads grouped by subject are what we
call an "entiton atom", and combining them leads to an "entiton molecule"
(i.e. a graph rooted in some entiton atom).
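
A minimal sketch of the two steps described above (parse each N-Quads line
into a 5-field tuple, then group by subject), written in plain Python rather
than Flink so that it runs on its own. The tuple layout (subject, predicate,
object, object kind, graph) and the names parse_quad and group_by_subject are
assumptions made for illustration, not the code used in the project; the
regular expression covers only the common N-Quads forms, not the full grammar.

```python
import re
from collections import defaultdict

# Simplified N-Quads terms (assumption: enough for typical exports,
# not a complete implementation of the W3C grammar).
_IRI = r'<[^<>\s]*>'
_BNODE = r'_:\S+'
_LITERAL = r'"(?:[^"\\]|\\.)*"(?:@[A-Za-z0-9-]+|\^\^<[^<>\s]*>)?'

_QUAD = re.compile(
    rf'^\s*(?P<s>{_IRI}|{_BNODE})\s+'
    rf'(?P<p>{_IRI})\s+'
    rf'(?P<o>{_IRI}|{_BNODE}|{_LITERAL})\s*'
    rf'(?P<g>{_IRI}|{_BNODE})?\s*\.\s*$'
)


def parse_quad(line):
    """Turn one N-Quads line into (subject, predicate, object, kind, graph).

    Returns None for blank lines and comments. The graph field is None
    when the statement has no graph label (default graph).
    """
    stripped = line.strip()
    if not stripped or stripped.startswith('#'):
        return None
    match = _QUAD.match(stripped)
    if match is None:
        raise ValueError(f"not a valid N-Quads statement: {line!r}")
    obj = match.group('o')
    if obj.startswith('<'):
        kind = 'iri'
    elif obj.startswith('_:'):
        kind = 'bnode'
    else:
        kind = 'literal'
    return (match.group('s'), match.group('p'), obj, kind, match.group('g'))


def group_by_subject(lines):
    """Collect all parsed quads that share a subject into one list."""
    groups = defaultdict(list)
    for line in lines:
        quad = parse_quad(line)
        if quad is not None:
            groups[quad[0]].append(quad)
    return dict(groups)
```

In a Flink job the same idea would presumably map onto reading the export as
a text file, mapping each line to a Tuple5, and grouping on the subject
field; the per-subject list here plays the role of one group.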

We presented our work at FlinkForward 2015 in Berlin:
http://www.slideshare.net/FlinkForward/s-bartoli-f-popmermaier-a-semantic-big-data-companion
If you need some code that reads N-Quads with Flink, I can share it with
you; just write to me privately!

Best,
Flavio

On Wed, Apr 6, 2016 at 3:57 PM, Ritesh Kumar Singh <
riteshoneinamillion@gmail.com> wrote:

> Hi Flavio,
>
>    1. How do you access your RDF dataset via Flink? Are you reading it as
>    a normal input file and splitting the records, or do you have some
>    wrappers in place to convert the RDF data into triples? Can you please
>    share some code samples if possible?
>    2. I am using the Jena TDB command line utilities to query the dataset
>    in order to avoid Java garbage collection issues. I am also using the
>    Jena Java APIs as a dependency, but the command line utils are way
>    faster (though this comes with the extra requirement of having the Jena
>    command line utils installed on the system). The main reason for this
>    approach is being able to pass the string output from the command line
>    to Flink as part of my pipeline. Can you tell me your approach to this?
>    3. Should I dump my query output to a file and then consume it as a
>    normal input source for Flink?
>
>
> Basically, any help with this would be appreciated.
>
> Regards,
> Ritesh

Re: Accessing RDF triples using Flink

Posted by Ritesh Kumar Singh <ri...@gmail.com>.
Hi Flavio,

   1. How do you access your RDF dataset via Flink? Are you reading it as a
   normal input file and splitting the records, or do you have some wrappers
   in place to convert the RDF data into triples? Can you please share some
   code samples if possible?
   2. I am using the Jena TDB command line utilities to query the dataset
   in order to avoid Java garbage collection issues. I am also using the
   Jena Java APIs as a dependency, but the command line utils are way
   faster (though this comes with the extra requirement of having the Jena
   command line utils installed on the system). The main reason for this
   approach is being able to pass the string output from the command line
   to Flink as part of my pipeline. Can you tell me your approach to this?
   3. Should I dump my query output to a file and then consume it as a
   normal input source for Flink?


Basically, any help with this would be appreciated.

Regards,
Ritesh




On Wed, Apr 6, 2016 at 2:45 PM, Flavio Pompermaier <po...@okkam.it>
wrote:

> Hi Ritesh,
> I have some experience with RDF and Flink. What do you mean by accessing
> a Jena model? How do you create it?
>
> From my experience, reading triples from Jena models is evil because it
> has some problems with garbage collection.

Re: Accessing RDF triples using Flink

Posted by Flavio Pompermaier <po...@okkam.it>.
Hi Ritesh,
I have some experience with RDF and Flink. What do you mean by accessing a
Jena model? How do you create it?

From my experience, reading triples from Jena models is evil because it has
some problems with garbage collection.