Posted to dev@rya.apache.org by Matteo Cossu <el...@gmail.com> on 2017/08/25 12:45:16 UTC

Loading tab spaced data

Hello,
I am having problems loading data with the provided MapReduce code.
I am using this class: *org.apache.rya.accumulo.mr.tools.RdfFileInputTool*.
When my input data is in N-Triples format and the terms of each triple are
separated by tabs instead of spaces, I get this error:

*org.openrdf.rio.RDFParseException: Expected '<', found: m*

I worked around it by replacing all the tabs with spaces in my input data,
but since tabs are a valid separator in the N-Triples format, I think this
should be handled (or fixed) directly within the tool.

Kind Regards,
Matteo Cossu

Re: Loading tab spaced data

Posted by Puja Valiyil <pu...@gmail.com>.

Hi Matteo,
Rya delegates parsing of input RDF files to the RDF parsers provided by Sesame/OpenRDF. So the issue is due to a bug in the OpenRDF/Sesame parser; it looks like the parser doesn't like tabs. Upgrading Rya to the latest release of OpenRDF might solve the issue. Aaron has brought that up as something to do, but no one has started work on it yet, since it would mean several non-trivial changes.
Hope this helps!

> On Aug 28, 2017, at 8:03 PM, Matteo Cossu <el...@gmail.com> wrote:
> 
> I would like to help, but I still can't even test Rya properly. I'm
> developing a similar system (using Spark SQL) for research, and I wanted to
> compare my software's performance with Rya on the university cluster.
> When I try to use these Rya tools to load the big datasets, the job always
> crashes (mostly out-of-memory problems) and never completes the loading.
> At the moment I urgently need to publish some results, so for now I am
> comparing my software with other systems.
> Later, I could come back to Rya and try to fix some bugs along the way :P
> 
> Best Regards,
> Matteo Cossu
> 

Re: Loading tab spaced data

Posted by Matteo Cossu <el...@gmail.com>.

Hello,
I fixed my loading problem: I had not given the Accumulo servers enough
memory!
Since I still have time, I would like to use Rya for comparison in my
research.
I read that I should use the Prospects Table; do you have any other
suggestions on how to get the best query performance?
The results could be interesting for Rya. I am using the WatDiv
<http://dsg.uwaterloo.ca/watdiv/> test suite, which contains several types
of queries, so it will be easier to identify where Rya performs better and
where (or if) there is room for improvement.

Best Regards,
Matteo Cossu

On 29 August 2017 at 17:12, Matteo Cossu <el...@gmail.com> wrote:

> 1 billion triples. I don't have a stack trace, but I'll try to get one
> next week.
> Now I'm worried that the problem was caused by some mistake of mine and
> I'm wasting your time :)
> Anyway, I don't convert anything from Parquet because that is managed by
> Hadoop, and the file is in N-Triples format already.

Re: Loading tab spaced data

Posted by Matteo Cossu <el...@gmail.com>.

1 billion triples. I don't have a stack trace, but I'll try to get one
next week.
Now I'm worried that the problem was caused by some mistake of mine and I'm
wasting your time :)
Anyway, I don't convert anything from Parquet because that is managed by
Hadoop, and the file is in N-Triples format already.


On 29 August 2017 at 16:53, Meier, Caleb <Ca...@parsons.com> wrote:

> Hey Matteo,
>
> Do you know offhand how many triples are included in your dataset?  Also,
> can you send a stack trace?  How are you converting your Parquet file to
> one of the formats supported by the RdfIngestTool (n-triples, trig, ...)?
>
> Caleb A. Meier, Ph.D.
> Senior Software Engineer ♦ Analyst
> Parsons Corporation
> 1911 N. Fort Myer Drive, Suite 800 ♦ Arlington, VA 22209
> Office:  (703)797-3066
> Caleb.Meier@Parsons.com ♦ www.parsons.com

RE: Loading tab spaced data

Posted by "Meier, Caleb" <Ca...@parsons.com>.

Hey Matteo, 

Do you know offhand how many triples are included in your dataset?  Also, can you send a stack trace?  How are you converting your Parquet file to one of the formats supported by the RdfIngestTool (n-triples, trig, ...)?

Caleb A. Meier, Ph.D.
Senior Software Engineer ♦ Analyst
Parsons Corporation
1911 N. Fort Myer Drive, Suite 800 ♦ Arlington, VA 22209
Office:  (703)797-3066
Caleb.Meier@Parsons.com ♦ www.parsons.com

-----Original Message-----
From: Matteo Cossu [mailto:elcossu@gmail.com] 
Sent: Tuesday, August 29, 2017 10:36 AM
To: dev@rya.incubator.apache.org
Subject: Re: Loading tab spaced data

Hello Caleb,
I was trying to load a 53GB file (in Parquet format) with 10 containers, each assigned 15GB of memory.
Does anyone have some reference numbers, such as how big a dataset can be with these resources?
This would help me tell when the problem is entirely mine, which is likely, since I'm still a novice with many of the tools I'm using (Accumulo, for example).

Thank you all for the answers,
Matteo Cossu



Re: Loading tab spaced data

Posted by Matteo Cossu <el...@gmail.com>.

Hello Caleb,
I was trying to load a 53GB file (in Parquet format) with 10 containers,
each assigned 15GB of memory.
Does anyone have some reference numbers, such as how big a dataset can be
with these resources?
This would help me tell when the problem is entirely mine, which is
likely, since I'm still a novice with many of the tools I'm using
(Accumulo, for example).

Thank you all for the answers,
Matteo Cossu


On 29 August 2017 at 16:12, Meier, Caleb <Ca...@parsons.com> wrote:

> Hello Matteo,
>
> Were you using the MapReduce ingest tool when you were running out of
> memory?  If so, do you know how big the file was that you were ingesting,
> how many containers YARN allocated to your job, and how much memory was
> allocated to each container?
>
> Caleb A. Meier, Ph.D.
> Senior Software Engineer ♦ Analyst
> Parsons Corporation
> 1911 N. Fort Myer Drive, Suite 800 ♦ Arlington, VA 22209
> Office:  (703)797-3066
> Caleb.Meier@Parsons.com ♦ www.parsons.com

RE: Loading tab spaced data

Posted by "Meier, Caleb" <Ca...@parsons.com>.

Hello Matteo,

Were you using the MapReduce ingest tool when you were running out of memory?  If so, do you know how big the file was that you were ingesting, how many containers YARN allocated to your job, and how much memory was allocated to each container?

Caleb A. Meier, Ph.D.
Senior Software Engineer ♦ Analyst
Parsons Corporation
1911 N. Fort Myer Drive, Suite 800 ♦ Arlington, VA 22209
Office:  (703)797-3066
Caleb.Meier@Parsons.com ♦ www.parsons.com

-----Original Message-----
From: Matteo Cossu [mailto:elcossu@gmail.com] 
Sent: Monday, August 28, 2017 8:04 PM
To: dev@rya.incubator.apache.org
Subject: Re: Loading tab spaced data

I would like to help, but I still can't even test Rya properly. I'm developing a similar system (using Spark SQL) for research, and I wanted to compare my software's performance with Rya on the university cluster.
When I try to use these Rya tools to load the big datasets, the job always crashes (mostly out-of-memory problems) and never completes the loading. At the moment I urgently need to publish some results, so for now I am comparing my software with other systems.
Later, I could come back to Rya and try to fix some bugs along the way :P

Best Regards,
Matteo Cossu


Re: Loading tab spaced data

Posted by Matteo Cossu <el...@gmail.com>.

I would like to help, but I still can't even test Rya properly. I'm
developing a similar system (using Spark SQL) for research, and I wanted to
compare my software's performance with Rya on the university cluster.
When I try to use these Rya tools to load the big datasets, the job always
crashes (mostly out-of-memory problems) and never completes the loading.
At the moment I urgently need to publish some results, so for now I am
comparing my software with other systems.
Later, I could come back to Rya and try to fix some bugs along the way :P

Best Regards,
Matteo Cossu

On 29 August 2017 at 01:29, Josh Elser <el...@apache.org> wrote:

> Hi Matteo,
>
> Thanks for the bug-report. Do you have an interest in making the change to
> Rya to address this issue? :)
>
> In open source projects, we like to encourage users to make changes to
> "scratch their own itch". Please let us know how we can help enable you to
> make this change.

Re: Loading tab spaced data

Posted by Josh Elser <el...@apache.org>.

Hi Matteo,

Thanks for the bug-report. Do you have an interest in making the change 
to Rya to address this issue? :)

In open source projects, we like to encourage users to make changes to 
"scratch their own itch". Please let us know how we can help enable you 
to make this change.

On 8/25/17 8:45 AM, Matteo Cossu wrote:
> Hello,
> I am having problems loading data with the provided MapReduce code.
> I am using this class: *org.apache.rya.accumulo.mr.tools.RdfFileInputTool*.
> When my input data is in N-Triples format and the terms of each triple are
> separated by tabs instead of spaces, I get this error:
> 
> *org.openrdf.rio.RDFParseException: Expected '<', found: m*
> 
> I worked around it by replacing all the tabs with spaces in my input data,
> but since tabs are a valid separator in the N-Triples format, I think this
> should be handled (or fixed) directly within the tool.
> 
> Kind Regards,
> Matteo Cossu
>