Posted to user@crunch.apache.org by Suraj Satishkumar Sheth <su...@adobe.com> on 2014/05/28 16:11:01 UTC

Issue with AvroPathPerKeyTarget in Crunch while writing data to multiple files for each of the keys of the PTable

Hi,
We have a use case where we have a PTable with 30 keys and millions of values per key. We want to write the values for each key into separate files.
Although creating 30 different PTables using filter and then writing each of them to HDFS works for us, it is highly inefficient.
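
(For reference, the filter-based approach we use today looks roughly like the sketch below. The keys list, the MyRecord Avro class, the table/outPath variables and the output layout are placeholders rather than our actual code; the point is that each filter/write pair becomes its own pass over the full data set.)

// Rough sketch of the filter-per-key baseline; names are placeholders.
// Assumes imports of org.apache.crunch.FilterFn, org.apache.crunch.PCollection,
// org.apache.crunch.PTable, org.apache.crunch.Pair and org.apache.crunch.io.To,
// plus a PTable<String, MyRecord> named table, a List<String> keys and a String outPath.
for (String key : keys) {
  final String k = key;
  PCollection<MyRecord> oneKey = table
      .filter(new FilterFn<Pair<String, MyRecord>>() {
        @Override
        public boolean accept(Pair<String, MyRecord> kv) {
          return k.equals(kv.first());   // keep only records for this key
        }
      })
      .values();                         // drop the key, keep the Avro records
  oneKey.write(To.avroFile(outPath + "/" + k));  // one output directory per key
}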

I have been trying to write the data from a PTable into multiple files, one set of files per key, using AvroPathPerKeyTarget.

So, the usage is something like this :
finalRecords.groupByKey().write(new AvroPathPerKeyTarget(outPath));

where finalRecords is a PTable whose keys are Strings and values are Avro records
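
(For context, a fuller sketch of the pipeline is below. The driver class, input path, Avro specific-record class MyRecord and the key-extraction field are illustrative assumptions, not our actual code; only the final groupByKey().write(new AvroPathPerKeyTarget(outPath)) call is the usage shown above.)

import org.apache.crunch.MapFn;
import org.apache.crunch.PCollection;
import org.apache.crunch.PTable;
import org.apache.crunch.Pipeline;
import org.apache.crunch.impl.mr.MRPipeline;
import org.apache.crunch.io.From;
import org.apache.crunch.io.avro.AvroPathPerKeyTarget;
import org.apache.crunch.types.avro.Avros;
import org.apache.hadoop.conf.Configuration;

public class PathPerKeyDriver {
  public static void main(String[] args) {
    String inPath = args[0];
    String outPath = args[1];

    Pipeline pipeline = new MRPipeline(PathPerKeyDriver.class, new Configuration());

    // Read the Avro records; MyRecord stands in for the real specific record class.
    PCollection<MyRecord> records =
        pipeline.read(From.avroFile(inPath, Avros.specifics(MyRecord.class)));

    // Key each record by the string field that should drive the output directory
    // (getCategory() is an assumed accessor, not a real field from this thread).
    PTable<String, MyRecord> finalRecords = records.by(new MapFn<MyRecord, String>() {
      @Override
      public String map(MyRecord record) {
        return record.getCategory().toString();
      }
    }, Avros.strings());

    // One output directory per key under outPath.
    finalRecords.groupByKey().write(new AvroPathPerKeyTarget(outPath));

    pipeline.done();
  }
}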

It is verified that the data contains exactly 30 unique keys. The amount of data is a few million records for some keys and a few thousand for others.

Expectation : It will divide the data into 30 parts and write them to the specified location in HDFS, creating a directory for each key. We will be able to read the data back as a PCollection<Avro> later for our next job.

Issue : It is able to create 30 different directories for the keys, and all the directories contain data of non-zero size.
       But occasionally a few files get corrupted. When we try to read them into a PCollection<Avro> and use them, it throws an error :
       Caused by: java.io.IOException: Invalid sync!

Symptoms : The issue occurs intermittently. It occurs once in 3-4 runs, and only one or two files among the 30 get corrupted in that run.
            The file size of the corrupted Avro file is either much higher or much lower than expected. E.g. if we are expecting a file of 100MB, we get a file of 30MB or 250MB when it is corrupted by AvroPathPerKeyTarget.

We increased the number of reducers to 500 so that no two keys (among the 30 keys) go to the same reducer. In spite of this change, we still see the error.
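
(The reducer count is set on the grouping step, roughly as in the sketch below; the exact call in our job is not shown in this mail, so treat both forms as illustrative.)

// Illustrative only; assumes the finalRecords PTable and outPath from above,
// plus an import of org.apache.crunch.GroupingOptions for the second form.
finalRecords.groupByKey(500).write(new AvroPathPerKeyTarget(outPath));

// Equivalent form using GroupingOptions:
GroupingOptions opts = GroupingOptions.builder().numReducers(500).build();
finalRecords.groupByKey(opts).write(new AvroPathPerKeyTarget(outPath));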

Any ideas, suggestions for fixing this issue, or explanations of it would be helpful.


Thanks and Regards,
Suraj Sheth

Re: Issue with AvroPathPerKeyTarget in Crunch while writing data to multiple files for each of the keys of the PTable

Posted by Som Satpathy <so...@gmail.com>.
Hi Gabriel,

Thanks for the quick response. Sure, let me try a build from the current head
of the 0.8 branch to see if the CRUNCH-316 fix resolves it.

Thanks,
Som


On Thu, May 29, 2014 at 11:43 AM, Gabriel Reid <ga...@gmail.com>
wrote:

> Hey Som,
>
> No, no need for a custom partitioner or special GroupByOptions when
> you're using the AvroPathPerKeyTarget. As you probably know, it's
> definitely a good idea to have all values under the same key next to
> each other in the PTable that is being output.
>
> Any chance you could try this with a build from the current head of
> the 0.8 branch? It's named apache-crunch-0.8 in git. This really
> sounds like it's related to CRUNCH-316, so it would be good if we
> could check if that fix corrects this issue or not.
>
> - Gabriel
>
>
> On Thu, May 29, 2014 at 7:46 PM, Som Satpathy <so...@gmail.com>
> wrote:
> > Hi Josh/Gabriel,
> >
> > This problem has been confounding us for a while. Do we need to pass a
> > custom Partitioner or pass specific GroupByOptions into the groupBy to
> make
> > it work with the AvroPathPerKeyTarget? I assume there is no need for
> that.
> >
> > Thanks,
> > Som
> >
> >
> > On Wed, May 28, 2014 at 7:46 AM, Suraj Satishkumar Sheth
> > <su...@adobe.com> wrote:
> >>
> >> Hi Josh,
> >>
> >> Thanks for the quick response
> >>
> >>
> >>
> >> Here are the logs :
> >>
> >> org.apache.avro.AvroRuntimeException: java.io.IOException: Invalid sync!
> >> at org.apache.avro.file.DataFileStream.hasNext(DataFileStream.java:210)
> at
> >>
> org.apache.crunch.types.avro.AvroRecordReader.nextKeyValue(AvroRecordReader.java:66)
> >> at
> >>
> org.apache.crunch.impl.mr.run.CrunchRecordReader.nextKeyValue(CrunchRecordReader.java:157)
> >> at
> >>
> org.apache.hadoop.mapred.MapTask$NewTrackingRecordReader.nextKeyValue(MapTask.java:483)
> >> at
> >>
> org.apache.hadoop.mapreduce.task.MapContextImpl.nextKeyValue(MapContextImpl.java:76)
> >> at
> >>
> org.apache.hadoop.mapreduce.lib.map.WrappedMapper$Context.nextKeyValue(WrappedMapper.java:85)
> >> at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:139) at
> >> org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:672) at
> >> org.apache.hadoop.mapred.MapTask.run(MapTask.java:330) at
> >> org.apache.hadoop.mapred.Child$4.run(Child.java:268) at
> >> java.security.AccessController.doPrivileged(Native Method) at
> >> javax.security.auth.Subject.doAs(Subject.java:415) at
> >> org.apache.hadoop.security.UserGroupInformation.d
> >>
> >>
> >>
> >> Even when we read the output of AvroPathPerKeyTarget into a PCollection
> >> and try to count the number of records in the PCollection, we get the
> same
> >> error.
> >>
> >> The strange thing is that this occurs rarely(once in 3-4 times) even
> when
> >> we try it on the same data multiple times.
> >>
> >>
> >>
> >>
> >>
> >> The versions being used :
> >>
> >> Avro – 1.7.5
> >>
> >> Crunch - 0.8.2-hadoop2
> >>
> >>
> >>
> >> Thanks and Regards,
> >>
> >> Suraj Sheth
> >>
> >>
> >>
> >> From: Josh Wills [mailto:jwills@cloudera.com]
> >> Sent: Wednesday, May 28, 2014 7:56 PM
> >> To: user@crunch.apache.org
> >> Subject: Re: Issue with AvroPathperKeyTarget in crunch while writing
> data
> >> to multiple files for each of the keys of the PTable
> >>
> >>
> >>
> >> That sounds super annoying. Which version are you using? There was this
> >> issue that is fixed in master, but not in any release yet. (I'm trying
> to
> >> get one out this week if at all possible.)
> >>
> >>
> >>
> >> https://issues.apache.org/jira/browse/CRUNCH-316
> >>
> >>
> >>
> >> Can you check your logs for that in-memory buffer error?
> >>
> >>
> >>
> >> On Wed, May 28, 2014 at 7:11 AM, Suraj Satishkumar Sheth
> >> <su...@adobe.com> wrote:
> >>
> >> Hi,
> >>
> >> We have a use case where we have a PTable which consists of 30 keys and
> >> millions of values per key. We want to write the values for each of the
> keys
> >> into separate files.
> >>
> >> Although, creating 30 different PTables using filter and then, writing
> >> each of them to HDFS is working for us, it is highly inefficient.
> >>
> >>
> >>
> >> I have been trying to write data from a PTable into multiple files
> >> corresponding to the values of the keys using AvroPathPerKeyTarget.
> >>
> >>
> >>
> >> So, the usage is something like this :
> >>
> >> finalRecords.groupByKey().write(new AvroPathPerKeyTarget(outPath));
> >>
> >>
> >>
> >> where finalRecords is a PTable whose keys are Strings and values are
> AVRO
> >> records
> >>
> >>
> >>
> >> It is verified that the data contains exactly 30 unique keys. The amount
> >> of data is a few millions for a few keys while a few thousands for a few
> >> other keys.
> >>
> >>
> >>
> >> Expectation : It will divide the data into 30 parts and write them to the
> >> specified place in HDFS creating a directory for each key. We will be
> able
> >> to read the data as a PCollection<Avro> later for our next job.
> >>
> >>
> >>
> >> Issue : It is able to create 30 different directories for the keys and
> all
> >> the directories have data of non-zero size.
> >>
> >>        But, occasionally, a few files get corrupted. When we try to read
> >> it into a PCollection<Avro> and try to use it, it throws an error :
> >>
> >>        Caused by: java.io.IOException: Invalid sync!
> >>
> >>
> >>
> >> Symptoms : The issue occurs intermittently. It occurs once in 3-4 runs
> and
> >> only one or two files among 30 get corrupted in that run.
> >>
> >>            The filesize of the corrupted Avro file is either much higher or
> >> much lower than expected. E.g. if we are expecting a file of 100MB, we will
> >> get a file of 30MB or 250MB if that is corrupted due to
> >> AvroPathPerKeyTarget.
> >>
> >>
> >>
> >> We increased the number of reducers to 500 so that no two keys (among the
> >> 30 keys) go to the same reducer. In spite of this change, we still saw
> >> the error.
> >>
> >>
> >>
> >> Any ideas/suggestions to fix this issue or explanation of this issue
> will
> >> be helpful.
> >>
> >>
> >>
> >>
> >>
> >> Thanks and Regards,
> >>
> >> Suraj Sheth
> >>
> >>
> >>
> >>
> >>
> >> --
> >>
> >> Director of Data Science
> >>
> >> Cloudera
> >>
> >> Twitter: @josh_wills
> >
> >
>

Re: Issue with AvroPathPerKeyTarget in Crunch while writing data to multiple files for each of the keys of the PTable

Posted by Gabriel Reid <ga...@gmail.com>.
Hey Som,

No, no need for a custom partitioner or special GroupByOptions when
you're using the AvroPathPerKeyTarget. As you probably know, it's
definitely a good idea to have all values under the same key next to
each other in the PTable that is being output.

Any chance you could try this with a build from the current head of
the 0.8 branch? It's named apache-crunch-0.8 in git. This really
sounds like it's related to CRUNCH-316, so it would be good if we
could check if that fix corrects this issue or not.

- Gabriel


On Thu, May 29, 2014 at 7:46 PM, Som Satpathy <so...@gmail.com> wrote:
> Hi Josh/Gabriel,
>
> This problem has been confounding us for a while. Do we need to pass a
> custom Partitioner or pass specific GroupByOptions into the groupBy to make
> it work with the AvroPathPerKeyTarget? I assume there is no need for that.
>
> Thanks,
> Som
>
>
> On Wed, May 28, 2014 at 7:46 AM, Suraj Satishkumar Sheth
> <su...@adobe.com> wrote:
>>
>> Hi Josh,
>>
>> Thanks for the quick response
>>
>>
>>
>> Here are the logs :
>>
>> org.apache.avro.AvroRuntimeException: java.io.IOException: Invalid sync!
>> at org.apache.avro.file.DataFileStream.hasNext(DataFileStream.java:210) at
>> org.apache.crunch.types.avro.AvroRecordReader.nextKeyValue(AvroRecordReader.java:66)
>> at
>> org.apache.crunch.impl.mr.run.CrunchRecordReader.nextKeyValue(CrunchRecordReader.java:157)
>> at
>> org.apache.hadoop.mapred.MapTask$NewTrackingRecordReader.nextKeyValue(MapTask.java:483)
>> at
>> org.apache.hadoop.mapreduce.task.MapContextImpl.nextKeyValue(MapContextImpl.java:76)
>> at
>> org.apache.hadoop.mapreduce.lib.map.WrappedMapper$Context.nextKeyValue(WrappedMapper.java:85)
>> at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:139) at
>> org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:672) at
>> org.apache.hadoop.mapred.MapTask.run(MapTask.java:330) at
>> org.apache.hadoop.mapred.Child$4.run(Child.java:268) at
>> java.security.AccessController.doPrivileged(Native Method) at
>> javax.security.auth.Subject.doAs(Subject.java:415) at
>> org.apache.hadoop.security.UserGroupInformation.d
>>
>>
>>
>> Even when we read the output of AvroPathPerKeyTarget into a PCollection
>> and try to count the number of records in the PCollection, we get the same
>> error.
>>
>> The strange thing is that this occurs rarely(once in 3-4 times) even when
>> we try it on the same data multiple times.
>>
>>
>>
>>
>>
>> The versions being used :
>>
>> Avro – 1.7.5
>>
>> Crunch - 0.8.2-hadoop2
>>
>>
>>
>> Thanks and Regards,
>>
>> Suraj Sheth
>>
>>
>>
>> From: Josh Wills [mailto:jwills@cloudera.com]
>> Sent: Wednesday, May 28, 2014 7:56 PM
>> To: user@crunch.apache.org
>> Subject: Re: Issue with AvroPathperKeyTarget in crunch while writing data
>> to multiple files for each of the keys of the PTable
>>
>>
>>
>> That sounds super annoying. Which version are you using? There was this
>> issue that is fixed in master, but not in any release yet. (I'm trying to
>> get one out this week if at all possible.)
>>
>>
>>
>> https://issues.apache.org/jira/browse/CRUNCH-316
>>
>>
>>
>> Can you check your logs for that in-memory buffer error?
>>
>>
>>
>> On Wed, May 28, 2014 at 7:11 AM, Suraj Satishkumar Sheth
>> <su...@adobe.com> wrote:
>>
>> Hi,
>>
>> We have a use case where we have a PTable which consists of 30 keys and
>> millions of values per key. We want to write the values for each of the keys
>> into separate files.
>>
>> Although, creating 30 different PTables using filter and then, writing
>> each of them to HDFS is working for us, it is highly inefficient.
>>
>>
>>
>> I have been trying to write data from a PTable into multiple files
>> corresponding to the values of the keys using AvroPathPerKeyTarget.
>>
>>
>>
>> So, the usage is something like this :
>>
>> finalRecords.groupByKey().write(new AvroPathPerKeyTarget(outPath));
>>
>>
>>
>> where finalRecords is a PTable whose keys are Strings and values are AVRO
>> records
>>
>>
>>
>> It is verified that the data contains exactly 30 unique keys. The amount
>> of data is a few millions for a few keys while a few thousands for a few
>> other keys.
>>
>>
>>
>> Expectation : It will divide the data into 30 parts and write them to the
>> specified place in HDFS creating a directory for each key. We will be able
>> to read the data as a PCollection<Avro> later for our next job.
>>
>>
>>
>> Issue : It is able to create 30 different directories for the keys and all
>> the directories have data of non-zero size.
>>
>>        But, occasionally, a few files get corrupted. When we try to read
>> it into a PCollection<Avro> and try to use it, it throws an error :
>>
>>        Caused by: java.io.IOException: Invalid sync!
>>
>>
>>
>> Symptoms : The issue occurs intermittently. It occurs once in 3-4 runs and
>> only one or two files among 30 get corrupted in that run.
>>
>>            The filesize of the corrupted Avro file is either much higher or
>> much lower than expected. E.g. if we are expecting a file of 100MB, we will
>> get a file of 30MB or 250MB if that is corrupted due to
>> AvroPathPerKeyTarget.
>>
>>
>>
>> We increased the number of reducers to 500 so that no two keys (among the
>> 30 keys) go to the same reducer. In spite of this change, we still saw
>> the error.
>>
>>
>>
>> Any ideas/suggestions to fix this issue or explanation of this issue will
>> be helpful.
>>
>>
>>
>>
>>
>> Thanks and Regards,
>>
>> Suraj Sheth
>>
>>
>>
>>
>>
>> --
>>
>> Director of Data Science
>>
>> Cloudera
>>
>> Twitter: @josh_wills
>
>

Re: Issue with AvroPathPerKeyTarget in Crunch while writing data to multiple files for each of the keys of the PTable

Posted by Som Satpathy <so...@gmail.com>.
Hi Josh/Gabriel,

This problem has been confounding us for a while. Do we need to pass a
custom Partitioner or pass specific GroupByOptions into the groupBy to make
it work with the AvroPathPerKeyTarget? I assume there is no need for that.
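
(For clarity, explicit grouping options would look roughly like the sketch below; the partitioner shown is just Hadoop's default HashPartitioner and the variable names are placeholders, so this is only an illustration of what the question above is asking about.)

// Illustrative only; assumes a PTable<String, MyRecord> named finalRecords, a String outPath,
// and imports of org.apache.crunch.GroupingOptions, org.apache.crunch.io.avro.AvroPathPerKeyTarget
// and org.apache.hadoop.mapreduce.lib.partition.HashPartitioner.
GroupingOptions opts = GroupingOptions.builder()
    .partitionerClass(HashPartitioner.class)  // Hadoop's default partitioner, shown for illustration
    .numReducers(30)                          // e.g. one reducer per key
    .build();
finalRecords.groupByKey(opts).write(new AvroPathPerKeyTarget(outPath));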

Thanks,
Som


On Wed, May 28, 2014 at 7:46 AM, Suraj Satishkumar Sheth <surajsat@adobe.com
> wrote:

>  Hi Josh,
>
> Thanks for the quick response
>
>
>
> Here are the logs :
>
> org.apache.avro.AvroRuntimeException: java.io.IOException: Invalid sync!
> at org.apache.avro.file.DataFileStream.hasNext(DataFileStream.java:210) at
> org.apache.crunch.types.avro.AvroRecordReader.nextKeyValue(AvroRecordReader.java:66)
> at
> org.apache.crunch.impl.mr.run.CrunchRecordReader.nextKeyValue(CrunchRecordReader.java:157)
> at
> org.apache.hadoop.mapred.MapTask$NewTrackingRecordReader.nextKeyValue(MapTask.java:483)
> at
> org.apache.hadoop.mapreduce.task.MapContextImpl.nextKeyValue(MapContextImpl.java:76)
> at
> org.apache.hadoop.mapreduce.lib.map.WrappedMapper$Context.nextKeyValue(WrappedMapper.java:85)
> at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:139) at
> org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:672) at
> org.apache.hadoop.mapred.MapTask.run(MapTask.java:330) at
> org.apache.hadoop.mapred.Child$4.run(Child.java:268) at
> java.security.AccessController.doPrivileged(Native Method) at
> javax.security.auth.Subject.doAs(Subject.java:415) at
> org.apache.hadoop.security.UserGroupInformation.d
>
>
>
> Even when we read the output of AvroPathPerKeyTarget into a PCollection
> and try to count the number of records in the PCollection, we get the same
> error.
>
> The strange thing is that this occurs rarely(once in 3-4 times) even when
> we try it on the same data multiple times.
>
>
>
>
>
> The versions being used :
>
> Avro – 1.7.5
>
> Crunch - 0.8.2-hadoop2
>
>
>
> Thanks and Regards,
>
> Suraj Sheth
>
>
>
> From: Josh Wills [mailto:jwills@cloudera.com]
> Sent: Wednesday, May 28, 2014 7:56 PM
> To: user@crunch.apache.org
> Subject: Re: Issue with AvroPathperKeyTarget in crunch while writing
> data to multiple files for each of the keys of the PTable
>
>
>
> That sounds super annoying. Which version are you using? There was this
> issue that is fixed in master, but not in any release yet. (I'm trying to
> get one out this week if at all possible.)
>
>
>
> https://issues.apache.org/jira/browse/CRUNCH-316
>
>
>
> Can you check your logs for that in-memory buffer error?
>
>
>
> On Wed, May 28, 2014 at 7:11 AM, Suraj Satishkumar Sheth <
> surajsat@adobe.com> wrote:
>
> Hi,
>
> We have a use case where we have a PTable which consists of 30 keys and
> millions of values per key. We want to write the values for each of the
> keys into separate files.
>
> Although, creating 30 different PTables using filter and then, writing
> each of them to HDFS is working for us, it is highly inefficient.
>
>
>
> I have been trying to write data from a PTable into multiple files
> corresponding to the values of the keys using AvroPathPerKeyTarget.
>
>
>
> So, the usage is something like this :
>
> finalRecords.groupByKey().write(new AvroPathPerKeyTarget(outPath));
>
>
>
> where finalRecords is a PTable whose keys are Strings and values are AVRO
> records
>
>
>
> It is verified that the data contains exactly 30 unique keys. The amount
> of data is a few millions for a few keys while a few thousands for a few
> other keys.
>
>
>
> Expectation : It will divide the data into 30 parts and write them to the
> specified place in HDFS creating a directory for each key. We will be able
> to read the data as a PCollection<Avro> later for our next job.
>
>
>
> Issue : It is able to create 30 different directories for the keys and all
> the directories have data of non-zero size.
>
>        But, occasionally, a few files get corrupted. When we try to read
> it into a PCollection<Avro> and try to use it, it throws an error :
>
>        Caused by: java.io.IOException: Invalid sync!
>
>
>
> Symptoms : The issue occurs intermittently. It occurs once in 3-4 runs
> and only one or two files among 30 get corrupted in that run.
>
>            The filesize of the corrupted Avro file is either much higher or
> much lower than expected. E.g. if we are expecting a file of 100MB, we will
> get a file of 30MB or 250MB if that is corrupted due to
> AvroPathPerKeyTarget.
>
>
>
> We increased the number of reducers to 500 so that no two keys (among the
> 30 keys) go to the same reducer. In spite of this change, we still saw
> the error.
>
>
>
> Any ideas/suggestions to fix this issue or explanation of this issue will
> be helpful.
>
>
>
>
>
> Thanks and Regards,
>
> Suraj Sheth
>
>
>
>
>
> --
>
> Director of Data Science
>
> Cloudera <http://www.cloudera.com>
>
> Twitter: @josh_wills <http://twitter.com/josh_wills>
>

RE: Issue with AvroPathPerKeyTarget in Crunch while writing data to multiple files for each of the keys of the PTable

Posted by Suraj Satishkumar Sheth <su...@adobe.com>.
Adding user@avro

From: Suraj Satishkumar Sheth [mailto:surajsat@adobe.com]
Sent: Wednesday, May 28, 2014 8:17 PM
To: user@crunch.apache.org
Subject: RE: Issue with AvroPathperKeyTarget in crunch while writing data to multiple files for each of the keys of the PTable

Hi Josh,
Thanks for the quick response

Here are the logs :
org.apache.avro.AvroRuntimeException: java.io.IOException: Invalid sync!
    at org.apache.avro.file.DataFileStream.hasNext(DataFileStream.java:210)
    at org.apache.crunch.types.avro.AvroRecordReader.nextKeyValue(AvroRecordReader.java:66)
    at org.apache.crunch.impl.mr.run.CrunchRecordReader.nextKeyValue(CrunchRecordReader.java:157)
    at org.apache.hadoop.mapred.MapTask$NewTrackingRecordReader.nextKeyValue(MapTask.java:483)
    at org.apache.hadoop.mapreduce.task.MapContextImpl.nextKeyValue(MapContextImpl.java:76)
    at org.apache.hadoop.mapreduce.lib.map.WrappedMapper$Context.nextKeyValue(WrappedMapper.java:85)
    at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:139)
    at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:672)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:330)
    at org.apache.hadoop.mapred.Child$4.run(Child.java:268)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:415)
    at org.apache.hadoop.security.UserGroupInformation.d

Even when we read the output of AvroPathPerKeyTarget into a PCollection and try to count the number of records in the PCollection, we get the same error.
The strange thing is that this occurs rarely (once in 3-4 times), even when we try it on the same data multiple times.
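
(The read-back-and-count check is along the lines of the sketch below; the driver class, the MyRecord class and the key directory name are placeholders, not our actual code.)

// Illustrative sketch of reading one key's output back and counting its records.
// Assumes imports of org.apache.crunch.PCollection, org.apache.crunch.Pipeline,
// org.apache.crunch.impl.mr.MRPipeline, org.apache.crunch.io.From,
// org.apache.crunch.types.avro.Avros and org.apache.hadoop.conf.Configuration.
Pipeline readPipeline = new MRPipeline(ReadBackCheck.class, new Configuration());
PCollection<MyRecord> written = readPipeline.read(
    From.avroFile(outPath + "/someKey", Avros.specifics(MyRecord.class)));
// length() counts the records; on a corrupted file this is where "Invalid sync!" shows up.
long count = written.length().getValue();
readPipeline.done();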


The versions being used :
Avro – 1.7.5
Crunch - 0.8.2-hadoop2

Thanks and Regards,
Suraj Sheth

From: Josh Wills [mailto:jwills@cloudera.com]
Sent: Wednesday, May 28, 2014 7:56 PM
To: user@crunch.apache.org
Subject: Re: Issue with AvroPathperKeyTarget in crunch while writing data to multiple files for each of the keys of the PTable

That sounds super annoying. Which version are you using? There was this issue that is fixed in master, but not in any release yet. (I'm trying to get one out this week if at all possible.)

https://issues.apache.org/jira/browse/CRUNCH-316

Can you check your logs for that in-memory buffer error?

On Wed, May 28, 2014 at 7:11 AM, Suraj Satishkumar Sheth <su...@adobe.com> wrote:
Hi,
We have a use case where we have a PTable which consists of 30 keys and millions of values per key. We want to write the values for each of the keys into separate files.
Although, creating 30 different PTables using filter and then, writing each of them to HDFS is working for us, it is highly inefficient.

I have been trying to write data from a PTable into multiple files corresponding to the values of the keys using AvroPathPerKeyTarget.

So, the usage is something like this :
finalRecords.groupByKey().write(new AvroPathPerKeyTarget(outPath));

where finalRecords is a PTable whose keys are Strings and values are AVRO records

It is verified that the data contains exactly 30 unique keys. The amount of data is a few millions for a few keys while a few thousands for a few other keys.

Expectation : It will divide the data into 30 parts and write them to the specified place in HDFS creating a directory for each key. We will be able to read the data as a PCollection<Avro> later for our next job.

Issue : It is able to create 30 different directories for the keys and all the directories have data of non-zero size.
       But, occasionally, a few files get corrupted. When we try to read it into a PCollection<Avro> and try to use it, it throws an error :
       Caused by: java.io.IOException: Invalid sync!

Symptoms : The issue occurs intermittently. It occurs once in 3-4 runs and only one or two files among 30 get corrupted in that run.
            The filesize of the corrupted Avro file is either much higher or much lower than expected. E.g. if we are expecting a file of 100MB, we will get a file of 30MB or 250MB if that is corrupted due to AvroPathPerKeyTarget.

We increased the number of reducers to 500 so that no two keys (among the 30 keys) go to the same reducer. In spite of this change, we still see the error.

Any ideas/suggestions to fix this issue or explanation of this issue will be helpful.


Thanks and Regards,
Suraj Sheth



--
Director of Data Science
Cloudera<http://www.cloudera.com>
Twitter: @josh_wills<http://twitter.com/josh_wills>

RE: Issue with AvroPathPerKeyTarget in Crunch while writing data to multiple files for each of the keys of the PTable

Posted by Suraj Satishkumar Sheth <su...@adobe.com>.
Hi Josh,
Thanks for the quick response

Here are the logs :
org.apache.avro.AvroRuntimeException: java.io.IOException: Invalid sync!
    at org.apache.avro.file.DataFileStream.hasNext(DataFileStream.java:210)
    at org.apache.crunch.types.avro.AvroRecordReader.nextKeyValue(AvroRecordReader.java:66)
    at org.apache.crunch.impl.mr.run.CrunchRecordReader.nextKeyValue(CrunchRecordReader.java:157)
    at org.apache.hadoop.mapred.MapTask$NewTrackingRecordReader.nextKeyValue(MapTask.java:483)
    at org.apache.hadoop.mapreduce.task.MapContextImpl.nextKeyValue(MapContextImpl.java:76)
    at org.apache.hadoop.mapreduce.lib.map.WrappedMapper$Context.nextKeyValue(WrappedMapper.java:85)
    at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:139)
    at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:672)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:330)
    at org.apache.hadoop.mapred.Child$4.run(Child.java:268)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:415)
    at org.apache.hadoop.security.UserGroupInformation.d

Even when we read the output of AvroPathPerKeyTarget into a PCollection and try to count the number of records in the PCollection, we get the same error.
The strange thing is that this occurs rarely (once in 3-4 times), even when we try it on the same data multiple times.


The versions being used :
Avro – 1.7.5
Crunch - 0.8.2-hadoop2

Thanks and Regards,
Suraj Sheth

From: Josh Wills [mailto:jwills@cloudera.com]
Sent: Wednesday, May 28, 2014 7:56 PM
To: user@crunch.apache.org
Subject: Re: Issue with AvroPathperKeyTarget in crunch while writing data to multiple files for each of the keys of the PTable

That sounds super annoying. Which version are you using? There was this issue that is fixed in master, but not in any release yet. (I'm trying to get one out this week if at all possible.)

https://issues.apache.org/jira/browse/CRUNCH-316

Can you check your logs for that in-memory buffer error?

On Wed, May 28, 2014 at 7:11 AM, Suraj Satishkumar Sheth <su...@adobe.com> wrote:
Hi,
We have a use case where we have a PTable which consists of 30 keys and millions of values per key. We want to write the values for each of the keys into separate files.
Although, creating 30 different PTables using filter and then, writing each of them to HDFS is working for us, it is highly inefficient.

I have been trying to write data from a PTable into multiple files corresponding to the values of the keys using AvroPathPerKeyTarget.

So, the usage is something like this :
finalRecords.groupByKey().write(new AvroPathPerKeyTarget(outPath));

where finalRecords is a PTable whose keys are Strings and values are AVRO records

It is verified that the data contains exactly 30 unique keys. The amount of data is a few millions for a few keys while a few thousands for a few other keys.

Expectation : It will divide the data into 30 parts and write them to the specified place in HDFS creating a directory for each key. We will be able to read the data as a PCollection<Avro> later for our next job.

Issue : It is able to create 30 different directories for the keys and all the directories have data of non-zero size.
       But, occasionally, a few files get corrupted. When we try to read it into a PCollection<Avro> and try to use it, it throws an error :
       Caused by: java.io.IOException: Invalid sync!

Symptoms : The issue occurs intermittently. It occurs once in 3-4 runs and only one or two files among 30 get corrupted in that run.
            The filesize of the corrupted Avro file is either much higher or much lower than expected. E.g. if we are expecting a file of 100MB, we will get a file of 30MB or 250MB if that is corrupted due to AvroPathPerKeyTarget.

We increased the number of reducers to 500 so that no two keys (among the 30 keys) go to the same reducer. In spite of this change, we still see the error.

Any ideas/suggestions to fix this issue or explanation of this issue will be helpful.


Thanks and Regards,
Suraj Sheth



--
Director of Data Science
Cloudera<http://www.cloudera.com>
Twitter: @josh_wills<http://twitter.com/josh_wills>

Re: Issue with AvroPathPerKeyTarget in Crunch while writing data to multiple files for each of the keys of the PTable

Posted by Josh Wills <jw...@cloudera.com>.
That sounds super annoying. Which version are you using? There was this
issue that is fixed in master, but not in any release yet. (I'm trying to
get one out this week if at all possible.)

https://issues.apache.org/jira/browse/CRUNCH-316

Can you check your logs for that in-memory buffer error?


On Wed, May 28, 2014 at 7:11 AM, Suraj Satishkumar Sheth <surajsat@adobe.com
> wrote:

>  Hi,
>
> We have a use case where we have a PTable which consists of 30 keys and
> millions of values per key. We want to write the values for each of the
> keys into separate files.
>
> Although, creating 30 different PTables using filter and then, writing
> each of them to HDFS is working for us, it is highly inefficient.
>
>
>
> I have been trying to write data from a PTable into multiple files
> corresponding to the values of the keys using AvroPathPerKeyTarget.
>
>
>
> So, the usage is something like this :
>
> finalRecords.groupByKey().write(new AvroPathPerKeyTarget(outPath));
>
>
>
> where finalRecords is a PCollection of Avro
>
>
>
> It is verified that the data contains exactly 30 unique keys. The amount
> of data is a few millions for a few keys while a few thousands for a few
> other keys.
>
>
>
> Expectation : It will divide the data into 30 parts and write them to the
> specified place in HDFS creating a directory for each key. We will be able
> to read the data as a PCollection<Avro> later for our next job.
>
>
>
> Issue : It is able to create 30 different directories for the keys and all
> the directories have data of non-zero size.
>
>        But, occasionally, a few files get corrupted. When we try to read
> it into a PCollection<Avro> and try to use it, it throws an error :
>
>        Caused by: java.io.IOException: Invalid sync!
>
>
>
> Symptoms : The issue occurs intermittently. It occurs once in 3-4 runs
> and only one or two files among 30 get corrupted in that run.
>
>            The filesize of the corrupted Avro file is either much higher or
> much lower than expected. E.g. if we are expecting a file of 100MB, we will
> get a file of 30MB or 250MB if that is corrupted due to
> AvroPathPerKeyTarget.
>
>
>
> We increased the number of reducers to 500 so that no two keys (among the
> 30 keys) go to the same reducer. In spite of this change, we still saw
> the error.
>
>
>
> Any ideas/suggestions to fix this issue or explanation of this issue will
> be helpful.
>
>
>
>
>
> Thanks and Regards,
>
> Suraj Sheth
>



-- 
Director of Data Science
Cloudera <http://www.cloudera.com>
Twitter: @josh_wills <http://twitter.com/josh_wills>