Posted to dev@samza.apache.org by Benjamin Smith <be...@ranksoftwareinc.com> on 2016/06/16 16:52:09 UTC

Bug in SequenceFileHdfsFileWriter

Hello,

I am working on a project where we are integrating Samza and Hive. As part of this project, we ran into an issue where sequence files written from Samza were taking a long time (hours) to completely sync with HDFS.

After some Googling and digging into the code, it appears that the issue is here:
https://github.com/apache/samza/blob/master/samza-hdfs/src/main/scala/org/apache/samza/system/hdfs/writer/SequenceFileHdfsWriter.scala#L111

Writer.stream(dfs.create(path)) means the caller of dfs.create(path) is responsible for explicitly closing the created stream. SequenceFileHdfsWriter never does this, so its call to close only flushes the stream instead of closing it.

I believe the correct line should be:

Writer.file(path)

Or, SequenceFileHdfsWriter should explicitly track and close the stream.
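
For illustration, here is a rough sketch of the two call shapes (this is not the actual Samza source; the BytesWritable key/value classes and the helper names below are just placeholders I made up):

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.hadoop.io.{BytesWritable, SequenceFile}
import org.apache.hadoop.io.SequenceFile.Writer

// Current shape: the writer wraps a stream it did not open, so it never
// owns it; writer.close() only flushes, and the FSDataOutputStream from
// dfs.create(path) is left open.
def openWriterViaStream(dfs: FileSystem, conf: Configuration, path: Path): SequenceFile.Writer =
  SequenceFile.createWriter(
    conf,
    Writer.stream(dfs.create(path)),
    Writer.keyClass(classOf[BytesWritable]),
    Writer.valueClass(classOf[BytesWritable]))

// Suggested shape: Writer.file(path) lets the SequenceFile.Writer open the
// stream itself, so it owns the stream and closes it inside writer.close().
def openWriterViaFile(conf: Configuration, path: Path): SequenceFile.Writer =
  SequenceFile.createWriter(
    conf,
    Writer.file(path),
    Writer.keyClass(classOf[BytesWritable]),
    Writer.valueClass(classOf[BytesWritable]))

Per the SequenceFile source linked below, the file-based option is what marks the stream as owned by the writer, which is why close() actually closes it in that case.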

Thanks!

Ben

Reference material:
http://stackoverflow.com/questions/27916872/why-the-sequencefile-is-truncated
https://apache.googlesource.com/hadoop-common/+/HADOOP-6685/src/java/org/apache/hadoop/io/SequenceFile.java#1238

    

Re: Bug in SequenceFileHdfsFileWriter

Posted by Jagadish Venkatraman <ja...@gmail.com>.
Hi Benjamin,

SAMZA-968 <https://issues.apache.org/jira/browse/SAMZA-968> is already
assigned to you.

Thanks,
Jagadish

On Thu, Jun 16, 2016 at 10:51 AM, Benjamin Smith <
ben.smith@ranksoftwareinc.com> wrote:

> Sure, looks like a straightforward enough change.
>
>
> I've created: https://issues.apache.org/jira/browse/SAMZA-968
>
>
> I don't see any way to assign it to myself, though?
>
> ________________________________
> From: Yi Pan <ni...@gmail.com>
> Sent: Thursday, June 16, 2016 1:02:59 PM
> To: dev@samza.apache.org
> Subject: Re: Bug in SequenceFileHdfsFileWriter
>
> Hi, Benjamin,
>
> Thanks a lot for reporting this! It makes sense from reading the posts.
> Could you open a JIRA? Are you interested in assigning it to yourself and
> contributing the fix?
>
> Thanks a lot again!
>
> -Yi
>
> On Thu, Jun 16, 2016 at 9:52 AM, Benjamin Smith <
> ben.smith@ranksoftwareinc.com> wrote:
>
> >
> > Hello,
> >
> > I am working on a project where we are integrating Samza and Hive. As
> part
> > of this project, we ran into an issue where sequence files written from
> > Samza were taking a long time (hours) to completely sync with HDFS.
> >
> > After some Googling and digging into the code, it appears that the issue
> > is here:
> >
> >
> https://github.com/apache/samza/blob/master/samza-hdfs/src/main/scala/org/apache/samza/system/hdfs/writer/SequenceFileHdfsWriter.scala#L111
> >
> > Writer.stream(dfs.create(path)) implies that the caller of
> > dfs.create(path) is responsible for closing the created stream
> explicitly.
> > This doesn't happen, and the SequenceFileHdfsWriter call to close will
> only
> > flush the stream.
> >
> > I believe the correct line should be:
> >
> > Writer.file(path)
> >
> > Or, SequenceFileHdfsWriter should explicitly track and close the stream.
> >
> > Thanks!
> >
> > Ben
> >
> > Reference material:
> >
> >
> http://stackoverflow.com/questions/27916872/why-the-sequencefile-is-truncated
> >
> >
> https://apache.googlesource.com/hadoop-common/+/HADOOP-6685/src/java/org/apache/hadoop/io/SequenceFile.java#1238
> >
> >
>



-- 
Jagadish V,
Graduate Student,
Department of Computer Science,
Stanford University

Re: Bug in SequenceFileHdfsFileWriter

Posted by Benjamin Smith <be...@ranksoftwareinc.com>.
Sure, looks like a straightforward enough change.


I've created: https://issues.apache.org/jira/browse/SAMZA-968


I don't see any way to assign it to myself, though?

________________________________
From: Yi Pan <ni...@gmail.com>
Sent: Thursday, June 16, 2016 1:02:59 PM
To: dev@samza.apache.org
Subject: Re: Bug in SequenceFileHdfsFileWriter

Hi, Benjamin,

Thanks a lot for reporting this! It makes sense from reading the posts.
Could you open a JIRA? Are you interested in assigning it to yourself and
contributing the fix?

Thanks a lot again!

-Yi

On Thu, Jun 16, 2016 at 9:52 AM, Benjamin Smith <
ben.smith@ranksoftwareinc.com> wrote:

>
> Hello,
>
> I am working on a project where we are integrating Samza and Hive. As part
> of this project, we ran into an issue where sequence files written from
> Samza were taking a long time (hours) to completely sync with HDFS.
>
> After some Googling and digging into the code, it appears that the issue
> is here:
>
> https://github.com/apache/samza/blob/master/samza-hdfs/src/main/scala/org/apache/samza/system/hdfs/writer/SequenceFileHdfsWriter.scala#L111
>
> Writer.stream(dfs.create(path)) implies that the caller of
> dfs.create(path) is responsible for closing the created stream explicitly.
> This doesn't happen, and the SequenceFileHdfsWriter call to close will only
> flush the stream.
>
> I believe the correct line should be:
>
> Writer.file(path)
>
> Or, SequenceFileHdfsWriter should explicitly track and close the stream.
>
> Thanks!
>
> Ben
>
> Reference material:
>
> http://stackoverflow.com/questions/27916872/why-the-sequencefile-is-truncated
>
> https://apache.googlesource.com/hadoop-common/+/HADOOP-6685/src/java/org/apache/hadoop/io/SequenceFile.java#1238
>
>

Re: Bug in SequenceFileHdfsFileWriter

Posted by Yi Pan <ni...@gmail.com>.
Hi, Benjamin,

Thanks a lot for reporting this! It makes sense from reading the posts.
Could you open a JIRA? Are you interested in assigning it to yourself and
contributing the fix?

Thanks a lot again!

-Yi

On Thu, Jun 16, 2016 at 9:52 AM, Benjamin Smith <
ben.smith@ranksoftwareinc.com> wrote:

>
> Hello,
>
> I am working on a project where we are integrating Samza and Hive. As part
> of this project, we ran into an issue where sequence files written from
> Samza were taking a long time (hours) to completely sync with HDFS.
>
> After some Googling and digging into the code, it appears that the issue
> is here:
>
> https://github.com/apache/samza/blob/master/samza-hdfs/src/main/scala/org/apache/samza/system/hdfs/writer/SequenceFileHdfsWriter.scala#L111
>
> Writer.stream(dfs.create(path)) implies that the caller of
> dfs.create(path) is responsible for closing the created stream explicitly.
> This doesn't happen, and the SequenceFileHdfsWriter call to close will only
> flush the stream.
>
> I believe the correct line should be:
>
> Writer.file(path)
>
> Or, SequenceFileHdfsWriter should explicitly track and close the stream.
>
> Thanks!
>
> Ben
>
> Reference material:
>
> http://stackoverflow.com/questions/27916872/why-the-sequencefile-is-truncated
>
> https://apache.googlesource.com/hadoop-common/+/HADOOP-6685/src/java/org/apache/hadoop/io/SequenceFile.java#1238
>
>