You are viewing a plain text version of this content. The canonical link for it is here.
Posted to user@spark.apache.org by Nkechi Achara <nk...@googlemail.com> on 2016/10/21 14:53:50 UTC

Issues with reading gz files with Spark Streaming

Hi,

I am using Spark 1.5.0 to read gz files with textFileStream, but when new
files are dropped in the specified directory. I know this is only the case
with gz files as when i extract the file into the directory specified the
files are read on the next window and processed.

My code is here:

val comments = ssc.fileStream[LongWritable, Text,
TextInputFormat]("file:///tmp/", (f: Path) => true, newFilesOnly=false).
      map(pair => pair._2.toString)
    comments.foreachRDD(i => i.foreach(m=> println(m)))

any idea why the gz files are not being recognized.

Thanks in advance,

K

Re: Issues with reading gz files with Spark Streaming

Posted by Steve Loughran <st...@hortonworks.com>.
On 22 Oct 2016, at 20:58, Nkechi Achara <nk...@googlemail.com>> wrote:

I do not use rename, and the files are written to, and then moved to a directory on HDFS in gz format.

in that case there's nothing obvious to mee.

try logging at trace/debug the class:
org.apache.spark.sql.execution.streaming.FileStreamSource


On 22 October 2016 at 15:14, Steve Loughran <st...@hortonworks.com>> wrote:

> On 21 Oct 2016, at 15:53, Nkechi Achara <nk...@googlemail.com>> wrote:
>
> Hi,
>
> I am using Spark 1.5.0 to read gz files with textFileStream, but when new files are dropped in the specified directory. I know this is only the case with gz files as when i extract the file into the directory specified the files are read on the next window and processed.
>
> My code is here:
>
> val comments = ssc.fileStream[LongWritable, Text, TextInputFormat]("file:///tmp/", (f: Path) => true, newFilesOnly=false).
>       map(pair => pair._2.toString)
>     comments.foreachRDD(i => i.foreach(m=> println(m)))
>
> any idea why the gz files are not being recognized.
>
> Thanks in advance,
>
> K

Are the files being written in the directory or renamed in? As you should be using rename() against a filesystem (not an object store) to make sure that the file isn't picked up



Re: Issues with reading gz files with Spark Streaming

Posted by Nkechi Achara <nk...@googlemail.com>.
I do not use rename, and the files are written to, and then moved to a
directory on HDFS in gz format.

On 22 October 2016 at 15:14, Steve Loughran <st...@hortonworks.com> wrote:

>
> > On 21 Oct 2016, at 15:53, Nkechi Achara <nk...@googlemail.com> wrote:
> >
> > Hi,
> >
> > I am using Spark 1.5.0 to read gz files with textFileStream, but when
> new files are dropped in the specified directory. I know this is only the
> case with gz files as when i extract the file into the directory specified
> the files are read on the next window and processed.
> >
> > My code is here:
> >
> > val comments = ssc.fileStream[LongWritable, Text,
> TextInputFormat]("file:///tmp/", (f: Path) => true, newFilesOnly=false).
> >       map(pair => pair._2.toString)
> >     comments.foreachRDD(i => i.foreach(m=> println(m)))
> >
> > any idea why the gz files are not being recognized.
> >
> > Thanks in advance,
> >
> > K
>
> Are the files being written in the directory or renamed in? As you should
> be using rename() against a filesystem (not an object store) to make sure
> that the file isn't picked up
>

Re: Issues with reading gz files with Spark Streaming

Posted by Steve Loughran <st...@hortonworks.com>.
> On 21 Oct 2016, at 15:53, Nkechi Achara <nk...@googlemail.com> wrote:
> 
> Hi, 
> 
> I am using Spark 1.5.0 to read gz files with textFileStream, but when new files are dropped in the specified directory. I know this is only the case with gz files as when i extract the file into the directory specified the files are read on the next window and processed.
> 
> My code is here:
> 
> val comments = ssc.fileStream[LongWritable, Text, TextInputFormat]("file:///tmp/", (f: Path) => true, newFilesOnly=false).
>       map(pair => pair._2.toString)
>     comments.foreachRDD(i => i.foreach(m=> println(m)))
> 
> any idea why the gz files are not being recognized.
> 
> Thanks in advance,
> 
> K

Are the files being written in the directory or renamed in? As you should be using rename() against a filesystem (not an object store) to make sure that the file isn't picked up

---------------------------------------------------------------------
To unsubscribe e-mail: user-unsubscribe@spark.apache.org