Posted to user@spark.apache.org by Brandon White <bw...@gmail.com> on 2015/07/31 00:55:10 UTC

Does Spark Streaming need to list all the files in a directory?

Is this a known bottleneck for Spark Streaming textFileStream? Does it
need to list all the current files in a directory before it gets the new
files? Say I have 500k files in a directory; does it list them all in order
to get the new files?

Re: Does Spark Streaming need to list all the files in a directory?

Posted by Akhil Das <ak...@sigmoidanalytics.com>.
I guess it goes through all 500k files
<https://github.com/apache/spark/blob/master/streaming/src/main/scala/org/apache/spark/streaming/dstream/FileInputDStream.scala#L193>
the first time and then uses a filter from then on.

Thanks
Best Regards

On Fri, Jul 31, 2015 at 4:39 AM, Tathagata Das <td...@databricks.com> wrote:

> The first time, it needs to list them. After that, the list should be
> cached by the file stream implementation (as far as I remember).
>
>
> On Thu, Jul 30, 2015 at 3:55 PM, Brandon White <bw...@gmail.com>
> wrote:
>
>> Is this a known bottleneck for Spark Streaming textFileStream? Does it
>> need to list all the current files in a directory before it gets the new
>> files? Say I have 500k files in a directory; does it list them all in order
>> to get the new files?
>>
>
>

Re: Does Spark Streaming need to list all the files in a directory?

Posted by Tathagata Das <td...@databricks.com>.
The first time, it needs to list them. After that, the list should be
cached by the file stream implementation (as far as I remember).
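
To make the "list once, then filter" idea in this thread concrete, here is a
minimal plain-Python sketch. It is not Spark's FileInputDStream code and makes
no claim about what Spark caches; find_new_files, seen, and min_mtime are
hypothetical names chosen for illustration. Note that in this sketch every
scan still iterates over the whole directory, and only the filtering step
keeps the returned list small.

```python
import os
import tempfile
import time


def find_new_files(directory, seen, min_mtime):
    """Return files in `directory` that were not returned by an earlier scan.

    Hypothetical helper, not Spark code. `seen` is a set of paths that is
    updated in place; `min_mtime` is a modification-time threshold in
    seconds since the epoch.
    """
    new_files = []
    with os.scandir(directory) as entries:
        for entry in entries:  # one pass over every entry, old or new
            if not entry.is_file():
                continue
            if entry.path in seen:
                continue  # already picked up by an earlier scan
            if entry.stat().st_mtime < min_mtime:
                continue  # older than the threshold, so not treated as new
            new_files.append(entry.path)
    seen.update(new_files)  # remember the selection for the next scan
    return sorted(new_files)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as directory:
        seen = set()
        threshold = time.time() - 60.0  # accept files from the last minute
        for name in ("a.txt", "b.txt"):
            open(os.path.join(directory, name), "w").close()
        print(find_new_files(directory, seen, threshold))  # first scan
        open(os.path.join(directory, "c.txt"), "w").close()
        print(find_new_files(directory, seen, threshold))  # second scan
```

In this sketch the second scan returns only the file added after the first
one, but it still has to walk all entries to find it, which is the cost the
original question is asking about for a directory with 500k files.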


On Thu, Jul 30, 2015 at 3:55 PM, Brandon White <bw...@gmail.com>
wrote:

> Is this a known bottleneck for Spark Streaming textFileStream? Does it
> need to list all the current files in a directory before it gets the new
> files? Say I have 500k files in a directory; does it list them all in order
> to get the new files?
>