Posted to user@spark.apache.org by Akhil Das <ak...@sigmoidanalytics.com> on 2015/08/02 10:03:26 UTC

Re: Does Spark Streaming need to list all the files in a directory?

I guess it goes through those 500k files
<https://github.com/apache/spark/blob/master/streaming/src/main/scala/org/apache/spark/streaming/dstream/FileInputDStream.scala#L193>
the first time and then uses a filter from the next time on.
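
To illustrate the general idea only, here is a rough Python sketch. It is not
the FileInputDStream code, and the names new_files, seen and min_mtime are made
up for this example. It assumes "new" means "not returned before and modified
at or after some threshold":

```python
import os


def new_files(directory, seen, min_mtime):
    """Return paths in `directory` that were not returned before and whose
    modification time is at or after `min_mtime` (hypothetical helper)."""
    fresh = []
    # The directory is still listed on every call; the filter only decides
    # which entries count as new.
    for entry in os.scandir(directory):
        if not entry.is_file():
            continue
        if entry.path in seen:
            continue  # already handed out in an earlier batch
        if entry.stat().st_mtime < min_mtime:
            continue  # too old to be considered
        fresh.append(entry.path)
    seen.update(fresh)  # remember them so the next call skips them
    return sorted(fresh)
```

Calling new_files(path, seen, 0.0) repeatedly with the same seen set returns
each file once. Note that in this sketch the listing itself still happens on
every call, so a very large directory would stay expensive to scan.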

Thanks
Best Regards

On Fri, Jul 31, 2015 at 4:39 AM, Tathagata Das <td...@databricks.com> wrote:

> The first time, it needs to list them. After that, the list should be
> cached by the file stream implementation (as far as I remember).
>
>
> On Thu, Jul 30, 2015 at 3:55 PM, Brandon White <bw...@gmail.com>
> wrote:
>
>> Is this a known bottleneck for Spark Streaming textFileStream? Does it
>> need to list all the current files in a directory before it gets the new
>> files? Say I have 500k files in a directory; does it list them all in order
>> to get the new files?
>>
>
>