You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@spark.apache.org by pw...@apache.org on 2014/01/14 04:45:41 UTC

[2/2] git commit: Merge pull request #411 from tdas/filestream-fix

Merge pull request #411 from tdas/filestream-fix

Improved logic of finding new files in FileInputDStream

Earlier, if HDFS has a hiccup and reports a existence of a new file (mod time T sec) at time T + 1 sec, then fileStream could have missed that file. With this change, it should be able to find files that are delayed by up to <batch size> seconds. That is, even if file is reported at T + <batch time> sec, file stream should be able to catch it.

The new logic, at a high level, is as follows. It keeps track of the new files it found in the previous interval and mod time of the oldest of those files (lets call it X). Then in the current interval, it will ignore those files that were seen in the previous interval and those which have mod time older than X. So if a new file gets reported by HDFS that in the current interval, but has mod time in the previous interval, it will be considered. However, if the mod time earlier than the previous interval (that is, earlier than X), they will be ignored. This is the current limitation, and future version would improve this behavior further.

Also reduced line lengths in DStream to <=100 chars.


Project: http://git-wip-us.apache.org/repos/asf/incubator-spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/incubator-spark/commit/a2fee38e
Tree: http://git-wip-us.apache.org/repos/asf/incubator-spark/tree/a2fee38e
Diff: http://git-wip-us.apache.org/repos/asf/incubator-spark/diff/a2fee38e

Branch: refs/heads/master
Commit: a2fee38ee054c7dd6ff5f5d72f036fef54194d53
Parents: 01c0d72 c0bb38e
Author: Patrick Wendell <pw...@gmail.com>
Authored: Mon Jan 13 19:45:26 2014 -0800
Committer: Patrick Wendell <pw...@gmail.com>
Committed: Mon Jan 13 19:45:26 2014 -0800

----------------------------------------------------------------------
 .../spark/streaming/dstream/DStream.scala       | 49 +++++++-----
 .../streaming/dstream/FileInputDStream.scala    | 82 ++++++++++----------
 2 files changed, 69 insertions(+), 62 deletions(-)
----------------------------------------------------------------------