Posted to reviews@spark.apache.org by steveloughran <gi...@git.apache.org> on 2017/01/03 13:49:04 UTC

[GitHub] spark pull request #14731: [SPARK-17159] [streaming]: optimise check for new...

Github user steveloughran commented on a diff in the pull request:

https://github.com/apache/spark/pull/14731#discussion_r94407115

--- Diff: docs/streaming-programming-guide.md ---
@@ -644,17 +644,90 @@ methods for creating DStreams from files as input sources.
</div>
</div>

- Spark Streaming will monitor the directory `dataDirectory` and process any files created in that directory (files written in nested directories not supported). Note that
+ Spark Streaming will monitor the directory `dataDirectory` and process any files created in that directory.
+
+ ++ The files must have the same data format.
+ + A simple directory can be monitored, such as `hdfs://namenode:8040/logs/`.
+ All files directly such a path will be processed as they are discovered.
+ + A POSIX glob pattern can be supplied, such as
+ `hdfs://namenode:8040/logs/2016-??-31`.
+ Here, the DStream will consist of all files directly under those directories
+ matching the regular expression.
--- End diff --

I added a link to the POSIX docs. If you follow them, you eventually end up at some coverage of regexps inside []. The Hadoop glob code actually converts the shell expression to a Java regexp and then compiles it, so it should presumably handle everything that the regexp engine (originally {{java.util.regex}}, currently {{com.google.re2j}}) can compile. That's too much detail for this guide, and something that should really be covered in the Hadoop docs by someone.
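
To make the glob-to-regexp point concrete, here is a minimal sketch of that kind of translation. It is not Hadoop's actual glob code: the class `GlobSketch` and its methods are hypothetical names, it uses {{java.util.regex}} rather than re2j, and it assumes that `*` and `?` never match a `/`. Anything inside [] is passed through to the regexp engine largely untouched, which is why bracket expressions end up with regexp semantics.

```java
import java.util.regex.Pattern;

// Hypothetical illustration only; this is NOT Hadoop's glob implementation.
class GlobSketch {

    // Translate a simplified shell glob into a java.util.regex pattern string.
    static String globToRegex(String glob) {
        StringBuilder regex = new StringBuilder();
        boolean inBrackets = false;
        for (int i = 0; i < glob.length(); i++) {
            char c = glob.charAt(i);
            if (inBrackets) {
                // Inside [...], hand characters to the regexp engine as-is,
                // except that a backslash is kept literal.
                if (c == ']') {
                    inBrackets = false;
                }
                regex.append(c == '\\' ? "\\\\" : String.valueOf(c));
                continue;
            }
            switch (c) {
                case '*':
                    regex.append("[^/]*");   // assumption: '*' stays within one path element
                    break;
                case '?':
                    regex.append("[^/]");    // assumption: '?' is one non-separator character
                    break;
                case '[':
                    inBrackets = true;
                    regex.append('[');
                    // Glob negation [!...] becomes regexp negation [^...]
                    if (i + 1 < glob.length() && glob.charAt(i + 1) == '!') {
                        regex.append('^');
                        i++;
                    }
                    break;
                default:
                    // Escape characters that are special in a regexp but literal in a glob.
                    if ("\\.^$+{}()|]".indexOf(c) >= 0) {
                        regex.append('\\');
                    }
                    regex.append(c);
            }
        }
        return regex.toString();
    }

    // Compile the translated pattern and test a whole name against it.
    static boolean matches(String glob, String name) {
        return Pattern.compile(globToRegex(glob)).matcher(name).matches();
    }

    public static void main(String[] args) {
        System.out.println(globToRegex("2016-??-31"));
        System.out.println(matches("2016-??-31", "2016-12-31"));
    }
}
```

With this sketch, the pattern `2016-??-31` from the diff matches a directory named `2016-12-31` but not `2016-1-31`, since each `?` must consume exactly one character.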
