Posted to reviews@spark.apache.org by Leolh <gi...@git.apache.org> on 2014/08/26 09:34:11 UTC

[GitHub] spark pull request: [SPARK-3228][Streaming]

GitHub user Leolh opened a pull request:

    https://github.com/apache/spark/pull/2132

    [SPARK-3228][Streaming]

    When I use a DStream to save files to HDFS, it creates a directory and an empty file named "_SUCCESS" for each job created in a batch duration.
    If the source delivers no data for a long time and the batch duration is very short (e.g. 10s), this creates a large number of directories and empty files in HDFS.
    I don't think this is necessary, so I want to modify the saveAsObjectFiles and saveAsTextFiles methods of class DStream so that they create directories and files only when the RDD's partitions size > 0.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/Leolh/spark spark-streaming

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/2132.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #2132
    
----
commit 7eca5c7b323f4ba0e83355e22d0508cfb9381880
Author: leo <le...@leo.localdomain>
Date:   2014-08-26T07:14:13Z

    When DStream save RDD to hdfs , don't create directory and empty file if there are no data received from source in the batch duration .

commit 35678d22a97c059e319b2fe53be69c989a855674
Author: leo <le...@leo.localdomain>
Date:   2014-08-26T07:23:17Z

    modify the code format

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-3228][Streaming]

Posted by tdas <gi...@git.apache.org>.
Github user tdas commented on the pull request:

    https://github.com/apache/spark/pull/2132#issuecomment-53946301
  
    Can you please add a title to the PR? Also, this is a tricky change, as it changes the user-perceived behavior of saveAsXXXFile. If someone has set up a system that expects a new file every batch, irrespective of whether the batch is empty, then this change will break that system.
    
    This functionality can be very easily replicated in user code, by doing
    
    ```
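    dstream.foreachRDD((rdd: RDD[XXX], time: Time) => {
      // Skip batches with no partitions so that no directory or _SUCCESS file is written
      if (rdd.partitions.size > 0) {
        val fileName = prefix + time.milliseconds + suffix
        rdd.saveAsXXXFile(fileName)
      }
    })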
    ```
    
    So I am not convinced that this is a good change, especially because it breaks existing behavior.
    Any thoughts?
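
    For illustration, a minimal sketch of how the user-code approach might look for text output. The object name, the `batchPath` helper, and the `prefix-TIME_IN_MS.suffix` naming are assumptions made for this example, not part of the patch. It checks for records with `take(1)` rather than counting partitions, since an RDD can have partitions and still contain no data.

    ```scala
    import org.apache.spark.rdd.RDD
    import org.apache.spark.streaming.Time
    import org.apache.spark.streaming.dstream.DStream

    object NonEmptyBatchSaver {
      // Hypothetical helper: builds the per-batch output path.
      def batchPath(prefix: String, suffix: String, timeMs: Long): String =
        s"$prefix-$timeMs.$suffix"

      // Saves each batch as text, skipping batches that contain no records.
      def saveNonEmptyAsTextFiles(stream: DStream[String], prefix: String, suffix: String): Unit = {
        stream.foreachRDD { (rdd: RDD[String], time: Time) =>
          // take(1) runs a small job; it also covers RDDs that have
          // partitions but no records.
          if (rdd.take(1).nonEmpty) {
            rdd.saveAsTextFile(batchPath(prefix, suffix, time.milliseconds))
          }
        }
      }
    }
    ```

    Because the check lives in the application, jobs that rely on one output directory per batch keep the existing saveAsTextFiles behavior.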
    
    





[GitHub] spark pull request: [SPARK-3228][Streaming]

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/2132#issuecomment-53385634
  
    Can one of the admins verify this patch?




[GitHub] spark pull request: [SPARK-3228][Streaming]

Posted by mateiz <gi...@git.apache.org>.
Github user mateiz commented on the pull request:

    https://github.com/apache/spark/pull/2132#issuecomment-53622923
  
    When would the RDD not have any partitions? It seems that if you use a reduce, updateStateByKey, or anything like that, we will always have partitions, so this won't save a lot of hassle in most jobs. It would be better if you implement a cleanup process in your application to get rid of these files.
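
    One way such a cleanup might be sketched, using the Hadoop FileSystem API. The object and method names are hypothetical, and the sketch assumes that every sub-directory of the given root is a batch output directory written by the streaming job.

    ```scala
    import org.apache.hadoop.fs.{FileSystem, Path}

    object EmptyBatchCleanup {
      // True when every file has zero length, e.g. only _SUCCESS and empty part files.
      def onlyEmptyFiles(lengths: Seq[Long]): Boolean = lengths.forall(_ == 0L)

      // Deletes each sub-directory of `root` whose entries are all zero-length files.
      // Directories that contain a nested directory (such as an in-progress
      // _temporary directory) are left alone. Returns the deleted paths.
      def removeEmptyBatchDirs(fs: FileSystem, root: Path): Seq[Path] = {
        val batchDirs = fs.listStatus(root).filter(_.isDirectory)
        val emptyDirs = batchDirs.filter { dir =>
          val entries = fs.listStatus(dir.getPath)
          entries.forall(_.isFile) && onlyEmptyFiles(entries.map(_.getLen).toSeq)
        }
        emptyDirs.map(_.getPath).filter(p => fs.delete(p, true)).toSeq
      }
    }
    ```

    An application could call `removeEmptyBatchDirs` periodically, passing a `FileSystem` obtained from its Hadoop configuration and the output root it writes to.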




[GitHub] spark pull request: [SPARK-3228][Streaming]

Posted by tdas <gi...@git.apache.org>.
Github user tdas commented on the pull request:

    https://github.com/apache/spark/pull/2132#issuecomment-62487590
  
    If you are unable to update this patch, would you mind closing it?




[GitHub] spark pull request: [SPARK-3228][Streaming]

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/2132#issuecomment-54694392
  
    Can one of the admins verify this patch?




[GitHub] spark pull request: [SPARK-3228][Streaming]

Posted by Leolh <gi...@git.apache.org>.
Github user Leolh closed the pull request at:

    https://github.com/apache/spark/pull/2132




[GitHub] spark pull request: [SPARK-3228][Streaming]

Posted by tdas <gi...@git.apache.org>.
Github user tdas commented on the pull request:

    https://github.com/apache/spark/pull/2132#issuecomment-54376326
  
    @Leolh Any thoughts on @mateiz's and my comments?


