Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2019/04/30 04:58:13 UTC

[GitHub] [spark] xor007 commented on issue #14918: [SPARK-17360][PYSPARK] Support generator in createDataFrame

xor007 commented on issue #14918: [SPARK-17360][PYSPARK] Support generator in createDataFrame
URL: https://github.com/apache/spark/pull/14918#issuecomment-487821340
 
 
   > Do we have any usecases or benchmarks for cases where this would be helpful?
   
   Yes. My main use case, which I am surprised more people in industry don't seem to share, is **massive data mining**:
   
   - You have a lot of files from the internet (for instance, text from a large collection of web pages).
   - You can write a Python generator that goes through the files to find and output sentences containing the word "covfefe". I have seen a Python generator go through a real 90 GB collection of 11,000 such files within minutes (they were already downloaded).
   - You want to create a DataFrame of all those sentences, and the complete set of matching sentences ends up being less than 20 MB.
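
   A minimal sketch of the kind of generator described in the list above. The function name `sentences_containing` and the naive sentence splitter are assumptions made for illustration; sentences that span several lines are not handled.

```python
import re
from pathlib import Path
from typing import Iterable, Iterator

# Naive sentence splitter (assumption): break after '.', '!' or '?' followed by whitespace.
_SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


def sentences_containing(paths: Iterable[Path], word: str) -> Iterator[str]:
    """Lazily yield every sentence in the given files that contains `word`."""
    pattern = re.compile(rf"\b{re.escape(word)}\b", re.IGNORECASE)
    for path in paths:
        # Read line by line so that no file is ever held in memory as a whole.
        with open(path, encoding="utf-8", errors="ignore") as handle:
            for line in handle:
                for sentence in _SENTENCE_END.split(line.strip()):
                    if pattern.search(sentence):
                        yield sentence
```

   Because nothing is materialized until the generator is consumed, memory use stays close to one line at a time, however large the collection is.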
   
   You could create a Dataset from the generator. Now that I have written this, it seems I can instead run `flatMap` on the files, with the generator as the transformation.
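
   A rough sketch of that `flatMap` route, under assumptions: `covfefe_sentences` and `build_dataframe` are hypothetical names, `spark` is an existing `SparkSession`, and every path is readable from every executor (for example on a shared filesystem). Matching is done per line here to keep the example short.

```python
from typing import Iterator, List


def covfefe_sentences(path: str) -> Iterator[str]:
    """Generator that runs on the executors: yield matching lines from one file."""
    with open(path, encoding="utf-8", errors="ignore") as handle:
        for line in handle:
            if "covfefe" in line.lower():
                yield line.strip()


def build_dataframe(spark, paths: List[str]):
    """Distribute the file paths, then let flatMap drain the generator per path."""
    rows = (
        spark.sparkContext
        .parallelize(paths, numSlices=max(1, len(paths)))
        .flatMap(covfefe_sentences)            # one generator per input path
        .map(lambda sentence: (sentence,))     # wrap as a one-column row
    )
    return spark.createDataFrame(rows, schema="sentence string")
```

   With this shape only the list of paths goes through the driver; the scanning itself happens in parallel on the executors.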
   
   But something like `DataFrame.from_generator` in Spark would be nice.
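
   For illustration only, here is one way such a helper could be approximated in user code today. `from_generator` and `batched` are hypothetical names, not part of PySpark; the sketch assumes `spark` is a `SparkSession`, that the generator yields tuples matching `schema`, and that the final result is small enough to pass through the driver.

```python
from functools import reduce
from itertools import islice


def batched(iterable, size):
    """Yield successive lists of at most `size` items from `iterable`."""
    iterator = iter(iterable)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


def from_generator(spark, generator, schema, batch_size=10_000):
    """Hypothetical helper: build a DataFrame from a generator, batch by batch."""
    frames = [
        spark.createDataFrame(batch, schema)
        for batch in batched(generator, batch_size)
    ]
    if not frames:
        # An empty generator still produces a DataFrame with the right schema.
        return spark.createDataFrame([], schema)
    return reduce(lambda left, right: left.union(right), frames)
```

   Only one batch is held as a Python list at a time, but many batches mean a long chain of unions, so this only suits modest outputs such as the one described above.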
