Posted to user@spark.apache.org by Andy Davidson <An...@SantaCruzIntegration.com> on 2015/10/24 03:35:34 UTC

streaming.twitter.TwitterUtils what is the best way to save twitter status to HDFS?

I need to save the twitter status I receive so that I can do additional
batch based processing on them in the future. Is it safe to assume HDFS is
the best way to go?

Any idea what is the best way to save twitter status to HDFS?

        JavaStreamingContext ssc = new JavaStreamingContext(jsc, new Duration(1000));

        Authorization twitterAuth = setupTwitterAuthorization();

        JavaDStream<Status> tweets = TwitterFilterQueryUtils.createStream(ssc, twitterAuth, query);



http://spark.apache.org/docs/latest/streaming-programming-guide.html#output-operations-on-dstreams



saveAsHadoopFiles(prefix, [suffix]): Save this DStream's contents as Hadoop
files. The file name at each batch interval is generated based on prefix and
suffix: "prefix-TIME_IN_MS[.suffix]".
(Python API: this is not available in the Python API.)
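The naming rule quoted above ("prefix-TIME_IN_MS[.suffix]") can be illustrated with a small stand-alone sketch. This is plain Java with no Spark dependency; the helper name `batchFileName` is made up for illustration and is not part of the Spark API:

```java
// Illustrates the batch output naming rule from the Spark docs:
// "prefix-TIME_IN_MS[.suffix]" -- the suffix part is optional.
public class BatchFileName {

    // Hypothetical helper, not a Spark API method.
    static String batchFileName(String prefix, String suffix, long timeInMs) {
        if (suffix == null || suffix.isEmpty()) {
            return prefix + "-" + timeInMs;
        }
        return prefix + "-" + timeInMs + "." + suffix;
    }

    public static void main(String[] args) {
        // With a Duration(1000) batch interval, one such name is
        // generated every second, timestamped in milliseconds.
        System.out.println(batchFileName("tweets", "txt", 1445650534000L));
        System.out.println(batchFileName("tweets", null, 1445650535000L));
    }
}
```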


However, JavaDStream<> does not support any saveAs* functions



        DStream<Status> dStream = tweets.dstream();


http://spark.apache.org/docs/latest/api/java/org/apache/spark/streaming/dstream/DStream.html

DStream<Status> only supports saveAsObjectFiles
<http://spark.apache.org/docs/latest/api/java/org/apache/spark/streaming/dstream/DStream.html#saveAsObjectFiles(java.lang.String,%20java.lang.String)>
and saveAsTextFiles
<http://spark.apache.org/docs/latest/api/java/org/apache/spark/streaming/dstream/DStream.html#saveAsTextFiles(java.lang.String,%20java.lang.String)>


saveAsTextFiles

public void saveAsTextFiles(java.lang.String prefix,
                            java.lang.String suffix)

Save each RDD in this DStream as a text file, using the string representation
of elements. The file name at each batch interval is generated based on
prefix and suffix: "prefix-TIME_IN_MS.suffix".


Any idea where I would find these files? I assume they will be spread out
all over my cluster?


I also wonder whether using the saveAs*() functions is going to cause other
problems. My duration is set to 1 sec. Am I going to overwhelm the system
with a bunch of tiny files? Many of them will be empty.
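To put rough numbers on the small-files concern: each saveAs* call emits one batch output per interval, so the count grows with wall-clock time regardless of how much data each batch holds. A back-of-the-envelope sketch (plain Java, no Spark; the helper is illustrative only):

```java
// Rough count of batch outputs produced per day for a given
// Spark Streaming batch duration. Each output is a directory of
// part files -- many of them empty during quiet periods.
public class SmallFilesEstimate {

    // Illustrative helper, not a Spark API method.
    static long batchesPerDay(long batchDurationMs) {
        return 24L * 60 * 60 * 1000 / batchDurationMs;
    }

    public static void main(String[] args) {
        // A 1000 ms duration yields 86,400 batch outputs per day;
        // widening the window to 1 minute drops that to 1,440.
        System.out.println(batchesPerDay(1000));   // 86400
        System.out.println(batchesPerDay(60_000)); // 1440
    }
}
```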



Kind regards



Andy



Re: streaming.twitter.TwitterUtils what is the best way to save twitter status to HDFS?

Posted by Akhil Das <ak...@sigmoidanalytics.com>.
You can use .saveAsObjectFiles("hdfs://sigmoid/twitter/status/"), since you
want to store the Status objects. For every batch it will create a directory
under /status (the name will mostly be the timestamp). Since the data is
small (hardly a couple of MBs for a 1-sec interval), it will not overwhelm
the cluster.
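That size estimate can be sanity-checked with some hedged arithmetic, assuming roughly 2 MB per 1-second batch as an upper bound (plain Java, illustrative helper only):

```java
// Rough daily storage volume for a streaming job that writes a
// fixed payload per batch. Assumes ~2 MB/batch, which is only an
// estimate -- the real figure depends on the tweet filter.
public class VolumeEstimate {

    // Illustrative helper, not a Spark API method.
    static double gbPerDay(double mbPerBatch, long batchDurationMs) {
        long batches = 24L * 60 * 60 * 1000 / batchDurationMs;
        return batches * mbPerBatch / 1024.0;
    }

    public static void main(String[] args) {
        // ~2 MB every second works out to 168.75 GB/day: easily
        // absorbed by a cluster, so the per-batch file count (not
        // the total volume) is the thing to watch.
        System.out.println(gbPerDay(2.0, 1000));
    }
}
```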

Thanks
Best Regards

On Sat, Oct 24, 2015 at 7:05 AM, Andy Davidson <
Andy@santacruzintegration.com> wrote:

> I need to save the twitter status I receive so that I can do additional
> batch based processing on them in the future. Is it safe to assume HDFS is
> the best way to go?
>
> Any idea what is the best way to save twitter status to HDFS?
>
>         JavaStreamingContext ssc = new JavaStreamingContext(jsc, new
> Duration(1000));
>
>         Authorization twitterAuth = setupTwitterAuthorization();
>
>         JavaDStream<Status> tweets = TwitterFilterQueryUtils.createStream(
> ssc, twitterAuth, query);
>
>
>
> http://spark.apache.org/docs/latest/streaming-programming-guide.html#output-operations-on-dstreams
>
>
>
> *saveAsHadoopFiles*(*prefix*, [*suffix*])Save this DStream's contents as
> Hadoop files. The file name at each batch interval is generated based on
> *prefix* and *suffix*: *"prefix-TIME_IN_MS[.suffix]"*.
> Python API This is not available in the Python API.
>
> How ever JavaDStream<> does not support any savesAs* functions
>
>
>         DStream<Status> dStream = tweets.dstream();
>
>
> http://spark.apache.org/docs/latest/api/java/org/apache/spark/streaming/dstream/DStream.html
>
> DStream<Status> only supports *saveAsObjectFiles
> <http://spark.apache.org/docs/latest/api/java/org/apache/spark/streaming/dstream/DStream.html#saveAsObjectFiles(java.lang.String,%20java.lang.String)>()and **saveAsTextFiles
> <http://spark.apache.org/docs/latest/api/java/org/apache/spark/streaming/dstream/DStream.html#saveAsTextFiles(java.lang.String,%20java.lang.String)>*
> (()
>
>
> saveAsTextFiles
>
> public void saveAsTextFiles(java.lang.String prefix,
>                    java.lang.String suffix)
>
> Save each RDD in this DStream as at text file, using string representation
> of elements. The file name at each batch interval is generated based on
> prefix andsuffix: "prefix-TIME_IN_MS.suffix”.
>
>
> Any idea where I would find these files? I assume they will be spread out
> all over my cluster?
>
>
> Also I wonder if using the saveAs*() functions are going to cause other
> problems. My duration is set to 1 sec. Am I going to overwhelm the system
> with a bunch of tiny files? Many of them will be empty
>
>
> Kind regards
>
>
> Andy
>