Posted to user@spark.apache.org by Mich Talebzadeh <mi...@gmail.com> on 2016/09/14 15:28:39 UTC

Reading the most recent text files created by Spark streaming

Hi,

I have a Spark Streaming job that reads messages/prices from Kafka and writes
them as text files to HDFS.

This is pretty efficient. Its only function is to persist the incoming
messages to HDFS.

This is what it does:
     dstream.foreachRDD { pricesRDD =>
       // Count the batch to check whether any messages came in
       val x = pricesRDD.count
       if (x > 0) {
         // Combine each partition's results into a single output file
         val cachedRDD = pricesRDD.repartition(1).cache
         cachedRDD.saveAsTextFile("/data/prices/prices_" + System.currentTimeMillis.toString)
....

So these are the files in the HDFS directory:

drwxr-xr-x   - hduser supergroup          0 2016-09-14 15:11 /data/prices/prices_1473862284010
drwxr-xr-x   - hduser supergroup          0 2016-09-14 15:11 /data/prices/prices_1473862288010
drwxr-xr-x   - hduser supergroup          0 2016-09-14 15:11 /data/prices/prices_1473862290010
drwxr-xr-x   - hduser supergroup          0 2016-09-14 15:11 /data/prices/prices_1473862294010

Now I present these prices in Zeppelin. The files are produced every 2
seconds, but when I come to plot them I am only interested in, say, one
hour's data.

I cater for this by filtering on the prices (each row has a TIMECREATED).
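
For illustration, the Zeppelin paragraph currently does something along
these lines (the Price case class and the column layout here are
illustrative, not my actual code):

    import java.time.LocalDateTime

    // Hypothetical row layout: "key,timecreated,price" -- the real schema may differ
    case class Price(key: String, timeCreated: String, price: Double)

    // Load every file under /data/prices, then filter on TIMECREATED
    val allPrices = sc.textFile("/data/prices/prices_*").map { line =>
      val cols = line.split(',')
      Price(cols(0), cols(1), cols(2).toDouble)
    }

    // Assumes TIMECREATED is an ISO-8601 string, so string comparison works
    val oneHourAgo = LocalDateTime.now.minusHours(1).toString
    val lastHour = allPrices.filter(_.timeCreated >= oneHourAgo)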


I don't think this is efficient, as I don't want to load all these files. I
just want to read the prices created in the past hour or so.


One thing I considered was to load all prices by converting
System.currentTimeMillis into today's date and fetching the most recent
ones. However, this looks cumbersome. I could create these files with any
timestamp suffix when persisting, but System.currentTimeMillis seems the
most efficient.
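
One variation I could try (just a sketch, untested): since each directory
name already embeds its creation time in epoch milliseconds, I could list
/data/prices with the Hadoop FileSystem API and load only the directories
from the past hour:

    import org.apache.hadoop.fs.{FileSystem, Path}

    val fs = FileSystem.get(sc.hadoopConfiguration)
    val oneHourAgo = System.currentTimeMillis - 3600 * 1000L

    // Keep directories whose prices_<epochMillis> suffix is recent;
    // assumes everything under /data/prices follows that naming pattern
    val recentDirs = fs.listStatus(new Path("/data/prices"))
      .map(_.getPath)
      .filter(p => p.getName.stripPrefix("prices_").toLong >= oneHourAgo)

    // sc.textFile accepts a comma-separated list of paths
    if (recentDirs.nonEmpty) {
      val lastHourPrices = sc.textFile(recentDirs.mkString(","))
    }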


Any alternatives you can think of?


Thanks


Dr Mich Talebzadeh



LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw



http://talebzadehmich.wordpress.com


Disclaimer: Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.

Re: Reading the most recent text files created by Spark streaming

Posted by Mich Talebzadeh <mi...@gmail.com>.
Yes, thanks. I had Flume already for Twitter, so I configured it to get data
from the Kafka source and post it to HDFS.
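
For reference, the agent definition is along these lines (Flume 1.6-style
property names; the topic, ZooKeeper address and paths are illustrative):

    # Kafka source -> memory channel -> HDFS sink
    agent.sources = kafkaSrc
    agent.channels = memCh
    agent.sinks = hdfsSnk

    agent.sources.kafkaSrc.type = org.apache.flume.source.kafka.KafkaSource
    agent.sources.kafkaSrc.zookeeperConnect = localhost:2181
    agent.sources.kafkaSrc.topic = prices
    agent.sources.kafkaSrc.channels = memCh

    agent.channels.memCh.type = memory

    agent.sinks.hdfsSnk.type = hdfs
    agent.sinks.hdfsSnk.channel = memCh
    agent.sinks.hdfsSnk.hdfs.path = /data/prices
    agent.sinks.hdfsSnk.hdfs.fileType = DataStream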

cheers

Dr Mich Talebzadeh



Re: Reading the most recent text files created by Spark streaming

Posted by Jörn Franke <jo...@gmail.com>.
Hi,

An alternative to Spark could be Flume to store data from Kafka to HDFS. It also provides some reliability mechanisms, has been explicitly designed for import/export, and is tested. I am not sure I would go for Spark Streaming if the use case is only storing, but I do not have the full picture of your use case.

Anyway, what you could do is create a directory per hour/day etc. (whatever you need) and put the corresponding files there. If there are a lot of small files, you can put them into a Hadoop Archive (HAR) to reduce load on the NameNode.
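
A minimal sketch of the directory-per-hour idea on the Spark side (the
date pattern and paths are illustrative): derive the output directory from
the batch time so each hour's files land together:

    import java.time.{Instant, ZoneId}
    import java.time.format.DateTimeFormatter

    // e.g. /data/prices/2016-09-14/15/prices_1473862284010
    val hourFmt = DateTimeFormatter.ofPattern("yyyy-MM-dd/HH").withZone(ZoneId.of("UTC"))

    dstream.foreachRDD { (pricesRDD, batchTime) =>
      if (!pricesRDD.isEmpty) {
        val hourDir = hourFmt.format(Instant.ofEpochMilli(batchTime.milliseconds))
        pricesRDD.repartition(1)
          .saveAsTextFile(s"/data/prices/$hourDir/prices_${batchTime.milliseconds}")
      }
    }

Zeppelin then only has to read the one or two hour directories covering
the window, and a completed hour could be packed afterwards with something
like: hadoop archive -archiveName prices_15.har -p /data/prices/2016-09-14 15 /data/prices_har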

Best regards
