Posted to user@spark.apache.org by amin mohebbi <am...@yahoo.com.INVALID> on 2018/05/24 06:49:52 UTC

Time series data

Could you please help me understand the performance we can expect from using Spark with a NoSQL store or a TSDB? We receive 1 million meters x 288 readings = 288 million rows (approx. 360 GB per day), so we will end up with 10's or 100's of TBs of data, and I feel that NoSQL will be much quicker than Hadoop/Spark. This is time series data arriving from many devices in the form of flat files; it is currently extracted/transformed/loaded into another database that is connected to BI tools. We might use Azure Data Factory to collect the flat files, then use Spark to do the ETL (not sure if that is the correct way), then use Spark to join tables or do the aggregations and save the results into a database (preferably NoSQL, not sure). Finally, we would connect Power BI to visualize the data from the NoSQL database. My questions are:

1- Is the above the correct architecture? I think using Spark with NoSQL could give us random access and let many different users run queries. (A rough sketch of the Spark step follows below.)
2- Do we really need to use a time series DB?
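For illustration, a minimal sketch of the Spark ETL/aggregation step described above, assuming CSV flat files with hypothetical columns meter_id, ts, and reading (all paths and names here are made up):

  import org.apache.spark.sql.SparkSession
  import org.apache.spark.sql.functions._

  val spark = SparkSession.builder().appName("meter-etl").getOrCreate()

  // Hypothetical layout: one CSV row per meter reading.
  val readings = spark.read
    .option("header", "true")
    .option("inferSchema", "true")
    .csv("/landing/meter-readings/")

  // 288 readings per meter per day collapse into one row per meter per day.
  val daily = readings
    .withColumn("day", to_date(col("ts")))
    .groupBy("meter_id", "day")
    .agg(sum("reading").as("total"),
         avg("reading").as("mean"),
         max("reading").as("peak"))

  // Write partitioned Parquet; a NoSQL or TSDB Spark connector could replace this sink.
  daily.write.mode("overwrite").partitionBy("day").parquet("/curated/meter-daily/")

The Parquet sink is just one option; Power BI could read the aggregates from whatever store the last step writes to.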

Best Regards
.......................................................
Amin Mohebbi
PhD candidate in Software Engineering at University of Malaysia
Tel : +60 18 2040 017
E-Mail : TP025921@ex.apiit.edu.my
amin_524@me.com

Re: Time series data

Posted by Jörn Franke <jo...@gmail.com>.
There is no single answer to this.

It really depends on what kind of time series analysis you do with the data and what time series database you are using. It also depends on what ETL you need to do.
You also seem to need to join data: is it with existing data of the same type, or do you join completely different data? If so, where does that data come from?
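For illustration, if the join is against small reference data (say, a hypothetical meter-to-customer table), broadcasting the small side is the usual Spark pattern; paths and column names below are assumptions:

  import org.apache.spark.sql.SparkSession
  import org.apache.spark.sql.functions.broadcast

  val spark = SparkSession.builder().appName("meter-join").getOrCreate()

  // Large fact table of readings plus a small, hypothetical dimension table.
  val readings = spark.read.parquet("/curated/meter-daily/")
  val meters   = spark.read.parquet("/reference/meters/")  // meter_id -> customer, region, ...

  // Broadcasting the small side joins the large table without a full shuffle.
  val enriched = readings.join(broadcast(meters), Seq("meter_id"))

If both sides are large, a plain shuffle join (possibly after bucketing both tables on the join key) would be the fallback.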

360 GB/day uncompressed does not sound like terribly much.
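For a rough sense of scale: 360 GB over 288 million rows is about 1.25 KB per row uncompressed, and roughly 130 TB per year; a compressed columnar format such as Parquet typically shrinks that considerably.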


Re: Time series data

Posted by Vadim Semenov <va...@datadoghq.com>.
Yeah, it depends on what you want to do with that time series data. We at
Datadog process trillions of points daily using Spark. I cannot really go
into what exactly we do with the data, but suffice it to say that Spark can
handle the volume, scale well, and be fault-tolerant, albeit everything I
said comes with multiple asterisks.


