Posted to user@spark.apache.org by Mich Talebzadeh <mi...@gmail.com> on 2016/09/17 16:53:28 UTC

Is there such a thing as cache fusion with the underlying tables/files on HDFS

Hi,

I am seeing issues similar to those I encountered when working on Oracle with
Tableau as the dashboard.

Currently I have a batch layer that gets streaming data from

source -> Kafka -> Flume -> HDFS

The data is stored on HDFS as text files, and a cron process syncs the Hive
table, with the external table built on the directory. I tried both ORC and
Parquet, but I don't think the query itself is the issue.
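
(For illustration only, a minimal sketch of the kind of external table I mean,
issued through Spark SQL with Hive support; the database, columns and HDFS
path below are placeholders, not the actual build.)

    import org.apache.spark.sql.SparkSession

    // Sketch only: Hive external table over the Flume landing directory.
    // Database, columns and location are placeholders.
    val spark = SparkSession.builder()
      .appName("ExternalTableSketch")
      .enableHiveSupport()
      .getOrCreate()

    spark.sql(
      """CREATE EXTERNAL TABLE IF NOT EXISTS test.prices_external (
        |  ticker STRING,
        |  price  DOUBLE,
        |  ts     STRING
        |)
        |ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
        |STORED AS TEXTFILE
        |LOCATION 'hdfs://namenode:9000/data/prices/flume'""".stripMargin)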

Meaning it does not matter how clever your execution engine is: the fact that
you still have to do a considerable amount of Physical IO (PIO), as opposed
to Logical IO (LIO), to get the data to Zeppelin is on the critical path.

One option is to limit the amount of data in Zeppelin to a certain number of
rows or something similar. However, you cannot tell a user that he/she cannot
see the full data.

We resolved this with Oracle by using Oracle TimesTen IMDB
<http://www.oracle.com/technetwork/database/database-technologies/timesten/overview/index.html>
to cache certain tables in memory and have them refreshed (depending on
refresh frequency) from the underlying tables in Oracle when the data is
updated. That is done through cache fusion.

I was looking around and came across Alluxio <http://www.alluxio.org/>.
Ideally I would like to utilise a concept like TimesTen. Can one distribute
Hive table data (or any table data) across the nodes, cached in memory? In
that case we would be doing Logical IO, which is roughly 20 times or more
lighter-weight than Physical IO.
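
(As an aside, the closest Spark-native approximation I know of is Spark SQL's
CACHE TABLE, which pulls a Hive table into the executors' memory across the
nodes; a minimal sketch, with a placeholder table name and assuming the usual
spark SparkSession from a Zeppelin/spark-shell session.)

    // Sketch only: cache a Hive table in Spark's distributed memory so that
    // subsequent Zeppelin/Spark SQL queries do LIO rather than PIO.
    // "test.prices" is a placeholder table name.
    spark.sql("CACHE TABLE test.prices")   // eager by default, materialises the cache now

    // Later queries are served from the in-memory columnar cache.
    spark.sql("SELECT ticker, MAX(price) AS max_price FROM test.prices GROUP BY ticker").show(10)

    // Release the memory when done.
    spark.sql("UNCACHE TABLE test.prices")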

Anyway this is the concept.

Thanks

Dr Mich Talebzadeh



LinkedIn  https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw



http://talebzadehmich.wordpress.com


Disclaimer: Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.

Re: Is there such a thing as cache fusion with the underlying tables/files on HDFS

Posted by Mich Talebzadeh <mi...@gmail.com>.
Good points

Well, the batch layer will be able to read the streamed data from the Flume
files if needed, using Spark CSV. It may take a bit longer, but that is not
the focus of the batch layer.

All real-time data will come through the speed layer using Spark Streaming,
where the real-time alerts/notifications will also be produced. A case in
point: immediate notification of liquidity risk associated with a certain
security.
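
(Purely as an illustration of that kind of speed-layer alert, a minimal Spark
Streaming sketch; the source, schema and threshold below are placeholders
rather than the actual implementation.)

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    // Sketch only: flag securities whose price breaches a placeholder threshold.
    val conf = new SparkConf().setAppName("LiquidityAlertSketch")
    val ssc  = new StreamingContext(conf, Seconds(10))

    // For illustration, watch the Flume landing directory; a real speed layer
    // would more likely consume Kafka directly.
    val lines = ssc.textFileStream("hdfs://namenode:9000/data/prices/flume")

    // Assume CSV lines of the form: ticker,price,timestamp
    val alerts = lines
      .map(_.split(","))
      .filter(_.length >= 3)
      .map(a => (a(0), a(1).toDouble))
      .filter { case (_, price) => price > 100.0 }   // placeholder risk condition

    alerts.foreachRDD { rdd =>
      rdd.take(10).foreach { case (ticker, price) =>
        println(s"ALERT: $ticker at $price breaches the liquidity threshold")
      }
    }

    ssc.start()
    ssc.awaitTermination()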

Combined data will be on offer through the serving layer, and there we may
need to create pre-aggregated data in the batch layer to be combined with
real-time data from the speed layer.

Cheers

Dr Mich Talebzadeh



LinkedIn  https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw



http://talebzadehmich.wordpress.com


Disclaimer: Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.



On 18 September 2016 at 11:08, Jörn Franke <jo...@gmail.com> wrote:

> Ignite has a special cache for HDFS data (which is not a Java cache), for
> rdds etc. So you are right it is in this sense very different.
>
> Besides caching, from what I see from data scientists is that for
> interactive queries and models evaluation they anyway do not browse the
> complete data. Even with in-memory solutions this is painful slow if you
> receive several TB of data by hour.
>
> What they do is sampling, e.g.select relevant small subset of data,
> evaluate several different models on the sampled data in "real time" and
> then calculate the winning model as batch later.
>
> Additionally probabilistic data structures are employed in some cases. For
> example if you want to count the number of unique viewers of a web site it
> does not make sense to browse through the logs for userids all the time, by
>  employ a hyperloglog structure which needs little money and can be
> accessed in real time.
>
> For the case of visualizations, I think in the area of big data it makes
> also very sense to visualize aggregations based on sampling. If you need
> really the last 0,0001% of precision then you can click on the
> visualization and the system takes some time to calculate it.
>
> On 18 Sep 2016, at 10:54, Mich Talebzadeh <mi...@gmail.com>
> wrote:
>
> Thanks everyone for ideas.
>
> Sounds like Ignite has been taken by GridGain  so becomes similar to
> HazelCast open source by name only. However, an in-memory Java Cache may or
> may not help.
>
> The other options like faster databases are on the table depending who
> wants what (that are normally decisions that includes more than technical
> criteria). Example if the customer already had Tableau, persuading them to
> go for QlickView (as an example) may not work.
>
> So my view is to build the batch layer foundation and leave these finer
> choices to the customer. We will offer Zeppelin with Parquet and ORC with a
> certain refresh of these tables and let the customer decide. I stand
> corrected otherwise.
>
> BTW I did these simple test on using Zeppelin (running on Spark Standalone
> mode)
>
> 1) Read data using Spark sql from Flume text files on HDFS (real time)
> 2) Read data using Spark sql from ORC table in Hive (lagging by 15 min)
> 3) Read data using Spark sql from Parquet table in Hive(lagging by 15 min)
>
> Timings
>
> 1)            2 min, 16 sec
> 2)            1 min, 1 sec
> 3)            1 min, 6 sec
>
> So unless one splits the atom, ORC or Parquet on Hive look similar
> performance.
>
> In all probability customer has a data warehouse that use Tableau or
> QlikView or similar. Their BAs will carry on using these tools. If they
> have data scientist then they will either use R that has in built UI or can
> use Spark sql with Zeppelin. Also one can fire Zeppelin on each node of
> Spark or even on the same node with different Port. Then of coursed one has
> to think about adequate response in a concurrent environment.
>
> Cheers
>
>
>
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> <https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>
>
>
> http://talebzadehmich.wordpress.com
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
> On 18 September 2016 at 08:52, Sean Owen <so...@cloudera.com> wrote:
>
>> Alluxio isn't a database though; it's storage. I may be still harping
>> on the wrong solution for you, but as we discussed offline, that's
>> also what Impala, Drill et al are for.
>>
>> Sorry if this was mentioned before but Ignite is what GridGain became,
>> if that helps.
>>
>> On Sat, Sep 17, 2016 at 11:00 PM, Mich Talebzadeh
>> <mi...@gmail.com> wrote:
>> > Thanks Todd
>> >
>> > As I thought Apache Ignite is a data fabric much like Oracle Coherence
>> cache
>> > or HazelCast.
>> >
>> > The use case is different between an in-memory-database (IMDB) and Data
>> > Fabric. The build that I am dealing with has a 'database centric' view
>> of
>> > its data (i.e. it accesses its data using Spark sql and JDBC) so an
>> > in-memory database will be a better fit. On the other hand If the
>> > application deals solely with Java objects and does not have any notion
>> of a
>> > 'database', does not need SQL style queries and really just wants a
>> > distributed, high performance object storage grid, then I think Ignite
>> would
>> > likely be the preferred choice.
>> >
>> > So will likely go if needed for an in-memory database like Alluxio. I
>> have
>> > seen a rather debatable comparison between Spark and Ignite that looks
>> to be
>> > like a one sided rant.
>> >
>> > HTH
>> >
>> >
>> >
>> >
>> > Dr Mich Talebzadeh
>> >
>> >
>> >
>> > LinkedIn
>> > https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJ
>> d6zP6AcPCCdOABUrV8Pw
>> >
>> >
>> >
>> > http://talebzadehmich.wordpress.com
>> >
>> >
>> > Disclaimer: Use it at your own risk. Any and all responsibility for any
>> > loss, damage or destruction of data or any other property which may
>> arise
>> > from relying on this email's technical content is explicitly
>> disclaimed. The
>> > author will in no case be liable for any monetary damages arising from
>> such
>> > loss, damage or destruction.
>> >
>> >
>> >
>> >
>>
>
>

Re: Is there such a thing as cache fusion with the underlying tables/files on HDFS

Posted by Jörn Franke <jo...@gmail.com>.
Ignite has a special cache for HDFS data (which is not a Java cache), for RDDs, etc. So you are right; in this sense it is very different.

Besides caching, what I see from data scientists is that for interactive queries and model evaluation they do not browse the complete data anyway. Even with in-memory solutions this is painfully slow if you receive several TB of data per hour.

What they do is sampling, e.g. select a relevant small subset of the data, evaluate several different models on the sampled data in "real time", and then calculate the winning model as a batch job later.

Additionally, probabilistic data structures are employed in some cases. For example, if you want to count the number of unique viewers of a web site, it does not make sense to browse through the logs for user IDs all the time; instead you employ a HyperLogLog structure, which needs little memory and can be accessed in real time.

For visualizations, I think in the area of big data it also makes a lot of sense to visualize aggregations based on sampling. If you really need the last 0.0001% of precision, then you can click on the visualization and the system takes some time to calculate it.
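
(To make the sampling and HyperLogLog points concrete, a minimal sketch for a recent Spark 2.x; the path and the userid column are placeholders, approx_count_distinct is Spark SQL's HyperLogLog-based estimator, and spark is the usual SparkSession.)

    import org.apache.spark.sql.functions.approx_count_distinct

    // Sketch only: placeholder DataFrame of web logs with a "userid" column.
    val logs = spark.read.parquet("hdfs://namenode:9000/data/weblogs")

    // 1) Sample a small subset for interactive model evaluation.
    val sample = logs.sample(withReplacement = false, fraction = 0.01, seed = 42)
    sample.cache()

    // 2) HyperLogLog-style approximate distinct count instead of scanning every user ID.
    logs.agg(approx_count_distinct("userid", 0.01).as("approx_unique_viewers")).show()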

> On 18 Sep 2016, at 10:54, Mich Talebzadeh <mi...@gmail.com> wrote:
> 
> Thanks everyone for ideas.
> 
> Sounds like Ignite has been taken by GridGain  so becomes similar to HazelCast open source by name only. However, an in-memory Java Cache may or may not help.
> 
> The other options like faster databases are on the table depending who wants what (that are normally decisions that includes more than technical criteria). Example if the customer already had Tableau, persuading them to go for QlickView (as an example) may not work.
> 
> So my view is to build the batch layer foundation and leave these finer choices to the customer. We will offer Zeppelin with Parquet and ORC with a certain refresh of these tables and let the customer decide. I stand corrected otherwise.
> 
> BTW I did these simple test on using Zeppelin (running on Spark Standalone mode)
> 
> 1) Read data using Spark sql from Flume text files on HDFS (real time)
> 2) Read data using Spark sql from ORC table in Hive (lagging by 15 min)
> 3) Read data using Spark sql from Parquet table in Hive(lagging by 15 min)
> 
> Timings
> 
> 1)            2 min, 16 sec
> 2)            1 min, 1 sec 
> 3)            1 min, 6 sec
> 
> So unless one splits the atom, ORC or Parquet on Hive look similar performance.
> 
> In all probability customer has a data warehouse that use Tableau or QlikView or similar. Their BAs will carry on using these tools. If they have data scientist then they will either use R that has in built UI or can use Spark sql with Zeppelin. Also one can fire Zeppelin on each node of Spark or even on the same node with different Port. Then of coursed one has to think about adequate response in a concurrent environment.
> 
> Cheers
> 
> 
> 
> 
> Dr Mich Talebzadeh
>  
> LinkedIn  https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>  
> http://talebzadehmich.wordpress.com
> 
> Disclaimer: Use it at your own risk. Any and all responsibility for any loss, damage or destruction of data or any other property which may arise from relying on this email's technical content is explicitly disclaimed. The author will in no case be liable for any monetary damages arising from such loss, damage or destruction.
>  
> 
>> On 18 September 2016 at 08:52, Sean Owen <so...@cloudera.com> wrote:
>> Alluxio isn't a database though; it's storage. I may be still harping
>> on the wrong solution for you, but as we discussed offline, that's
>> also what Impala, Drill et al are for.
>> 
>> Sorry if this was mentioned before but Ignite is what GridGain became,
>> if that helps.
>> 
>> On Sat, Sep 17, 2016 at 11:00 PM, Mich Talebzadeh
>> <mi...@gmail.com> wrote:
>> > Thanks Todd
>> >
>> > As I thought Apache Ignite is a data fabric much like Oracle Coherence cache
>> > or HazelCast.
>> >
>> > The use case is different between an in-memory-database (IMDB) and Data
>> > Fabric. The build that I am dealing with has a 'database centric' view of
>> > its data (i.e. it accesses its data using Spark sql and JDBC) so an
>> > in-memory database will be a better fit. On the other hand If the
>> > application deals solely with Java objects and does not have any notion of a
>> > 'database', does not need SQL style queries and really just wants a
>> > distributed, high performance object storage grid, then I think Ignite would
>> > likely be the preferred choice.
>> >
>> > So will likely go if needed for an in-memory database like Alluxio. I have
>> > seen a rather debatable comparison between Spark and Ignite that looks to be
>> > like a one sided rant.
>> >
>> > HTH
>> >
>> >
>> >
>> >
>> > Dr Mich Talebzadeh
>> >
>> >
>> >
>> > LinkedIn
>> > https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>> >
>> >
>> >
>> > http://talebzadehmich.wordpress.com
>> >
>> >
>> > Disclaimer: Use it at your own risk. Any and all responsibility for any
>> > loss, damage or destruction of data or any other property which may arise
>> > from relying on this email's technical content is explicitly disclaimed. The
>> > author will in no case be liable for any monetary damages arising from such
>> > loss, damage or destruction.
>> >
>> >
>> >
>> >
> 

Re: Is there such a thing as cache fusion with the underlying tables/files on HDFS

Posted by Mich Talebzadeh <mi...@gmail.com>.
Thanks everyone for ideas.

Sounds like Ignite has been taken over by GridGain, so it becomes similar to
Hazelcast open source in name only. However, an in-memory Java cache may or
may not help.

The other options, like faster databases, are on the table depending on who
wants what (those are normally decisions that include more than technical
criteria). For example, if the customer already has Tableau, persuading them
to go for QlikView (as an example) may not work.

So my view is to build the batch layer foundation and leave these finer
choices to the customer. We will offer Zeppelin with Parquet and ORC, with a
certain refresh interval for these tables, and let the customer decide. I
stand to be corrected otherwise.

BTW I did these simple tests using Zeppelin (running in Spark Standalone
mode):

1) Read data using Spark SQL from Flume text files on HDFS (real time)
2) Read data using Spark SQL from an ORC table in Hive (lagging by 15 min)
3) Read data using Spark SQL from a Parquet table in Hive (lagging by 15 min)

Timings

1)            2 min, 16 sec
2)            1 min, 1 sec
3)            1 min, 6 sec

So unless one splits the atom, ORC and Parquet on Hive show similar
performance.
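
(For what it is worth, the three reads were along these lines; a sketch only,
with placeholder paths and table names, assuming a Spark 2.x session where
spark.read.csv is available — on 1.x the spark-csv package would be used
instead.)

    // 1) Spark SQL over the raw Flume text files (CSV), near real time.
    val raw = spark.read
      .option("header", "false")
      .option("inferSchema", "true")
      .csv("hdfs://namenode:9000/data/prices/flume")
    raw.createOrReplaceTempView("prices_raw")
    spark.sql("SELECT COUNT(*) FROM prices_raw").show()

    // 2) Spark SQL over the ORC-backed Hive table (lagging by ~15 min).
    spark.sql("SELECT COUNT(*) FROM test.prices_orc").show()

    // 3) Spark SQL over the Parquet-backed Hive table (lagging by ~15 min).
    spark.sql("SELECT COUNT(*) FROM test.prices_parquet").show()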

In all probability the customer has a data warehouse that uses Tableau or
QlikView or similar. Their BAs will carry on using these tools. If they have
data scientists, then they will either use R, which has a built-in UI, or use
Spark SQL with Zeppelin. Also, one can fire up Zeppelin on each Spark node,
or even on the same node with a different port. Then of course one has to
think about adequate response times in a concurrent environment.

Cheers




Dr Mich Talebzadeh



LinkedIn  https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw



http://talebzadehmich.wordpress.com


Disclaimer: Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.



On 18 September 2016 at 08:52, Sean Owen <so...@cloudera.com> wrote:

> Alluxio isn't a database though; it's storage. I may be still harping
> on the wrong solution for you, but as we discussed offline, that's
> also what Impala, Drill et al are for.
>
> Sorry if this was mentioned before but Ignite is what GridGain became,
> if that helps.
>
> On Sat, Sep 17, 2016 at 11:00 PM, Mich Talebzadeh
> <mi...@gmail.com> wrote:
> > Thanks Todd
> >
> > As I thought Apache Ignite is a data fabric much like Oracle Coherence
> cache
> > or HazelCast.
> >
> > The use case is different between an in-memory-database (IMDB) and Data
> > Fabric. The build that I am dealing with has a 'database centric' view of
> > its data (i.e. it accesses its data using Spark sql and JDBC) so an
> > in-memory database will be a better fit. On the other hand If the
> > application deals solely with Java objects and does not have any notion
> of a
> > 'database', does not need SQL style queries and really just wants a
> > distributed, high performance object storage grid, then I think Ignite
> would
> > likely be the preferred choice.
> >
> > So will likely go if needed for an in-memory database like Alluxio. I
> have
> > seen a rather debatable comparison between Spark and Ignite that looks
> to be
> > like a one sided rant.
> >
> > HTH
> >
> >
> >
> >
> > Dr Mich Talebzadeh
> >
> >
> >
> > LinkedIn
> > https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCd
> OABUrV8Pw
> >
> >
> >
> > http://talebzadehmich.wordpress.com
> >
> >
> > Disclaimer: Use it at your own risk. Any and all responsibility for any
> > loss, damage or destruction of data or any other property which may arise
> > from relying on this email's technical content is explicitly disclaimed.
> The
> > author will in no case be liable for any monetary damages arising from
> such
> > loss, damage or destruction.
> >
> >
> >
> >
>

Re: Is there such a thing as cache fusion with the underlying tables/files on HDFS

Posted by Sean Owen <so...@cloudera.com>.
Alluxio isn't a database though; it's storage. I may still be harping
on the wrong solution for you, but as we discussed offline, that's
also what Impala, Drill et al are for.

Sorry if this was mentioned before but Ignite is what GridGain became,
if that helps.

On Sat, Sep 17, 2016 at 11:00 PM, Mich Talebzadeh
<mi...@gmail.com> wrote:
> Thanks Todd
>
> As I thought Apache Ignite is a data fabric much like Oracle Coherence cache
> or HazelCast.
>
> The use case is different between an in-memory-database (IMDB) and Data
> Fabric. The build that I am dealing with has a 'database centric' view of
> its data (i.e. it accesses its data using Spark sql and JDBC) so an
> in-memory database will be a better fit. On the other hand If the
> application deals solely with Java objects and does not have any notion of a
> 'database', does not need SQL style queries and really just wants a
> distributed, high performance object storage grid, then I think Ignite would
> likely be the preferred choice.
>
> So will likely go if needed for an in-memory database like Alluxio. I have
> seen a rather debatable comparison between Spark and Ignite that looks to be
> like a one sided rant.
>
> HTH
>
>
>
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn
> https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>
>
>
> http://talebzadehmich.wordpress.com
>
>
> Disclaimer: Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed. The
> author will in no case be liable for any monetary damages arising from such
> loss, damage or destruction.
>
>
>
>

---------------------------------------------------------------------
To unsubscribe e-mail: user-unsubscribe@spark.apache.org


Re: Is there such a thing as cache fusion with the underlying tables/files on HDFS

Posted by Mich Talebzadeh <mi...@gmail.com>.
Thanks Todd

As I thought, Apache Ignite is a data fabric, much like Oracle Coherence
cache or Hazelcast.

The use cases for an in-memory database (IMDB) and a data fabric are
different. The build that I am dealing with has a 'database-centric' view of
its data (i.e. it accesses its data using Spark SQL and JDBC), so an
in-memory database will be a better fit. On the other hand, if the
application deals solely with Java objects, does not have any notion of
a 'database', does not need SQL-style queries, and really just wants a
distributed, high-performance object storage grid, then I think Ignite would
likely be the preferred choice.

So, if needed, I will likely go for an in-memory database like Alluxio. I
have seen a rather debatable comparison between Spark and Ignite
<http://drcos.boudnik.org/2015/04/apache-ignite-vs-apache-spark.html> that
looks like a one-sided rant.

HTH



Dr Mich Talebzadeh



LinkedIn  https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw



http://talebzadehmich.wordpress.com


Disclaimer: Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.



On 17 September 2016 at 20:53, Mich Talebzadeh <mi...@gmail.com>
wrote:

> Thanks Todd.
>
> I will have a look.
>
> Regards
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> <https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>
>
>
> http://talebzadehmich.wordpress.com
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
> On 17 September 2016 at 20:45, Todd Nist <ts...@gmail.com> wrote:
>
>> Hi Mich,
>>
>> Have you looked at Apache Ignite?  https://apacheignite-fs.readme.io/docs.
>>
>>
>> This looks like something that may be what your looking for:
>>
>> http://apacheignite.gridgain.org/docs/data-analysis-with-apache-zeppelin
>>
>> HTH.
>>
>> -Todd
>>
>>
>> On Sat, Sep 17, 2016 at 12:53 PM, Mich Talebzadeh <
>> mich.talebzadeh@gmail.com> wrote:
>>
>>> Hi,
>>>
>>> I am seeing similar issues when I was working on Oracle with Tableau as
>>> the dashboard.
>>>
>>> Currently I have a batch layer that gets streaming data from
>>>
>>> source -> Kafka -> Flume -> HDFS
>>>
>>> It stored on HDFS as text files and a cron process sinks Hive table with
>>> the the external table build on the directory. I tried both ORC and Parquet
>>> but I don't think the query itself is the issue.
>>>
>>> Meaning it does not matter how clever your execution engine is, the fact
>>> you still have to do  considerable amount of Physical IO (PIO) as opposed
>>> to Logical IO (LIO) to get the data to Zeppelin is on the critical path.
>>>
>>> One option is to limit the amount of data in Zeppelin to certain number
>>> of rows or something similar. However, you cannot tell a user he/she cannot
>>> see the full data.
>>>
>>> We resolved this with Oracle by using Oracle TimesTen
>>> <http://www.oracle.com/technetwork/database/database-technologies/timesten/overview/index.html>IMDB
>>> to cache certain tables in memory and get them refreshed (depending on
>>> refresh frequency) from the underlying table in Oracle when data is
>>> updated). That is done through cache fusion.
>>>
>>> I was looking around and came across Alluxio <http://www.alluxio.org/>.
>>> Ideally I like to utilise such concept like TimesTen. Can one distribute
>>> Hive table data (or any table data) across the nodes cached. In that case
>>> we will be doing Logical IO which is about 20 times or more lightweight
>>> compared to Physical IO.
>>>
>>> Anyway this is the concept.
>>>
>>> Thanks
>>>
>>> Dr Mich Talebzadeh
>>>
>>>
>>>
>>> LinkedIn * https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>> <https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>>>
>>>
>>>
>>> http://talebzadehmich.wordpress.com
>>>
>>>
>>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>>> any loss, damage or destruction of data or any other property which may
>>> arise from relying on this email's technical content is explicitly
>>> disclaimed. The author will in no case be liable for any monetary damages
>>> arising from such loss, damage or destruction.
>>>
>>>
>>>
>>
>>
>

Re: Is there such a thing as cache fusion with the underlying tables/files on HDFS

Posted by Mich Talebzadeh <mi...@gmail.com>.
Thanks Todd.

I will have a look.

Regards

Dr Mich Talebzadeh



LinkedIn  https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw



http://talebzadehmich.wordpress.com


Disclaimer: Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.



On 17 September 2016 at 20:45, Todd Nist <ts...@gmail.com> wrote:

> Hi Mich,
>
> Have you looked at Apache Ignite?  https://apacheignite-fs.readme.io/docs.
>
>
> This looks like something that may be what your looking for:
>
> http://apacheignite.gridgain.org/docs/data-analysis-with-apache-zeppelin
>
> HTH.
>
> -Todd
>
>
> On Sat, Sep 17, 2016 at 12:53 PM, Mich Talebzadeh <
> mich.talebzadeh@gmail.com> wrote:
>
>> Hi,
>>
>> I am seeing similar issues when I was working on Oracle with Tableau as
>> the dashboard.
>>
>> Currently I have a batch layer that gets streaming data from
>>
>> source -> Kafka -> Flume -> HDFS
>>
>> It stored on HDFS as text files and a cron process sinks Hive table with
>> the the external table build on the directory. I tried both ORC and Parquet
>> but I don't think the query itself is the issue.
>>
>> Meaning it does not matter how clever your execution engine is, the fact
>> you still have to do  considerable amount of Physical IO (PIO) as opposed
>> to Logical IO (LIO) to get the data to Zeppelin is on the critical path.
>>
>> One option is to limit the amount of data in Zeppelin to certain number
>> of rows or something similar. However, you cannot tell a user he/she cannot
>> see the full data.
>>
>> We resolved this with Oracle by using Oracle TimesTen
>> <http://www.oracle.com/technetwork/database/database-technologies/timesten/overview/index.html>IMDB
>> to cache certain tables in memory and get them refreshed (depending on
>> refresh frequency) from the underlying table in Oracle when data is
>> updated). That is done through cache fusion.
>>
>> I was looking around and came across Alluxio <http://www.alluxio.org/>.
>> Ideally I like to utilise such concept like TimesTen. Can one distribute
>> Hive table data (or any table data) across the nodes cached. In that case
>> we will be doing Logical IO which is about 20 times or more lightweight
>> compared to Physical IO.
>>
>> Anyway this is the concept.
>>
>> Thanks
>>
>> Dr Mich Talebzadeh
>>
>>
>>
>> LinkedIn * https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>> <https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>>
>>
>>
>> http://talebzadehmich.wordpress.com
>>
>>
>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>> any loss, damage or destruction of data or any other property which may
>> arise from relying on this email's technical content is explicitly
>> disclaimed. The author will in no case be liable for any monetary damages
>> arising from such loss, damage or destruction.
>>
>>
>>
>
>

Re: Is there such a thing as cache fusion with the underlying tables/files on HDFS

Posted by Todd Nist <ts...@gmail.com>.
Hi Mich,

Have you looked at Apache Ignite?  https://apacheignite-fs.readme.io/docs.

This looks like something that may be what you're looking for:

http://apacheignite.gridgain.org/docs/data-analysis-with-apache-zeppelin

HTH.

-Todd


On Sat, Sep 17, 2016 at 12:53 PM, Mich Talebzadeh <mich.talebzadeh@gmail.com
> wrote:

> Hi,
>
> I am seeing similar issues when I was working on Oracle with Tableau as
> the dashboard.
>
> Currently I have a batch layer that gets streaming data from
>
> source -> Kafka -> Flume -> HDFS
>
> It stored on HDFS as text files and a cron process sinks Hive table with
> the the external table build on the directory. I tried both ORC and Parquet
> but I don't think the query itself is the issue.
>
> Meaning it does not matter how clever your execution engine is, the fact
> you still have to do  considerable amount of Physical IO (PIO) as opposed
> to Logical IO (LIO) to get the data to Zeppelin is on the critical path.
>
> One option is to limit the amount of data in Zeppelin to certain number of
> rows or something similar. However, you cannot tell a user he/she cannot
> see the full data.
>
> We resolved this with Oracle by using Oracle TimesTen
> <http://www.oracle.com/technetwork/database/database-technologies/timesten/overview/index.html>IMDB
> to cache certain tables in memory and get them refreshed (depending on
> refresh frequency) from the underlying table in Oracle when data is
> updated). That is done through cache fusion.
>
> I was looking around and came across Alluxio <http://www.alluxio.org/>.
> Ideally I like to utilise such concept like TimesTen. Can one distribute
> Hive table data (or any table data) across the nodes cached. In that case
> we will be doing Logical IO which is about 20 times or more lightweight
> compared to Physical IO.
>
> Anyway this is the concept.
>
> Thanks
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> <https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>
>
>
> http://talebzadehmich.wordpress.com
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>

Re: Is there such a thing as cache fusion with the underlying tables/files on HDFS

Posted by Jörn Franke <jo...@gmail.com>.
In Tableau you can use the in-memory facilities of the Tableau server.

As said, Apache Ignite could be one way. You can also use it to make Hive tables in-memory. While reducing IO can make sense, I do not think you will see so much difference in production systems (at least not 20x). If the data is processed in parallel, then the IO will be done in parallel thanks to the architecture of HDFS; Oracle Exadata exploits similar concepts. The advantage of Ignite compared to, e.g., Exadata would be that you also have the indexes of ORC and Parquet in memory, which avoids reading data into memory that is not needed for the query.
That being said, even if you use in-memory caching, it still makes sense to have some data pre-aggregated/pre-calculated for the users based on their needs.
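
(For illustration, a minimal sketch of that kind of pre-aggregation done in Spark and written back to Hive for the serving layer; the table and column names are placeholders, and spark is the usual SparkSession.)

    import org.apache.spark.sql.functions.{avg, count, lit, max}

    // Sketch only: pre-aggregate once in the batch layer so dashboards read a
    // small summary table instead of scanning the raw data.
    val agg = spark.table("test.prices_external")
      .groupBy("ticker")
      .agg(count(lit(1)).as("trades"),
           avg("price").as("avg_price"),
           max("price").as("max_price"))

    agg.write.mode("overwrite").format("orc").saveAsTable("test.prices_agg")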

> On 17 Sep 2016, at 18:53, Mich Talebzadeh <mi...@gmail.com> wrote:
> 
> Hi,
> 
> I am seeing similar issues when I was working on Oracle with Tableau as the dashboard.
> 
> Currently I have a batch layer that gets streaming data from
> 
> source -> Kafka -> Flume -> HDFS
> 
> It stored on HDFS as text files and a cron process sinks Hive table with the the external table build on the directory. I tried both ORC and Parquet but I don't think the query itself is the issue.
> 
> Meaning it does not matter how clever your execution engine is, the fact you still have to do  considerable amount of Physical IO (PIO) as opposed to Logical IO (LIO) to get the data to Zeppelin is on the critical path.
> 
> One option is to limit the amount of data in Zeppelin to certain number of rows or something similar. However, you cannot tell a user he/she cannot see the full data.
> 
> We resolved this with Oracle by using Oracle TimesTen IMDB to cache certain tables in memory and get them refreshed (depending on refresh frequency) from the underlying table in Oracle when data is updated). That is done through cache fusion.
> 
> I was looking around and came across Alluxio. Ideally I like to utilise such concept like TimesTen. Can one distribute Hive table data (or any table data) across the nodes cached. In that case we will be doing Logical IO which is about 20 times or more lightweight compared to Physical IO.
> 
> Anyway this is the concept.
> 
> Thanks
> 
> Dr Mich Talebzadeh
>  
> LinkedIn  https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>  
> http://talebzadehmich.wordpress.com
> 
> Disclaimer: Use it at your own risk. Any and all responsibility for any loss, damage or destruction of data or any other property which may arise from relying on this email's technical content is explicitly disclaimed. The author will in no case be liable for any monetary damages arising from such loss, damage or destruction.
>  

Re: Is there such a thing as cache fusion with the underlying tables/files on HDFS

Posted by Gene Pang <ge...@gmail.com>.
Hi Mich,

While Alluxio is not a database (it exposes a file system interface), you
can use Alluxio to keep certain data in memory. With Alluxio, you can
selectively pin data in memory
(http://www.alluxio.org/docs/master/en/Command-Line-Interface.html#pin).
There are also ways to control how to read and write the data in Alluxio
memory (http://www.alluxio.org/docs/master/en/File-System-API.html). These
options and features can help you control how you access your data.
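
(For what it is worth, a minimal sketch of how that could look from Spark,
assuming the Alluxio client jar is on the Spark classpath; the pin path,
master host and port below are placeholders.)

    // Sketch only. Outside Spark, the data can first be pinned in Alluxio
    // memory with the CLI, e.g.:
    //   bin/alluxio fs pin /data/prices
    // (see the Command-Line-Interface doc linked above; the path is a placeholder)

    // Spark then reads through the alluxio:// URI instead of hdfs://, so
    // repeated queries are served from memory. Master host/port are placeholders.
    val prices = spark.read.csv("alluxio://alluxio-master:19998/data/prices")
    prices.createOrReplaceTempView("prices_alluxio")
    spark.sql("SELECT COUNT(*) FROM prices_alluxio").show()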

Thanks,
Gene

On Sat, Sep 17, 2016 at 9:53 AM, Mich Talebzadeh <mi...@gmail.com>
wrote:

> Hi,
>
> I am seeing similar issues when I was working on Oracle with Tableau as
> the dashboard.
>
> Currently I have a batch layer that gets streaming data from
>
> source -> Kafka -> Flume -> HDFS
>
> It stored on HDFS as text files and a cron process sinks Hive table with
> the the external table build on the directory. I tried both ORC and Parquet
> but I don't think the query itself is the issue.
>
> Meaning it does not matter how clever your execution engine is, the fact
> you still have to do  considerable amount of Physical IO (PIO) as opposed
> to Logical IO (LIO) to get the data to Zeppelin is on the critical path.
>
> One option is to limit the amount of data in Zeppelin to certain number of
> rows or something similar. However, you cannot tell a user he/she cannot
> see the full data.
>
> We resolved this with Oracle by using Oracle TimesTen
> <http://www.oracle.com/technetwork/database/database-technologies/timesten/overview/index.html>IMDB
> to cache certain tables in memory and get them refreshed (depending on
> refresh frequency) from the underlying table in Oracle when data is
> updated). That is done through cache fusion.
>
> I was looking around and came across Alluxio <http://www.alluxio.org/>.
> Ideally I like to utilise such concept like TimesTen. Can one distribute
> Hive table data (or any table data) across the nodes cached. In that case
> we will be doing Logical IO which is about 20 times or more lightweight
> compared to Physical IO.
>
> Anyway this is the concept.
>
> Thanks
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> <https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>
>
>
> http://talebzadehmich.wordpress.com
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>