Posted to user@spark.apache.org by janardhan shetty <ja...@gmail.com> on 2016/07/26 02:09:24 UTC

ORC v/s Parquet for Spark 2.0

Just wondering about the advantages and disadvantages of converting data
into ORC or Parquet.

In the Spark documentation there are numerous examples of the Parquet
format.

Any strong reasons to choose Parquet over the ORC file format?

Also: the current data compression is bzip2.

http://stackoverflow.com/questions/32373460/parquet-vs-orc-vs-orc-with-snappy
This seems biased.
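
For reference, the kind of conversion I have in mind is roughly this (a
minimal sketch with Spark 2.0's DataFrame API; the paths, options and the
bzip2-compressed CSV input are only assumptions for illustration):

// Scala sketch; everything here (paths, schema options) is hypothetical.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("to-columnar").getOrCreate()

// Hadoop's bzip2 codec lets Spark read .bz2 text/CSV files transparently
val raw = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv("hdfs:///data/raw/events/*.csv.bz2")

// columnar output, either as snappy-compressed Parquet ...
raw.write.option("compression", "snappy").parquet("hdfs:///data/events_parquet")

// ... or as ORC (zlib compression by default)
raw.write.orc("hdfs:///data/events_orc")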

Re:Re: ORC v/s Parquet for Spark 2.0

Posted by prosp4300 <pr...@163.com>.
Thanks for this immediate correction :)


On 2016-07-27 15:17:54, "Gourav Sengupta" <go...@gmail.com> wrote:

Sorry, 


in my email above I was referring to KUDU, and there it goes: how can KUDU be right if it is first mentioned in forums with a wrong spelling? It's had a difficult beginning, with people still trying to figure out its name.




Regards,
Gourav Sengupta


On Wed, Jul 27, 2016 at 8:15 AM, Gourav Sengupta <go...@gmail.com> wrote:

Gosh,


whether ORC came from this or that, it runs queries in HIVE with TEZ at a speed that is better than SPARK.


Has anyone heard of KUDA? It's better than Parquet. But I think that someone might just start saying that KUDA has a difficult lineage as well. After all, dynastic rules dictate.


Personally I feel that if something stores my data compressed and lets me access it faster, I do not care where it comes from or how difficult the childbirth was :)




Regards,
Gourav


On Tue, Jul 26, 2016 at 11:19 PM, Sudhir Babu Pothineni <sb...@gmail.com> wrote:

Just a correction:


The ORC Java libraries from Hive have been forked into Apache ORC. Vectorization is the default.


Does anyone know whether Spark is leveraging this new repo?


<dependency>
    <groupId>org.apache.orc</groupId>
    <artifactId>orc</artifactId>
    <version>1.1.2</version>
    <type>pom</type>
</dependency>
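
Either way, from the Spark API side it is just the DataFrame reader/writer; a minimal sketch of reading and writing ORC (Scala; paths are hypothetical, and as far as I know Spark 1.x/2.0 still uses Hive's ORC classes under the hood rather than this new artifact):

// Scala / spark-shell sketch; paths are hypothetical.
// Note: in Spark 1.x/2.0 the ORC data source lives in the spark-hive module,
// so Spark needs to be built with Hive support for this to work.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("orc-roundtrip")
  .enableHiveSupport()
  .getOrCreate()

val df = spark.read.orc("hdfs:///warehouse/events_orc")        // read existing ORC files
df.printSchema()
df.write.mode("overwrite").orc("hdfs:///tmp/events_orc_copy")  // write them back out as ORC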













Sent from my iPhone
On Jul 26, 2016, at 4:50 PM, Koert Kuipers <ko...@tresata.com> wrote:


parquet was inspired by dremel but written from the ground up as a library with support for a variety of big data systems (hive, pig, impala, cascading, etc.). it is also easy to add new support, since it's a proper library.


orc has been enhanced while deployed at facebook in hive and at yahoo in hive. just hive. it didn't really exist by itself. it was part of the big java soup that is called hive, without an easy way to extract it. hive does not expose proper java apis. it never cared for that.



On Tue, Jul 26, 2016 at 9:57 AM, Ovidiu-Cristian MARCU <ov...@inria.fr> wrote:

Interesting opinion, thank you


Still, according to the websites, parquet is basically inspired by Dremel (Google) [1] and part of orc has been enhanced while deployed at Facebook and Yahoo [2].


Other than this presentation [3], do you guys know of any other benchmark?


[1]https://parquet.apache.org/documentation/latest/
[2]https://orc.apache.org/docs/
[3] http://www.slideshare.net/oom65/file-format-benchmarks-avro-json-orc-parquet


On 26 Jul 2016, at 15:19, Koert Kuipers <ko...@tresata.com> wrote:



when parquet came out it was developed by a community of companies, and was designed as a library to be supported by multiple big data projects. nice

orc on the other hand initially only supported hive. it wasn't even designed as a library that can be re-used. even today it brings in the kitchen sink of transitive dependencies. yikes



On Jul 26, 2016 5:09 AM, "Jörn Franke" <jo...@gmail.com> wrote:

I think both are very similar, but with slightly different goals. While they work transparently for each Hadoop application, you need to enable specific support in the application for predicate push-down.
In the end you have to check which application you are using and do some tests (with the correct predicate push-down configuration). Keep in mind that both formats work best if they are sorted on the filter columns (which is your responsibility) and if their optimizations are correctly configured (min/max index, bloom filter, compression etc.).


If you need to ingest sensor data, you may want to store it first in HBase and then batch-process it into large files in ORC or Parquet format.
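
To make the push-down point concrete, a minimal Spark sketch (column and path names are made up; the two config keys are the Spark SQL settings I am aware of for this):

// Scala sketch of the predicate push-down point above; names are hypothetical.
import spark.implicits._

// make sure push-down is enabled (Parquet's is on by default, ORC's is not in Spark 1.x/2.0)
spark.conf.set("spark.sql.parquet.filterPushdown", "true")
spark.conf.set("spark.sql.orc.filterPushdown", "true")

// toy "sensor" data; in practice this would come from your ingestion pipeline
val sensorDf = spark.range(1000000L)
  .selectExpr("id % 100 as sensor_id", "rand() as reading")

// write sorted on the column you will later filter on, so min/max statistics can skip data
sensorDf.sort("sensor_id").write.mode("overwrite").parquet("hdfs:///data/sensors_parquet")

// check the physical plan for "PushedFilters" to confirm the predicate reached the reader
spark.read.parquet("hdfs:///data/sensors_parquet")
  .filter($"sensor_id" === 42)
  .explain()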

On 26 Jul 2016, at 04:09, janardhan shetty <ja...@gmail.com> wrote:


Just wondering advantages and disadvantages to convert data into ORC or Parquet.


In the documentation of Spark there are numerous examples of Parquet format.



Any strong reasons to chose Parquet over ORC file format ?


Also : current data compression is bzip2



http://stackoverflow.com/questions/32373460/parquet-vs-orc-vs-orc-with-snappy
This seems like biased.









Re: ORC v/s Parquet for Spark 2.0

Posted by Gourav Sengupta <go...@gmail.com>.
Sorry,

in my email above I was referring to KUDU, and there it goes: how can KUDU
be right if it is first mentioned in forums with a wrong spelling? It's had
a difficult beginning, with people still trying to figure out its name.


Regards,
Gourav Sengupta

On Wed, Jul 27, 2016 at 8:15 AM, Gourav Sengupta <go...@gmail.com>
wrote:

> Gosh,
>
> whether ORC came from this or that, it runs queries in HIVE with TEZ at a
> speed that is better than SPARK.
>
> Has anyone heard of KUDA? Its better than Parquet. But I think that
> someone might just start saying that KUDA has difficult lineage as well.
> After all dynastic rules dictate.
>
> Personally I feel that if something stores my data compressed and makes me
> access it faster I do not care where it comes from or how difficult the
> child birth was :)
>
>
> Regards,
> Gourav
>
> On Tue, Jul 26, 2016 at 11:19 PM, Sudhir Babu Pothineni <
> sbpothineni@gmail.com> wrote:
>
>> Just correction:
>>
>> ORC Java libraries from Hive are forked into Apache ORC. Vectorization
>> default.
>>
>> Do not know If Spark leveraging this new repo?
>>
>> <dependency>
>>  <groupId>org.apache.orc</groupId>
>>     <artifactId>orc</artifactId>
>>     <version>1.1.2</version>
>>     <type>pom</type>
>> </dependency>
>>
>>
>>
>>
>>
>>
>>
>>
>> Sent from my iPhone
>> On Jul 26, 2016, at 4:50 PM, Koert Kuipers <ko...@tresata.com> wrote:
>>
>> parquet was inspired by dremel but written from the ground up as a
>> library with support for a variety of big data systems (hive, pig, impala,
>> cascading, etc.). it is also easy to add new support, since its a proper
>> library.
>>
>> orc bas been enhanced while deployed at facebook in hive and at yahoo in
>> hive. just hive. it didn't really exist by itself. it was part of the big
>> java soup that is called hive, without an easy way to extract it. hive does
>> not expose proper java apis. it never cared for that.
>>
>> On Tue, Jul 26, 2016 at 9:57 AM, Ovidiu-Cristian MARCU <
>> ovidiu-cristian.marcu@inria.fr> wrote:
>>
>>> Interesting opinion, thank you
>>>
>>> Still, on the website parquet is basically inspired by Dremel (Google)
>>> [1] and part of orc has been enhanced while deployed for Facebook, Yahoo
>>> [2].
>>>
>>> Other than this presentation [3], do you guys know any other benchmark?
>>>
>>> [1]https://parquet.apache.org/documentation/latest/
>>> [2]https://orc.apache.org/docs/
>>> [3]
>>> http://www.slideshare.net/oom65/file-format-benchmarks-avro-json-orc-parquet
>>>
>>> On 26 Jul 2016, at 15:19, Koert Kuipers <ko...@tresata.com> wrote:
>>>
>>> when parquet came out it was developed by a community of companies, and
>>> was designed as a library to be supported by multiple big data projects.
>>> nice
>>>
>>> orc on the other hand initially only supported hive. it wasn't even
>>> designed as a library that can be re-used. even today it brings in the
>>> kitchen sink of transitive dependencies. yikes
>>>
>>> On Jul 26, 2016 5:09 AM, "Jörn Franke" <jo...@gmail.com> wrote:
>>>
>>>> I think both are very similar, but with slightly different goals. While
>>>> they work transparently for each Hadoop application you need to enable
>>>> specific support in the application for predicate push down.
>>>> In the end you have to check which application you are using and do
>>>> some tests (with correct predicate push down configuration). Keep in mind
>>>> that both formats work best if they are sorted on filter columns (which is
>>>> your responsibility) and if their optimatizations are correctly configured
>>>> (min max index, bloom filter, compression etc) .
>>>>
>>>> If you need to ingest sensor data you may want to store it first in
>>>> hbase and then batch process it in large files in Orc or parquet format.
>>>>
>>>> On 26 Jul 2016, at 04:09, janardhan shetty <ja...@gmail.com>
>>>> wrote:
>>>>
>>>> Just wondering advantages and disadvantages to convert data into ORC or
>>>> Parquet.
>>>>
>>>> In the documentation of Spark there are numerous examples of Parquet
>>>> format.
>>>>
>>>> Any strong reasons to chose Parquet over ORC file format ?
>>>>
>>>> Also : current data compression is bzip2
>>>>
>>>>
>>>> http://stackoverflow.com/questions/32373460/parquet-vs-orc-vs-orc-with-snappy
>>>> This seems like biased.
>>>>
>>>>
>>>
>>
>

Re: ORC v/s Parquet for Spark 2.0

Posted by ayan guha <gu...@gmail.com>.
Since everyone here is discussing this ever-evolving topic of storage
formats and serdes, any opinions/thoughts/experience with Apache Arrow? It
sounds like a nice idea, but how ready is it?

On Wed, Jul 27, 2016 at 11:31 PM, Jörn Franke <jo...@gmail.com> wrote:

> Kudu has been from my impression be designed to offer somethings between
> hbase and parquet for write intensive loads - it is not faster for
> warehouse type of querying compared to parquet (merely slower, because that
> is not its use case).   I assume this is still the strategy of it.
>
> For some scenarios it could make sense together with parquet and Orc.
> However I am not sure what the advantage towards using hbase + parquet and
> Orc.
>
> On 27 Jul 2016, at 11:47, "Uwe@Moosheimer.com <Uw...@moosheimer.com>" <
> Uwe@Moosheimer.com <Uw...@moosheimer.com>> wrote:
>
> Hi Gourav,
>
> Kudu (if you mean Apache Kuda, the Cloudera originated project) is a in
> memory db with data storage while Parquet is "only" a columnar
> storage format.
>
> As I understand, Kudu is a BI db to compete with Exasol or Hana (ok ...
> that's more a wish :-).
>
> Regards,
> Uwe
>
> Mit freundlichen Grüßen / best regards
> Kay-Uwe Moosheimer
>
> Am 27.07.2016 um 09:15 schrieb Gourav Sengupta <gourav.sengupta@gmail.com
> >:
>
> Gosh,
>
> whether ORC came from this or that, it runs queries in HIVE with TEZ at a
> speed that is better than SPARK.
>
> Has anyone heard of KUDA? Its better than Parquet. But I think that
> someone might just start saying that KUDA has difficult lineage as well.
> After all dynastic rules dictate.
>
> Personally I feel that if something stores my data compressed and makes me
> access it faster I do not care where it comes from or how difficult the
> child birth was :)
>
>
> Regards,
> Gourav
>
> On Tue, Jul 26, 2016 at 11:19 PM, Sudhir Babu Pothineni <
> sbpothineni@gmail.com> wrote:
>
>> Just correction:
>>
>> ORC Java libraries from Hive are forked into Apache ORC. Vectorization
>> default.
>>
>> Do not know If Spark leveraging this new repo?
>>
>> <dependency>
>>  <groupId>org.apache.orc</groupId>
>>     <artifactId>orc</artifactId>
>>     <version>1.1.2</version>
>>     <type>pom</type>
>> </dependency>
>>
>>
>>
>>
>>
>>
>>
>>
>> Sent from my iPhone
>> On Jul 26, 2016, at 4:50 PM, Koert Kuipers <ko...@tresata.com> wrote:
>>
>> parquet was inspired by dremel but written from the ground up as a
>> library with support for a variety of big data systems (hive, pig, impala,
>> cascading, etc.). it is also easy to add new support, since its a proper
>> library.
>>
>> orc bas been enhanced while deployed at facebook in hive and at yahoo in
>> hive. just hive. it didn't really exist by itself. it was part of the big
>> java soup that is called hive, without an easy way to extract it. hive does
>> not expose proper java apis. it never cared for that.
>>
>> On Tue, Jul 26, 2016 at 9:57 AM, Ovidiu-Cristian MARCU <
>> ovidiu-cristian.marcu@inria.fr> wrote:
>>
>>> Interesting opinion, thank you
>>>
>>> Still, on the website parquet is basically inspired by Dremel (Google)
>>> [1] and part of orc has been enhanced while deployed for Facebook, Yahoo
>>> [2].
>>>
>>> Other than this presentation [3], do you guys know any other benchmark?
>>>
>>> [1]https://parquet.apache.org/documentation/latest/
>>> [2]https://orc.apache.org/docs/
>>> [3]
>>> http://www.slideshare.net/oom65/file-format-benchmarks-avro-json-orc-parquet
>>>
>>> On 26 Jul 2016, at 15:19, Koert Kuipers <ko...@tresata.com> wrote:
>>>
>>> when parquet came out it was developed by a community of companies, and
>>> was designed as a library to be supported by multiple big data projects.
>>> nice
>>>
>>> orc on the other hand initially only supported hive. it wasn't even
>>> designed as a library that can be re-used. even today it brings in the
>>> kitchen sink of transitive dependencies. yikes
>>>
>>> On Jul 26, 2016 5:09 AM, "Jörn Franke" <jo...@gmail.com> wrote:
>>>
>>>> I think both are very similar, but with slightly different goals. While
>>>> they work transparently for each Hadoop application you need to enable
>>>> specific support in the application for predicate push down.
>>>> In the end you have to check which application you are using and do
>>>> some tests (with correct predicate push down configuration). Keep in mind
>>>> that both formats work best if they are sorted on filter columns (which is
>>>> your responsibility) and if their optimatizations are correctly configured
>>>> (min max index, bloom filter, compression etc) .
>>>>
>>>> If you need to ingest sensor data you may want to store it first in
>>>> hbase and then batch process it in large files in Orc or parquet format.
>>>>
>>>> On 26 Jul 2016, at 04:09, janardhan shetty <ja...@gmail.com>
>>>> wrote:
>>>>
>>>> Just wondering advantages and disadvantages to convert data into ORC or
>>>> Parquet.
>>>>
>>>> In the documentation of Spark there are numerous examples of Parquet
>>>> format.
>>>>
>>>> Any strong reasons to chose Parquet over ORC file format ?
>>>>
>>>> Also : current data compression is bzip2
>>>>
>>>>
>>>> http://stackoverflow.com/questions/32373460/parquet-vs-orc-vs-orc-with-snappy
>>>> This seems like biased.
>>>>
>>>>
>>>
>>
>


-- 
Best Regards,
Ayan Guha

Re: ORC v/s Parquet for Spark 2.0

Posted by Jörn Franke <jo...@gmail.com>.
Kudu has, from my impression, been designed to offer something between HBase and Parquet for write-intensive loads - it is not faster for warehouse-type querying compared to Parquet (merely slower, because that is not its use case). I assume this is still its strategy.

For some scenarios it could make sense together with Parquet and ORC. However, I am not sure what the advantage is over using HBase + Parquet and ORC.

> On 27 Jul 2016, at 11:47, "Uwe@Moosheimer.com" <Uw...@Moosheimer.com> wrote:
> 
> Hi Gourav,
> 
> Kudu (if you mean Apache Kuda, the Cloudera originated project) is a in memory db with data storage while Parquet is "only" a columnar storage format.
> 
> As I understand, Kudu is a BI db to compete with Exasol or Hana (ok ... that's more a wish :-).
> 
> Regards,
> Uwe
> 
> Mit freundlichen Grüßen / best regards
> Kay-Uwe Moosheimer
> 
>> Am 27.07.2016 um 09:15 schrieb Gourav Sengupta <go...@gmail.com>:
>> 
>> Gosh,
>> 
>> whether ORC came from this or that, it runs queries in HIVE with TEZ at a speed that is better than SPARK.
>> 
>> Has anyone heard of KUDA? Its better than Parquet. But I think that someone might just start saying that KUDA has difficult lineage as well. After all dynastic rules dictate.
>> 
>> Personally I feel that if something stores my data compressed and makes me access it faster I do not care where it comes from or how difficult the child birth was :)
>> 
>> 
>> Regards,
>> Gourav
>> 
>>> On Tue, Jul 26, 2016 at 11:19 PM, Sudhir Babu Pothineni <sb...@gmail.com> wrote:
>>> Just correction:
>>> 
>>> ORC Java libraries from Hive are forked into Apache ORC. Vectorization default. 
>>> 
>>> Do not know If Spark leveraging this new repo?
>>> 
>>> <dependency>
>>>  <groupId>org.apache.orc</groupId>
>>>     <artifactId>orc</artifactId>
>>>     <version>1.1.2</version>
>>>     <type>pom</type>
>>> </dependency>
>>> 
>>> 
>>> 
>>> 
>>> 
>>> 
>>> 
>>> 
>>> Sent from my iPhone
>>>> On Jul 26, 2016, at 4:50 PM, Koert Kuipers <ko...@tresata.com> wrote:
>>>> 
>>> 
>>>> parquet was inspired by dremel but written from the ground up as a library with support for a variety of big data systems (hive, pig, impala, cascading, etc.). it is also easy to add new support, since its a proper library.
>>>> 
>>>> orc bas been enhanced while deployed at facebook in hive and at yahoo in hive. just hive. it didn't really exist by itself. it was part of the big java soup that is called hive, without an easy way to extract it. hive does not expose proper java apis. it never cared for that.
>>>> 
>>>>> On Tue, Jul 26, 2016 at 9:57 AM, Ovidiu-Cristian MARCU <ov...@inria.fr> wrote:
>>>>> Interesting opinion, thank you
>>>>> 
>>>>> Still, on the website parquet is basically inspired by Dremel (Google) [1] and part of orc has been enhanced while deployed for Facebook, Yahoo [2].
>>>>> 
>>>>> Other than this presentation [3], do you guys know any other benchmark?
>>>>> 
>>>>> [1]https://parquet.apache.org/documentation/latest/
>>>>> [2]https://orc.apache.org/docs/
>>>>> [3] http://www.slideshare.net/oom65/file-format-benchmarks-avro-json-orc-parquet
>>>>> 
>>>>>> On 26 Jul 2016, at 15:19, Koert Kuipers <ko...@tresata.com> wrote:
>>>>>> 
>>>>>> when parquet came out it was developed by a community of companies, and was designed as a library to be supported by multiple big data projects. nice
>>>>>> 
>>>>>> orc on the other hand initially only supported hive. it wasn't even designed as a library that can be re-used. even today it brings in the kitchen sink of transitive dependencies. yikes
>>>>>> 
>>>>>> 
>>>>>>> On Jul 26, 2016 5:09 AM, "Jörn Franke" <jo...@gmail.com> wrote:
>>>>>>> I think both are very similar, but with slightly different goals. While they work transparently for each Hadoop application you need to enable specific support in the application for predicate push down. 
>>>>>>> In the end you have to check which application you are using and do some tests (with correct predicate push down configuration). Keep in mind that both formats work best if they are sorted on filter columns (which is your responsibility) and if their optimatizations are correctly configured (min max index, bloom filter, compression etc) . 
>>>>>>> 
>>>>>>> If you need to ingest sensor data you may want to store it first in hbase and then batch process it in large files in Orc or parquet format.
>>>>>>> 
>>>>>>>> On 26 Jul 2016, at 04:09, janardhan shetty <ja...@gmail.com> wrote:
>>>>>>>> 
>>>>>>>> Just wondering advantages and disadvantages to convert data into ORC or Parquet. 
>>>>>>>> 
>>>>>>>> In the documentation of Spark there are numerous examples of Parquet format. 
>>>>>>>> 
>>>>>>>> Any strong reasons to chose Parquet over ORC file format ?
>>>>>>>> 
>>>>>>>> Also : current data compression is bzip2
>>>>>>>> 
>>>>>>>> http://stackoverflow.com/questions/32373460/parquet-vs-orc-vs-orc-with-snappy 
>>>>>>>> This seems like biased.
>> 

Re: ORC v/s Parquet for Spark 2.0

Posted by Alexander Pivovarov <ap...@gmail.com>.
Found 0 matching posts for *ORC v/s Parquet for Spark 2.0* in the Apache Spark
User List (http://apache-spark-user-list.1001560.n3.nabble.com/).

Anyone have a link to this discussion? Want to share it with my colleagues.

On Thu, Jul 28, 2016 at 2:35 PM, Mich Talebzadeh <mi...@gmail.com>
wrote:

> As far as I know Spark still lacks the ability to handle Updates or
> deletes vis-à-vis ORC transactional tables. As you may know in Hive an ORC
> transactional table can handle updates and deletes. Transactional support
> was added to Hive for ORC tables. No transactional support with Spark SQL
> on ORC tables yet. Locking and concurrency (as used by Hive) with Spark
> app running a Hive context. I am not convinced this works actually. Case in
> point, you can test it for yourself in Spark and see whether locks are
> applied in Hive metastore . In my opinion Spark value comes as a query tool
> for faster query processing (DAG + IM capability)
>
> HTH
>
>
>
>
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> <https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>
>
>
> http://talebzadehmich.wordpress.com
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
> On 28 July 2016 at 18:46, Ofir Manor <of...@equalum.io> wrote:
>
>> BTW - this thread has many anecdotes on Apache ORC vs. Apache Parquet (I
>> personally think both are great at this point).
>> But the original question was about Spark 2.0. Anyone has some insights
>> about Parquet-specific optimizations / limitations vs. ORC-specific
>> optimizations / limitations in pre-2.0 vs. 2.0? I've put one in the
>> beginning of the thread regarding Structured Streaming, but there was a
>> general claim that pre-2.0 Spark was missing many ORC optimizations, and
>> that some (all?) were added in 2.0.
>> I saw that a lot of related tickets closed in 2.0, but it would great if
>> someone close to the details can explain.
>>
>> Ofir Manor
>>
>> Co-Founder & CTO | Equalum
>>
>> Mobile: +972-54-7801286 | Email: ofir.manor@equalum.io
>>
>> On Thu, Jul 28, 2016 at 6:49 PM, Mich Talebzadeh <
>> mich.talebzadeh@gmail.com> wrote:
>>
>>> Like anything else your mileage varies.
>>>
>>> ORC with Vectorised query execution
>>> <https://cwiki.apache.org/confluence/display/Hive/Vectorized+Query+Execution> is
>>> the nearest one can get to proper Data Warehouse like SAP IQ or Teradata
>>> with columnar indexes. To me that is cool. Parquet has been around and has
>>> its use case as well.
>>>
>>> I guess there is no hard and fast rule which one to use all the time.
>>> Use the one that provides best fit for the condition.
>>>
>>> HTH
>>>
>>> Dr Mich Talebzadeh
>>>
>>>
>>>
>>> LinkedIn * https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>> <https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>>>
>>>
>>>
>>> http://talebzadehmich.wordpress.com
>>>
>>>
>>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>>> any loss, damage or destruction of data or any other property which may
>>> arise from relying on this email's technical content is explicitly
>>> disclaimed. The author will in no case be liable for any monetary damages
>>> arising from such loss, damage or destruction.
>>>
>>>
>>>
>>> On 28 July 2016 at 09:18, Jörn Franke <jo...@gmail.com> wrote:
>>>
>>>> I see it more as a process of innovation and thus competition is good.
>>>> Companies just should not follow these religious arguments but try
>>>> themselves what suits them. There is more than software when using software
>>>> ;)
>>>>
>>>> On 28 Jul 2016, at 01:44, Mich Talebzadeh <mi...@gmail.com>
>>>> wrote:
>>>>
>>>> And frankly this is becoming some sort of religious arguments now
>>>>
>>>>
>>>>
>>>> Dr Mich Talebzadeh
>>>>
>>>>
>>>>
>>>> LinkedIn * https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>>> <https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>>>>
>>>>
>>>>
>>>> http://talebzadehmich.wordpress.com
>>>>
>>>>
>>>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>>>> any loss, damage or destruction of data or any other property which may
>>>> arise from relying on this email's technical content is explicitly
>>>> disclaimed. The author will in no case be liable for any monetary damages
>>>> arising from such loss, damage or destruction.
>>>>
>>>>
>>>>
>>>> On 28 July 2016 at 00:01, Sudhir Babu Pothineni <sb...@gmail.com>
>>>> wrote:
>>>>
>>>>> It depends on what you are dong, here is the recent comparison of ORC,
>>>>> Parquet
>>>>>
>>>>>
>>>>> https://www.slideshare.net/mobile/oom65/file-format-benchmarks-avro-json-orc-parquet
>>>>>
>>>>> Although from ORC authors, I thought fair comparison, We use ORC as
>>>>> System of Record on our Cloudera HDFS cluster, our experience is so far
>>>>> good.
>>>>>
>>>>> Perquet is backed by Cloudera, which has more installations of Hadoop.
>>>>> ORC is by Hortonworks, so battle of file format continues...
>>>>>
>>>>> Sent from my iPhone
>>>>>
>>>>> On Jul 27, 2016, at 4:54 PM, janardhan shetty <ja...@gmail.com>
>>>>> wrote:
>>>>>
>>>>> Seems like parquet format is better comparatively to orc when the
>>>>> dataset is log data without nested structures? Is this fair understanding ?
>>>>> On Jul 27, 2016 1:30 PM, "Jörn Franke" <jo...@gmail.com> wrote:
>>>>>
>>>>>> Kudu has been from my impression be designed to offer somethings
>>>>>> between hbase and parquet for write intensive loads - it is not faster for
>>>>>> warehouse type of querying compared to parquet (merely slower, because that
>>>>>> is not its use case).   I assume this is still the strategy of it.
>>>>>>
>>>>>> For some scenarios it could make sense together with parquet and Orc.
>>>>>> However I am not sure what the advantage towards using hbase + parquet and
>>>>>> Orc.
>>>>>>
>>>>>> On 27 Jul 2016, at 11:47, "Uwe@Moosheimer.com <Uw...@moosheimer.com>" <
>>>>>> Uwe@Moosheimer.com <Uw...@moosheimer.com>> wrote:
>>>>>>
>>>>>> Hi Gourav,
>>>>>>
>>>>>> Kudu (if you mean Apache Kuda, the Cloudera originated project) is a
>>>>>> in memory db with data storage while Parquet is "only" a columnar
>>>>>> storage format.
>>>>>>
>>>>>> As I understand, Kudu is a BI db to compete with Exasol or Hana (ok
>>>>>> ... that's more a wish :-).
>>>>>>
>>>>>> Regards,
>>>>>> Uwe
>>>>>>
>>>>>> Mit freundlichen Grüßen / best regards
>>>>>> Kay-Uwe Moosheimer
>>>>>>
>>>>>> Am 27.07.2016 um 09:15 schrieb Gourav Sengupta <
>>>>>> gourav.sengupta@gmail.com>:
>>>>>>
>>>>>> Gosh,
>>>>>>
>>>>>> whether ORC came from this or that, it runs queries in HIVE with TEZ
>>>>>> at a speed that is better than SPARK.
>>>>>>
>>>>>> Has anyone heard of KUDA? Its better than Parquet. But I think that
>>>>>> someone might just start saying that KUDA has difficult lineage as well.
>>>>>> After all dynastic rules dictate.
>>>>>>
>>>>>> Personally I feel that if something stores my data compressed and
>>>>>> makes me access it faster I do not care where it comes from or how
>>>>>> difficult the child birth was :)
>>>>>>
>>>>>>
>>>>>> Regards,
>>>>>> Gourav
>>>>>>
>>>>>> On Tue, Jul 26, 2016 at 11:19 PM, Sudhir Babu Pothineni <
>>>>>> sbpothineni@gmail.com> wrote:
>>>>>>
>>>>>>> Just correction:
>>>>>>>
>>>>>>> ORC Java libraries from Hive are forked into Apache ORC.
>>>>>>> Vectorization default.
>>>>>>>
>>>>>>> Do not know If Spark leveraging this new repo?
>>>>>>>
>>>>>>> <dependency>
>>>>>>>  <groupId>org.apache.orc</groupId>
>>>>>>>     <artifactId>orc</artifactId>
>>>>>>>     <version>1.1.2</version>
>>>>>>>     <type>pom</type>
>>>>>>> </dependency>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> Sent from my iPhone
>>>>>>> On Jul 26, 2016, at 4:50 PM, Koert Kuipers <ko...@tresata.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>> parquet was inspired by dremel but written from the ground up as a
>>>>>>> library with support for a variety of big data systems (hive, pig, impala,
>>>>>>> cascading, etc.). it is also easy to add new support, since its a proper
>>>>>>> library.
>>>>>>>
>>>>>>> orc bas been enhanced while deployed at facebook in hive and at
>>>>>>> yahoo in hive. just hive. it didn't really exist by itself. it was part of
>>>>>>> the big java soup that is called hive, without an easy way to extract it.
>>>>>>> hive does not expose proper java apis. it never cared for that.
>>>>>>>
>>>>>>> On Tue, Jul 26, 2016 at 9:57 AM, Ovidiu-Cristian MARCU <
>>>>>>> ovidiu-cristian.marcu@inria.fr> wrote:
>>>>>>>
>>>>>>>> Interesting opinion, thank you
>>>>>>>>
>>>>>>>> Still, on the website parquet is basically inspired by Dremel
>>>>>>>> (Google) [1] and part of orc has been enhanced while deployed for Facebook,
>>>>>>>> Yahoo [2].
>>>>>>>>
>>>>>>>> Other than this presentation [3], do you guys know any other
>>>>>>>> benchmark?
>>>>>>>>
>>>>>>>> [1]https://parquet.apache.org/documentation/latest/
>>>>>>>> [2]https://orc.apache.org/docs/
>>>>>>>> [3]
>>>>>>>> http://www.slideshare.net/oom65/file-format-benchmarks-avro-json-orc-parquet
>>>>>>>>
>>>>>>>> On 26 Jul 2016, at 15:19, Koert Kuipers <ko...@tresata.com> wrote:
>>>>>>>>
>>>>>>>> when parquet came out it was developed by a community of companies,
>>>>>>>> and was designed as a library to be supported by multiple big data
>>>>>>>> projects. nice
>>>>>>>>
>>>>>>>> orc on the other hand initially only supported hive. it wasn't even
>>>>>>>> designed as a library that can be re-used. even today it brings in the
>>>>>>>> kitchen sink of transitive dependencies. yikes
>>>>>>>>
>>>>>>>> On Jul 26, 2016 5:09 AM, "Jörn Franke" <jo...@gmail.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> I think both are very similar, but with slightly different goals.
>>>>>>>>> While they work transparently for each Hadoop application you need to
>>>>>>>>> enable specific support in the application for predicate push down.
>>>>>>>>> In the end you have to check which application you are using and
>>>>>>>>> do some tests (with correct predicate push down configuration). Keep in
>>>>>>>>> mind that both formats work best if they are sorted on filter columns
>>>>>>>>> (which is your responsibility) and if their optimatizations are correctly
>>>>>>>>> configured (min max index, bloom filter, compression etc) .
>>>>>>>>>
>>>>>>>>> If you need to ingest sensor data you may want to store it first
>>>>>>>>> in hbase and then batch process it in large files in Orc or parquet format.
>>>>>>>>>
>>>>>>>>> On 26 Jul 2016, at 04:09, janardhan shetty <ja...@gmail.com>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>> Just wondering advantages and disadvantages to convert data into
>>>>>>>>> ORC or Parquet.
>>>>>>>>>
>>>>>>>>> In the documentation of Spark there are numerous examples of
>>>>>>>>> Parquet format.
>>>>>>>>>
>>>>>>>>> Any strong reasons to chose Parquet over ORC file format ?
>>>>>>>>>
>>>>>>>>> Also : current data compression is bzip2
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> http://stackoverflow.com/questions/32373460/parquet-vs-orc-vs-orc-with-snappy
>>>>>>>>> This seems like biased.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>
>>>
>>
>

Re: ORC v/s Parquet for Spark 2.0

Posted by Mich Talebzadeh <mi...@gmail.com>.
As far as I know, Spark still lacks the ability to handle updates or deletes
vis-à-vis ORC transactional tables. As you may know, in Hive an ORC
transactional table can handle updates and deletes; transactional support
was added to Hive for ORC tables. There is no transactional support with
Spark SQL on ORC tables yet. As for locking and concurrency (as used by
Hive) with a Spark app running a Hive context, I am not convinced this
actually works. Case in point: you can test it for yourself in Spark and see
whether locks are applied in the Hive metastore. In my opinion Spark's value
comes as a query tool for faster query processing (DAG + in-memory capability).
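
For anyone who wants to try that test, a rough sketch of what I mean
(database/table names are made up; the UPDATE is expected to succeed from
Hive but not from Spark SQL):

// Scala sketch against a Hive metastore; names are hypothetical.
// Assumes a transactional ORC table was created on the Hive side, e.g.
//   CREATE TABLE test.tx_orc (id INT, val STRING)
//   CLUSTERED BY (id) INTO 4 BUCKETS
//   STORED AS ORC TBLPROPERTIES ('transactional'='true');
val spark = org.apache.spark.sql.SparkSession.builder()
  .appName("orc-acid-check")
  .enableHiveSupport()
  .getOrCreate()

// reading the table from Spark generally works
spark.sql("SELECT COUNT(*) FROM test.tx_orc").show()

// UPDATE/DELETE are not supported by Spark SQL, so this should fail;
// while testing, run SHOW LOCKS from Hive to see whether Spark took any locks
spark.sql("UPDATE test.tx_orc SET val = 'x' WHERE id = 1")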

HTH





Dr Mich Talebzadeh



LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw



http://talebzadehmich.wordpress.com


Disclaimer: Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.



On 28 July 2016 at 18:46, Ofir Manor <of...@equalum.io> wrote:

> BTW - this thread has many anecdotes on Apache ORC vs. Apache Parquet (I
> personally think both are great at this point).
> But the original question was about Spark 2.0. Anyone has some insights
> about Parquet-specific optimizations / limitations vs. ORC-specific
> optimizations / limitations in pre-2.0 vs. 2.0? I've put one in the
> beginning of the thread regarding Structured Streaming, but there was a
> general claim that pre-2.0 Spark was missing many ORC optimizations, and
> that some (all?) were added in 2.0.
> I saw that a lot of related tickets closed in 2.0, but it would great if
> someone close to the details can explain.
>
> Ofir Manor
>
> Co-Founder & CTO | Equalum
>
> Mobile: +972-54-7801286 | Email: ofir.manor@equalum.io
>
> On Thu, Jul 28, 2016 at 6:49 PM, Mich Talebzadeh <
> mich.talebzadeh@gmail.com> wrote:
>
>> Like anything else your mileage varies.
>>
>> ORC with Vectorised query execution
>> <https://cwiki.apache.org/confluence/display/Hive/Vectorized+Query+Execution> is
>> the nearest one can get to proper Data Warehouse like SAP IQ or Teradata
>> with columnar indexes. To me that is cool. Parquet has been around and has
>> its use case as well.
>>
>> I guess there is no hard and fast rule which one to use all the time. Use
>> the one that provides best fit for the condition.
>>
>> HTH
>>
>> Dr Mich Talebzadeh
>>
>>
>>
>> LinkedIn * https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>> <https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>>
>>
>>
>> http://talebzadehmich.wordpress.com
>>
>>
>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>> any loss, damage or destruction of data or any other property which may
>> arise from relying on this email's technical content is explicitly
>> disclaimed. The author will in no case be liable for any monetary damages
>> arising from such loss, damage or destruction.
>>
>>
>>
>> On 28 July 2016 at 09:18, Jörn Franke <jo...@gmail.com> wrote:
>>
>>> I see it more as a process of innovation and thus competition is good.
>>> Companies just should not follow these religious arguments but try
>>> themselves what suits them. There is more than software when using software
>>> ;)
>>>
>>> On 28 Jul 2016, at 01:44, Mich Talebzadeh <mi...@gmail.com>
>>> wrote:
>>>
>>> And frankly this is becoming some sort of religious arguments now
>>>
>>>
>>>
>>> Dr Mich Talebzadeh
>>>
>>>
>>>
>>> LinkedIn * https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>> <https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>>>
>>>
>>>
>>> http://talebzadehmich.wordpress.com
>>>
>>>
>>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>>> any loss, damage or destruction of data or any other property which may
>>> arise from relying on this email's technical content is explicitly
>>> disclaimed. The author will in no case be liable for any monetary damages
>>> arising from such loss, damage or destruction.
>>>
>>>
>>>
>>> On 28 July 2016 at 00:01, Sudhir Babu Pothineni <sb...@gmail.com>
>>> wrote:
>>>
>>>> It depends on what you are dong, here is the recent comparison of ORC,
>>>> Parquet
>>>>
>>>>
>>>> https://www.slideshare.net/mobile/oom65/file-format-benchmarks-avro-json-orc-parquet
>>>>
>>>> Although from ORC authors, I thought fair comparison, We use ORC as
>>>> System of Record on our Cloudera HDFS cluster, our experience is so far
>>>> good.
>>>>
>>>> Perquet is backed by Cloudera, which has more installations of Hadoop.
>>>> ORC is by Hortonworks, so battle of file format continues...
>>>>
>>>> Sent from my iPhone
>>>>
>>>> On Jul 27, 2016, at 4:54 PM, janardhan shetty <ja...@gmail.com>
>>>> wrote:
>>>>
>>>> Seems like parquet format is better comparatively to orc when the
>>>> dataset is log data without nested structures? Is this fair understanding ?
>>>> On Jul 27, 2016 1:30 PM, "Jörn Franke" <jo...@gmail.com> wrote:
>>>>
>>>>> Kudu has been from my impression be designed to offer somethings
>>>>> between hbase and parquet for write intensive loads - it is not faster for
>>>>> warehouse type of querying compared to parquet (merely slower, because that
>>>>> is not its use case).   I assume this is still the strategy of it.
>>>>>
>>>>> For some scenarios it could make sense together with parquet and Orc.
>>>>> However I am not sure what the advantage towards using hbase + parquet and
>>>>> Orc.
>>>>>
>>>>> On 27 Jul 2016, at 11:47, "Uwe@Moosheimer.com <Uw...@moosheimer.com>" <
>>>>> Uwe@Moosheimer.com <Uw...@moosheimer.com>> wrote:
>>>>>
>>>>> Hi Gourav,
>>>>>
>>>>> Kudu (if you mean Apache Kuda, the Cloudera originated project) is a
>>>>> in memory db with data storage while Parquet is "only" a columnar
>>>>> storage format.
>>>>>
>>>>> As I understand, Kudu is a BI db to compete with Exasol or Hana (ok
>>>>> ... that's more a wish :-).
>>>>>
>>>>> Regards,
>>>>> Uwe
>>>>>
>>>>> Mit freundlichen Grüßen / best regards
>>>>> Kay-Uwe Moosheimer
>>>>>
>>>>> Am 27.07.2016 um 09:15 schrieb Gourav Sengupta <
>>>>> gourav.sengupta@gmail.com>:
>>>>>
>>>>> Gosh,
>>>>>
>>>>> whether ORC came from this or that, it runs queries in HIVE with TEZ
>>>>> at a speed that is better than SPARK.
>>>>>
>>>>> Has anyone heard of KUDA? Its better than Parquet. But I think that
>>>>> someone might just start saying that KUDA has difficult lineage as well.
>>>>> After all dynastic rules dictate.
>>>>>
>>>>> Personally I feel that if something stores my data compressed and
>>>>> makes me access it faster I do not care where it comes from or how
>>>>> difficult the child birth was :)
>>>>>
>>>>>
>>>>> Regards,
>>>>> Gourav
>>>>>
>>>>> On Tue, Jul 26, 2016 at 11:19 PM, Sudhir Babu Pothineni <
>>>>> sbpothineni@gmail.com> wrote:
>>>>>
>>>>>> Just correction:
>>>>>>
>>>>>> ORC Java libraries from Hive are forked into Apache ORC.
>>>>>> Vectorization default.
>>>>>>
>>>>>> Do not know If Spark leveraging this new repo?
>>>>>>
>>>>>> <dependency>
>>>>>>  <groupId>org.apache.orc</groupId>
>>>>>>     <artifactId>orc</artifactId>
>>>>>>     <version>1.1.2</version>
>>>>>>     <type>pom</type>
>>>>>> </dependency>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> Sent from my iPhone
>>>>>> On Jul 26, 2016, at 4:50 PM, Koert Kuipers <ko...@tresata.com> wrote:
>>>>>>
>>>>>> parquet was inspired by dremel but written from the ground up as a
>>>>>> library with support for a variety of big data systems (hive, pig, impala,
>>>>>> cascading, etc.). it is also easy to add new support, since its a proper
>>>>>> library.
>>>>>>
>>>>>> orc bas been enhanced while deployed at facebook in hive and at yahoo
>>>>>> in hive. just hive. it didn't really exist by itself. it was part of the
>>>>>> big java soup that is called hive, without an easy way to extract it. hive
>>>>>> does not expose proper java apis. it never cared for that.
>>>>>>
>>>>>> On Tue, Jul 26, 2016 at 9:57 AM, Ovidiu-Cristian MARCU <
>>>>>> ovidiu-cristian.marcu@inria.fr> wrote:
>>>>>>
>>>>>>> Interesting opinion, thank you
>>>>>>>
>>>>>>> Still, on the website parquet is basically inspired by Dremel
>>>>>>> (Google) [1] and part of orc has been enhanced while deployed for Facebook,
>>>>>>> Yahoo [2].
>>>>>>>
>>>>>>> Other than this presentation [3], do you guys know any other
>>>>>>> benchmark?
>>>>>>>
>>>>>>> [1]https://parquet.apache.org/documentation/latest/
>>>>>>> [2]https://orc.apache.org/docs/
>>>>>>> [3]
>>>>>>> http://www.slideshare.net/oom65/file-format-benchmarks-avro-json-orc-parquet
>>>>>>>
>>>>>>> On 26 Jul 2016, at 15:19, Koert Kuipers <ko...@tresata.com> wrote:
>>>>>>>
>>>>>>> when parquet came out it was developed by a community of companies,
>>>>>>> and was designed as a library to be supported by multiple big data
>>>>>>> projects. nice
>>>>>>>
>>>>>>> orc on the other hand initially only supported hive. it wasn't even
>>>>>>> designed as a library that can be re-used. even today it brings in the
>>>>>>> kitchen sink of transitive dependencies. yikes
>>>>>>>
>>>>>>> On Jul 26, 2016 5:09 AM, "Jörn Franke" <jo...@gmail.com> wrote:
>>>>>>>
>>>>>>>> I think both are very similar, but with slightly different goals.
>>>>>>>> While they work transparently for each Hadoop application you need to
>>>>>>>> enable specific support in the application for predicate push down.
>>>>>>>> In the end you have to check which application you are using and do
>>>>>>>> some tests (with correct predicate push down configuration). Keep in mind
>>>>>>>> that both formats work best if they are sorted on filter columns (which is
>>>>>>>> your responsibility) and if their optimatizations are correctly configured
>>>>>>>> (min max index, bloom filter, compression etc) .
>>>>>>>>
>>>>>>>> If you need to ingest sensor data you may want to store it first in
>>>>>>>> hbase and then batch process it in large files in Orc or parquet format.
>>>>>>>>
>>>>>>>> On 26 Jul 2016, at 04:09, janardhan shetty <ja...@gmail.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>> Just wondering advantages and disadvantages to convert data into
>>>>>>>> ORC or Parquet.
>>>>>>>>
>>>>>>>> In the documentation of Spark there are numerous examples of
>>>>>>>> Parquet format.
>>>>>>>>
>>>>>>>> Any strong reasons to chose Parquet over ORC file format ?
>>>>>>>>
>>>>>>>> Also : current data compression is bzip2
>>>>>>>>
>>>>>>>>
>>>>>>>> http://stackoverflow.com/questions/32373460/parquet-vs-orc-vs-orc-with-snappy
>>>>>>>> This seems like biased.
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>
>>
>

Re: ORC v/s Parquet for Spark 2.0

Posted by Ofir Manor <of...@equalum.io>.
BTW - this thread has many anecdotes on Apache ORC vs. Apache Parquet (I
personally think both are great at this point).
But the original question was about Spark 2.0. Does anyone have insights
into Parquet-specific optimizations / limitations vs. ORC-specific
optimizations / limitations in pre-2.0 vs. 2.0? I've put one in the
beginning of the thread regarding Structured Streaming, but there was a
general claim that pre-2.0 Spark was missing many ORC optimizations and
that some (all?) were added in 2.0.
I saw that a lot of related tickets were closed in 2.0, but it would be
great if someone close to the details could explain.

Ofir Manor

Co-Founder & CTO | Equalum

Mobile: +972-54-7801286 | Email: ofir.manor@equalum.io

On Thu, Jul 28, 2016 at 6:49 PM, Mich Talebzadeh <mi...@gmail.com>
wrote:

> Like anything else your mileage varies.
>
> ORC with Vectorised query execution
> <https://cwiki.apache.org/confluence/display/Hive/Vectorized+Query+Execution> is
> the nearest one can get to proper Data Warehouse like SAP IQ or Teradata
> with columnar indexes. To me that is cool. Parquet has been around and has
> its use case as well.
>
> I guess there is no hard and fast rule which one to use all the time. Use
> the one that provides best fit for the condition.
>
> HTH
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> <https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>
>
>
> http://talebzadehmich.wordpress.com
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
> On 28 July 2016 at 09:18, Jörn Franke <jo...@gmail.com> wrote:
>
>> I see it more as a process of innovation and thus competition is good.
>> Companies just should not follow these religious arguments but try
>> themselves what suits them. There is more than software when using software
>> ;)
>>
>> On 28 Jul 2016, at 01:44, Mich Talebzadeh <mi...@gmail.com>
>> wrote:
>>
>> And frankly this is becoming some sort of religious arguments now
>>
>>
>>
>> Dr Mich Talebzadeh
>>
>>
>>
>> LinkedIn * https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>> <https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>>
>>
>>
>> http://talebzadehmich.wordpress.com
>>
>>
>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>> any loss, damage or destruction of data or any other property which may
>> arise from relying on this email's technical content is explicitly
>> disclaimed. The author will in no case be liable for any monetary damages
>> arising from such loss, damage or destruction.
>>
>>
>>
>> On 28 July 2016 at 00:01, Sudhir Babu Pothineni <sb...@gmail.com>
>> wrote:
>>
>>> It depends on what you are dong, here is the recent comparison of ORC,
>>> Parquet
>>>
>>>
>>> https://www.slideshare.net/mobile/oom65/file-format-benchmarks-avro-json-orc-parquet
>>>
>>> Although from ORC authors, I thought fair comparison, We use ORC as
>>> System of Record on our Cloudera HDFS cluster, our experience is so far
>>> good.
>>>
>>> Perquet is backed by Cloudera, which has more installations of Hadoop.
>>> ORC is by Hortonworks, so battle of file format continues...
>>>
>>> Sent from my iPhone
>>>
>>> On Jul 27, 2016, at 4:54 PM, janardhan shetty <ja...@gmail.com>
>>> wrote:
>>>
>>> Seems like parquet format is better comparatively to orc when the
>>> dataset is log data without nested structures? Is this fair understanding ?
>>> On Jul 27, 2016 1:30 PM, "Jörn Franke" <jo...@gmail.com> wrote:
>>>
>>>> Kudu has been from my impression be designed to offer somethings
>>>> between hbase and parquet for write intensive loads - it is not faster for
>>>> warehouse type of querying compared to parquet (merely slower, because that
>>>> is not its use case).   I assume this is still the strategy of it.
>>>>
>>>> For some scenarios it could make sense together with parquet and Orc.
>>>> However I am not sure what the advantage towards using hbase + parquet and
>>>> Orc.
>>>>
>>>> On 27 Jul 2016, at 11:47, "Uwe@Moosheimer.com <Uw...@moosheimer.com>" <
>>>> Uwe@Moosheimer.com <Uw...@moosheimer.com>> wrote:
>>>>
>>>> Hi Gourav,
>>>>
>>>> Kudu (if you mean Apache Kuda, the Cloudera originated project) is a in
>>>> memory db with data storage while Parquet is "only" a columnar
>>>> storage format.
>>>>
>>>> As I understand, Kudu is a BI db to compete with Exasol or Hana (ok ...
>>>> that's more a wish :-).
>>>>
>>>> Regards,
>>>> Uwe
>>>>
>>>> Mit freundlichen Grüßen / best regards
>>>> Kay-Uwe Moosheimer
>>>>
>>>> Am 27.07.2016 um 09:15 schrieb Gourav Sengupta <
>>>> gourav.sengupta@gmail.com>:
>>>>
>>>> Gosh,
>>>>
>>>> whether ORC came from this or that, it runs queries in HIVE with TEZ at
>>>> a speed that is better than SPARK.
>>>>
>>>> Has anyone heard of KUDA? Its better than Parquet. But I think that
>>>> someone might just start saying that KUDA has difficult lineage as well.
>>>> After all dynastic rules dictate.
>>>>
>>>> Personally I feel that if something stores my data compressed and makes
>>>> me access it faster I do not care where it comes from or how difficult the
>>>> child birth was :)
>>>>
>>>>
>>>> Regards,
>>>> Gourav
>>>>
>>>> On Tue, Jul 26, 2016 at 11:19 PM, Sudhir Babu Pothineni <
>>>> sbpothineni@gmail.com> wrote:
>>>>
>>>>> Just correction:
>>>>>
>>>>> ORC Java libraries from Hive are forked into Apache ORC. Vectorization
>>>>> default.
>>>>>
>>>>> Do not know If Spark leveraging this new repo?
>>>>>
>>>>> <dependency>
>>>>>  <groupId>org.apache.orc</groupId>
>>>>>     <artifactId>orc</artifactId>
>>>>>     <version>1.1.2</version>
>>>>>     <type>pom</type>
>>>>> </dependency>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> Sent from my iPhone
>>>>> On Jul 26, 2016, at 4:50 PM, Koert Kuipers <ko...@tresata.com> wrote:
>>>>>
>>>>> parquet was inspired by dremel but written from the ground up as a
>>>>> library with support for a variety of big data systems (hive, pig, impala,
>>>>> cascading, etc.). it is also easy to add new support, since its a proper
>>>>> library.
>>>>>
>>>>> orc bas been enhanced while deployed at facebook in hive and at yahoo
>>>>> in hive. just hive. it didn't really exist by itself. it was part of the
>>>>> big java soup that is called hive, without an easy way to extract it. hive
>>>>> does not expose proper java apis. it never cared for that.
>>>>>
>>>>> On Tue, Jul 26, 2016 at 9:57 AM, Ovidiu-Cristian MARCU <
>>>>> ovidiu-cristian.marcu@inria.fr> wrote:
>>>>>
>>>>>> Interesting opinion, thank you
>>>>>>
>>>>>> Still, on the website parquet is basically inspired by Dremel
>>>>>> (Google) [1] and part of orc has been enhanced while deployed for Facebook,
>>>>>> Yahoo [2].
>>>>>>
>>>>>> Other than this presentation [3], do you guys know any other
>>>>>> benchmark?
>>>>>>
>>>>>> [1]https://parquet.apache.org/documentation/latest/
>>>>>> [2]https://orc.apache.org/docs/
>>>>>> [3]
>>>>>> http://www.slideshare.net/oom65/file-format-benchmarks-avro-json-orc-parquet
>>>>>>
>>>>>> On 26 Jul 2016, at 15:19, Koert Kuipers <ko...@tresata.com> wrote:
>>>>>>
>>>>>> when parquet came out it was developed by a community of companies,
>>>>>> and was designed as a library to be supported by multiple big data
>>>>>> projects. nice
>>>>>>
>>>>>> orc on the other hand initially only supported hive. it wasn't even
>>>>>> designed as a library that can be re-used. even today it brings in the
>>>>>> kitchen sink of transitive dependencies. yikes
>>>>>>
>>>>>> On Jul 26, 2016 5:09 AM, "Jörn Franke" <jo...@gmail.com> wrote:
>>>>>>
>>>>>>> I think both are very similar, but with slightly different goals.
>>>>>>> While they work transparently for each Hadoop application you need to
>>>>>>> enable specific support in the application for predicate push down.
>>>>>>> In the end you have to check which application you are using and do
>>>>>>> some tests (with correct predicate push down configuration). Keep in mind
>>>>>>> that both formats work best if they are sorted on filter columns (which is
>>>>>>> your responsibility) and if their optimatizations are correctly configured
>>>>>>> (min max index, bloom filter, compression etc) .
>>>>>>>
>>>>>>> If you need to ingest sensor data you may want to store it first in
>>>>>>> hbase and then batch process it in large files in Orc or parquet format.
>>>>>>>
>>>>>>> On 26 Jul 2016, at 04:09, janardhan shetty <ja...@gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>> Just wondering advantages and disadvantages to convert data into ORC
>>>>>>> or Parquet.
>>>>>>>
>>>>>>> In the documentation of Spark there are numerous examples of Parquet
>>>>>>> format.
>>>>>>>
>>>>>>> Any strong reasons to chose Parquet over ORC file format ?
>>>>>>>
>>>>>>> Also : current data compression is bzip2
>>>>>>>
>>>>>>>
>>>>>>> http://stackoverflow.com/questions/32373460/parquet-vs-orc-vs-orc-with-snappy
>>>>>>> This seems like biased.
>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>
>

Re: ORC v/s Parquet for Spark 2.0

Posted by Mich Talebzadeh <mi...@gmail.com>.
Like anything else, your mileage varies.

ORC with vectorised query execution
<https://cwiki.apache.org/confluence/display/Hive/Vectorized+Query+Execution>
is the nearest one can get to a proper data warehouse like SAP IQ or
Teradata with columnar indexes. To me that is cool. Parquet has been around
and has its use case as well.

I guess there is no hard-and-fast rule on which one to use all the time.
Use the one that provides the best fit for the situation.

HTH

Dr Mich Talebzadeh



LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw



http://talebzadehmich.wordpress.com


Disclaimer: Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.



On 28 July 2016 at 09:18, Jörn Franke <jo...@gmail.com> wrote:

> I see it more as a process of innovation and thus competition is good.
> Companies just should not follow these religious arguments but try
> themselves what suits them. There is more than software when using software
> ;)
>
> On 28 Jul 2016, at 01:44, Mich Talebzadeh <mi...@gmail.com>
> wrote:
>
> And frankly this is becoming some sort of religious arguments now
>
>
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> <https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>
>
>
> http://talebzadehmich.wordpress.com
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
> On 28 July 2016 at 00:01, Sudhir Babu Pothineni <sb...@gmail.com>
> wrote:
>
>> It depends on what you are dong, here is the recent comparison of ORC,
>> Parquet
>>
>>
>> https://www.slideshare.net/mobile/oom65/file-format-benchmarks-avro-json-orc-parquet
>>
>> Although from ORC authors, I thought fair comparison, We use ORC as
>> System of Record on our Cloudera HDFS cluster, our experience is so far
>> good.
>>
>> Perquet is backed by Cloudera, which has more installations of Hadoop.
>> ORC is by Hortonworks, so battle of file format continues...
>>
>> Sent from my iPhone
>>
>> On Jul 27, 2016, at 4:54 PM, janardhan shetty <ja...@gmail.com>
>> wrote:
>>
>> Seems like parquet format is better comparatively to orc when the dataset
>> is log data without nested structures? Is this fair understanding ?
>> On Jul 27, 2016 1:30 PM, "Jörn Franke" <jo...@gmail.com> wrote:
>>
>>> Kudu has been from my impression be designed to offer somethings between
>>> hbase and parquet for write intensive loads - it is not faster for
>>> warehouse type of querying compared to parquet (merely slower, because that
>>> is not its use case).   I assume this is still the strategy of it.
>>>
>>> For some scenarios it could make sense together with parquet and Orc.
>>> However I am not sure what the advantage towards using hbase + parquet and
>>> Orc.
>>>
>>> On 27 Jul 2016, at 11:47, "Uwe@Moosheimer.com <Uw...@moosheimer.com>" <
>>> Uwe@Moosheimer.com <Uw...@moosheimer.com>> wrote:
>>>
>>> Hi Gourav,
>>>
>>> Kudu (if you mean Apache Kuda, the Cloudera originated project) is a in
>>> memory db with data storage while Parquet is "only" a columnar
>>> storage format.
>>>
>>> As I understand, Kudu is a BI db to compete with Exasol or Hana (ok ...
>>> that's more a wish :-).
>>>
>>> Regards,
>>> Uwe
>>>
>>> Mit freundlichen Grüßen / best regards
>>> Kay-Uwe Moosheimer
>>>
>>> Am 27.07.2016 um 09:15 schrieb Gourav Sengupta <
>>> gourav.sengupta@gmail.com>:
>>>
>>> Gosh,
>>>
>>> whether ORC came from this or that, it runs queries in HIVE with TEZ at
>>> a speed that is better than SPARK.
>>>
>>> Has anyone heard of KUDA? Its better than Parquet. But I think that
>>> someone might just start saying that KUDA has difficult lineage as well.
>>> After all dynastic rules dictate.
>>>
>>> Personally I feel that if something stores my data compressed and makes
>>> me access it faster I do not care where it comes from or how difficult the
>>> child birth was :)
>>>
>>>
>>> Regards,
>>> Gourav
>>>
>>> On Tue, Jul 26, 2016 at 11:19 PM, Sudhir Babu Pothineni <
>>> sbpothineni@gmail.com> wrote:
>>>
>>>> Just correction:
>>>>
>>>> ORC Java libraries from Hive are forked into Apache ORC. Vectorization
>>>> default.
>>>>
>>>> Do not know If Spark leveraging this new repo?
>>>>
>>>> <dependency>
>>>>  <groupId>org.apache.orc</groupId>
>>>>     <artifactId>orc</artifactId>
>>>>     <version>1.1.2</version>
>>>>     <type>pom</type>
>>>> </dependency>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> Sent from my iPhone
>>>> On Jul 26, 2016, at 4:50 PM, Koert Kuipers <ko...@tresata.com> wrote:
>>>>
>>>> parquet was inspired by dremel but written from the ground up as a
>>>> library with support for a variety of big data systems (hive, pig, impala,
>>>> cascading, etc.). it is also easy to add new support, since its a proper
>>>> library.
>>>>
>>>> orc bas been enhanced while deployed at facebook in hive and at yahoo
>>>> in hive. just hive. it didn't really exist by itself. it was part of the
>>>> big java soup that is called hive, without an easy way to extract it. hive
>>>> does not expose proper java apis. it never cared for that.
>>>>
>>>> On Tue, Jul 26, 2016 at 9:57 AM, Ovidiu-Cristian MARCU <
>>>> ovidiu-cristian.marcu@inria.fr> wrote:
>>>>
>>>>> Interesting opinion, thank you
>>>>>
>>>>> Still, on the website parquet is basically inspired by Dremel (Google)
>>>>> [1] and part of orc has been enhanced while deployed for Facebook, Yahoo
>>>>> [2].
>>>>>
>>>>> Other than this presentation [3], do you guys know any other benchmark?
>>>>>
>>>>> [1]https://parquet.apache.org/documentation/latest/
>>>>> [2]https://orc.apache.org/docs/
>>>>> [3]
>>>>> http://www.slideshare.net/oom65/file-format-benchmarks-avro-json-orc-parquet
>>>>>
>>>>> On 26 Jul 2016, at 15:19, Koert Kuipers <ko...@tresata.com> wrote:
>>>>>
>>>>> when parquet came out it was developed by a community of companies,
>>>>> and was designed as a library to be supported by multiple big data
>>>>> projects. nice
>>>>>
>>>>> orc on the other hand initially only supported hive. it wasn't even
>>>>> designed as a library that can be re-used. even today it brings in the
>>>>> kitchen sink of transitive dependencies. yikes
>>>>>
>>>>> On Jul 26, 2016 5:09 AM, "Jörn Franke" <jo...@gmail.com> wrote:
>>>>>
>>>>>> I think both are very similar, but with slightly different goals.
>>>>>> While they work transparently for each Hadoop application you need to
>>>>>> enable specific support in the application for predicate push down.
>>>>>> In the end you have to check which application you are using and do
>>>>>> some tests (with correct predicate push down configuration). Keep in mind
>>>>>> that both formats work best if they are sorted on filter columns (which is
>>>>>> your responsibility) and if their optimatizations are correctly configured
>>>>>> (min max index, bloom filter, compression etc) .
>>>>>>
>>>>>> If you need to ingest sensor data you may want to store it first in
>>>>>> hbase and then batch process it in large files in Orc or parquet format.
>>>>>>
>>>>>> On 26 Jul 2016, at 04:09, janardhan shetty <ja...@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>> Just wondering advantages and disadvantages to convert data into ORC
>>>>>> or Parquet.
>>>>>>
>>>>>> In the documentation of Spark there are numerous examples of Parquet
>>>>>> format.
>>>>>>
>>>>>> Any strong reasons to chose Parquet over ORC file format ?
>>>>>>
>>>>>> Also : current data compression is bzip2
>>>>>>
>>>>>>
>>>>>> http://stackoverflow.com/questions/32373460/parquet-vs-orc-vs-orc-with-snappy
>>>>>> This seems like biased.
>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>

Re: ORC v/s Parquet for Spark 2.0

Posted by Jörn Franke <jo...@gmail.com>.
I see it more as a process of innovation, and thus competition is good. Companies just should not follow these religious arguments but should try for themselves what suits them. There is more to using software than the software itself ;)

> On 28 Jul 2016, at 01:44, Mich Talebzadeh <mi...@gmail.com> wrote:
> 
> And frankly this is becoming some sort of religious arguments now
> 
> 
> 
> Dr Mich Talebzadeh
>  
> LinkedIn  https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>  
> http://talebzadehmich.wordpress.com
> 
> Disclaimer: Use it at your own risk. Any and all responsibility for any loss, damage or destruction of data or any other property which may arise from relying on this email's technical content is explicitly disclaimed. The author will in no case be liable for any monetary damages arising from such loss, damage or destruction.
>  
> 
>> On 28 July 2016 at 00:01, Sudhir Babu Pothineni <sb...@gmail.com> wrote:
>> It depends on what you are dong, here is the recent comparison of ORC, Parquet
>> 
>> https://www.slideshare.net/mobile/oom65/file-format-benchmarks-avro-json-orc-parquet
>> 
>> Although from ORC authors, I thought fair comparison, We use ORC as System of Record on our Cloudera HDFS cluster, our experience is so far good.
>> 
>> Perquet is backed by Cloudera, which has more installations of Hadoop. ORC is by Hortonworks, so battle of file format continues...
>> 
>> Sent from my iPhone
>> 
>>> On Jul 27, 2016, at 4:54 PM, janardhan shetty <ja...@gmail.com> wrote:
>>> 
>>> Seems like parquet format is better comparatively to orc when the dataset is log data without nested structures? Is this fair understanding ?
>>> 
>>>> On Jul 27, 2016 1:30 PM, "Jörn Franke" <jo...@gmail.com> wrote:
>>>> Kudu has been from my impression be designed to offer somethings between hbase and parquet for write intensive loads - it is not faster for warehouse type of querying compared to parquet (merely slower, because that is not its use case).   I assume this is still the strategy of it.
>>>> 
>>>> For some scenarios it could make sense together with parquet and Orc. However I am not sure what the advantage towards using hbase + parquet and Orc.
>>>> 
>>>>> On 27 Jul 2016, at 11:47, "Uwe@Moosheimer.com" <Uw...@Moosheimer.com> wrote:
>>>>> 
>>>>> Hi Gourav,
>>>>> 
>>>>> Kudu (if you mean Apache Kuda, the Cloudera originated project) is a in memory db with data storage while Parquet is "only" a columnar storage format.
>>>>> 
>>>>> As I understand, Kudu is a BI db to compete with Exasol or Hana (ok ... that's more a wish :-).
>>>>> 
>>>>> Regards,
>>>>> Uwe
>>>>> 
>>>>> Mit freundlichen Grüßen / best regards
>>>>> Kay-Uwe Moosheimer
>>>>> 
>>>>>> Am 27.07.2016 um 09:15 schrieb Gourav Sengupta <go...@gmail.com>:
>>>>>> 
>>>>>> Gosh,
>>>>>> 
>>>>>> whether ORC came from this or that, it runs queries in HIVE with TEZ at a speed that is better than SPARK.
>>>>>> 
>>>>>> Has anyone heard of KUDA? Its better than Parquet. But I think that someone might just start saying that KUDA has difficult lineage as well. After all dynastic rules dictate.
>>>>>> 
>>>>>> Personally I feel that if something stores my data compressed and makes me access it faster I do not care where it comes from or how difficult the child birth was :)
>>>>>> 
>>>>>> 
>>>>>> Regards,
>>>>>> Gourav
>>>>>> 
>>>>>>> On Tue, Jul 26, 2016 at 11:19 PM, Sudhir Babu Pothineni <sb...@gmail.com> wrote:
>>>>>>> Just correction:
>>>>>>> 
>>>>>>> ORC Java libraries from Hive are forked into Apache ORC. Vectorization default. 
>>>>>>> 
>>>>>>> Do not know If Spark leveraging this new repo?
>>>>>>> 
>>>>>>> <dependency>
>>>>>>>  <groupId>org.apache.orc</groupId>
>>>>>>>     <artifactId>orc</artifactId>
>>>>>>>     <version>1.1.2</version>
>>>>>>>     <type>pom</type>
>>>>>>> </dependency>
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> Sent from my iPhone
>>>>>>>> On Jul 26, 2016, at 4:50 PM, Koert Kuipers <ko...@tresata.com> wrote:
>>>>>>>> 
>>>>>>> 
>>>>>>>> parquet was inspired by dremel but written from the ground up as a library with support for a variety of big data systems (hive, pig, impala, cascading, etc.). it is also easy to add new support, since its a proper library.
>>>>>>>> 
>>>>>>>> orc bas been enhanced while deployed at facebook in hive and at yahoo in hive. just hive. it didn't really exist by itself. it was part of the big java soup that is called hive, without an easy way to extract it. hive does not expose proper java apis. it never cared for that.
>>>>>>>> 
>>>>>>>>> On Tue, Jul 26, 2016 at 9:57 AM, Ovidiu-Cristian MARCU <ov...@inria.fr> wrote:
>>>>>>>>> Interesting opinion, thank you
>>>>>>>>> 
>>>>>>>>> Still, on the website parquet is basically inspired by Dremel (Google) [1] and part of orc has been enhanced while deployed for Facebook, Yahoo [2].
>>>>>>>>> 
>>>>>>>>> Other than this presentation [3], do you guys know any other benchmark?
>>>>>>>>> 
>>>>>>>>> [1]https://parquet.apache.org/documentation/latest/
>>>>>>>>> [2]https://orc.apache.org/docs/
>>>>>>>>> [3] http://www.slideshare.net/oom65/file-format-benchmarks-avro-json-orc-parquet
>>>>>>>>> 
>>>>>>>>>> On 26 Jul 2016, at 15:19, Koert Kuipers <ko...@tresata.com> wrote:
>>>>>>>>>> 
>>>>>>>>>> when parquet came out it was developed by a community of companies, and was designed as a library to be supported by multiple big data projects. nice
>>>>>>>>>> 
>>>>>>>>>> orc on the other hand initially only supported hive. it wasn't even designed as a library that can be re-used. even today it brings in the kitchen sink of transitive dependencies. yikes
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>>> On Jul 26, 2016 5:09 AM, "Jörn Franke" <jo...@gmail.com> wrote:
>>>>>>>>>>> I think both are very similar, but with slightly different goals. While they work transparently for each Hadoop application you need to enable specific support in the application for predicate push down. 
>>>>>>>>>>> In the end you have to check which application you are using and do some tests (with correct predicate push down configuration). Keep in mind that both formats work best if they are sorted on filter columns (which is your responsibility) and if their optimatizations are correctly configured (min max index, bloom filter, compression etc) . 
>>>>>>>>>>> 
>>>>>>>>>>> If you need to ingest sensor data you may want to store it first in hbase and then batch process it in large files in Orc or parquet format.
>>>>>>>>>>> 
>>>>>>>>>>>> On 26 Jul 2016, at 04:09, janardhan shetty <ja...@gmail.com> wrote:
>>>>>>>>>>>> 
>>>>>>>>>>>> Just wondering advantages and disadvantages to convert data into ORC or Parquet. 
>>>>>>>>>>>> 
>>>>>>>>>>>> In the documentation of Spark there are numerous examples of Parquet format. 
>>>>>>>>>>>> 
>>>>>>>>>>>> Any strong reasons to chose Parquet over ORC file format ?
>>>>>>>>>>>> 
>>>>>>>>>>>> Also : current data compression is bzip2
>>>>>>>>>>>> 
>>>>>>>>>>>> http://stackoverflow.com/questions/32373460/parquet-vs-orc-vs-orc-with-snappy 
>>>>>>>>>>>> This seems like biased.
> 

Re: ORC v/s Parquet for Spark 2.0

Posted by Mich Talebzadeh <mi...@gmail.com>.
And frankly, this is becoming some sort of religious argument now



Dr Mich Talebzadeh



LinkedIn  https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw



http://talebzadehmich.wordpress.com


Disclaimer: Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.



On 28 July 2016 at 00:01, Sudhir Babu Pothineni <sb...@gmail.com>
wrote:

> It depends on what you are dong, here is the recent comparison of ORC,
> Parquet
>
>
> https://www.slideshare.net/mobile/oom65/file-format-benchmarks-avro-json-orc-parquet
>
> Although from ORC authors, I thought fair comparison, We use ORC as System
> of Record on our Cloudera HDFS cluster, our experience is so far good.
>
> Perquet is backed by Cloudera, which has more installations of Hadoop. ORC
> is by Hortonworks, so battle of file format continues...
>
> Sent from my iPhone
>
> On Jul 27, 2016, at 4:54 PM, janardhan shetty <ja...@gmail.com>
> wrote:
>
> Seems like parquet format is better comparatively to orc when the dataset
> is log data without nested structures? Is this fair understanding ?
> On Jul 27, 2016 1:30 PM, "Jörn Franke" <jo...@gmail.com> wrote:
>
>> Kudu has been from my impression be designed to offer somethings between
>> hbase and parquet for write intensive loads - it is not faster for
>> warehouse type of querying compared to parquet (merely slower, because that
>> is not its use case).   I assume this is still the strategy of it.
>>
>> For some scenarios it could make sense together with parquet and Orc.
>> However I am not sure what the advantage towards using hbase + parquet and
>> Orc.
>>
>> On 27 Jul 2016, at 11:47, "Uwe@Moosheimer.com <Uw...@moosheimer.com>" <
>> Uwe@Moosheimer.com <Uw...@moosheimer.com>> wrote:
>>
>> Hi Gourav,
>>
>> Kudu (if you mean Apache Kuda, the Cloudera originated project) is a in
>> memory db with data storage while Parquet is "only" a columnar
>> storage format.
>>
>> As I understand, Kudu is a BI db to compete with Exasol or Hana (ok ...
>> that's more a wish :-).
>>
>> Regards,
>> Uwe
>>
>> Mit freundlichen Grüßen / best regards
>> Kay-Uwe Moosheimer
>>
>> Am 27.07.2016 um 09:15 schrieb Gourav Sengupta <gourav.sengupta@gmail.com
>> >:
>>
>> Gosh,
>>
>> whether ORC came from this or that, it runs queries in HIVE with TEZ at a
>> speed that is better than SPARK.
>>
>> Has anyone heard of KUDA? Its better than Parquet. But I think that
>> someone might just start saying that KUDA has difficult lineage as well.
>> After all dynastic rules dictate.
>>
>> Personally I feel that if something stores my data compressed and makes
>> me access it faster I do not care where it comes from or how difficult the
>> child birth was :)
>>
>>
>> Regards,
>> Gourav
>>
>> On Tue, Jul 26, 2016 at 11:19 PM, Sudhir Babu Pothineni <
>> sbpothineni@gmail.com> wrote:
>>
>>> Just correction:
>>>
>>> ORC Java libraries from Hive are forked into Apache ORC. Vectorization
>>> default.
>>>
>>> Do not know If Spark leveraging this new repo?
>>>
>>> <dependency>
>>>  <groupId>org.apache.orc</groupId>
>>>     <artifactId>orc</artifactId>
>>>     <version>1.1.2</version>
>>>     <type>pom</type>
>>> </dependency>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>> Sent from my iPhone
>>> On Jul 26, 2016, at 4:50 PM, Koert Kuipers <ko...@tresata.com> wrote:
>>>
>>> parquet was inspired by dremel but written from the ground up as a
>>> library with support for a variety of big data systems (hive, pig, impala,
>>> cascading, etc.). it is also easy to add new support, since its a proper
>>> library.
>>>
>>> orc bas been enhanced while deployed at facebook in hive and at yahoo in
>>> hive. just hive. it didn't really exist by itself. it was part of the big
>>> java soup that is called hive, without an easy way to extract it. hive does
>>> not expose proper java apis. it never cared for that.
>>>
>>> On Tue, Jul 26, 2016 at 9:57 AM, Ovidiu-Cristian MARCU <
>>> ovidiu-cristian.marcu@inria.fr> wrote:
>>>
>>>> Interesting opinion, thank you
>>>>
>>>> Still, on the website parquet is basically inspired by Dremel (Google)
>>>> [1] and part of orc has been enhanced while deployed for Facebook, Yahoo
>>>> [2].
>>>>
>>>> Other than this presentation [3], do you guys know any other benchmark?
>>>>
>>>> [1]https://parquet.apache.org/documentation/latest/
>>>> [2]https://orc.apache.org/docs/
>>>> [3]
>>>> http://www.slideshare.net/oom65/file-format-benchmarks-avro-json-orc-parquet
>>>>
>>>> On 26 Jul 2016, at 15:19, Koert Kuipers <ko...@tresata.com> wrote:
>>>>
>>>> when parquet came out it was developed by a community of companies, and
>>>> was designed as a library to be supported by multiple big data projects.
>>>> nice
>>>>
>>>> orc on the other hand initially only supported hive. it wasn't even
>>>> designed as a library that can be re-used. even today it brings in the
>>>> kitchen sink of transitive dependencies. yikes
>>>>
>>>> On Jul 26, 2016 5:09 AM, "Jörn Franke" <jo...@gmail.com> wrote:
>>>>
>>>>> I think both are very similar, but with slightly different goals.
>>>>> While they work transparently for each Hadoop application you need to
>>>>> enable specific support in the application for predicate push down.
>>>>> In the end you have to check which application you are using and do
>>>>> some tests (with correct predicate push down configuration). Keep in mind
>>>>> that both formats work best if they are sorted on filter columns (which is
>>>>> your responsibility) and if their optimatizations are correctly configured
>>>>> (min max index, bloom filter, compression etc) .
>>>>>
>>>>> If you need to ingest sensor data you may want to store it first in
>>>>> hbase and then batch process it in large files in Orc or parquet format.
>>>>>
>>>>> On 26 Jul 2016, at 04:09, janardhan shetty <ja...@gmail.com>
>>>>> wrote:
>>>>>
>>>>> Just wondering advantages and disadvantages to convert data into ORC
>>>>> or Parquet.
>>>>>
>>>>> In the documentation of Spark there are numerous examples of Parquet
>>>>> format.
>>>>>
>>>>> Any strong reasons to chose Parquet over ORC file format ?
>>>>>
>>>>> Also : current data compression is bzip2
>>>>>
>>>>>
>>>>> http://stackoverflow.com/questions/32373460/parquet-vs-orc-vs-orc-with-snappy
>>>>> This seems like biased.
>>>>>
>>>>>
>>>>
>>>
>>

Re: ORC v/s Parquet for Spark 2.0

Posted by Sudhir Babu Pothineni <sb...@gmail.com>.
It depends on what you are doing; here is a recent comparison of ORC and
Parquet

https://www.slideshare.net/mobile/oom65/file-format-benchmarks-avro-json-orc-parquet

Although it is from the ORC authors, I thought it was a fair comparison. We use ORC as the system of record on our Cloudera HDFS cluster, and our experience so far is good.

Parquet is backed by Cloudera, which has more Hadoop installations. ORC is backed by Hortonworks, so the battle of file formats continues...

Sent from my iPhone

> On Jul 27, 2016, at 4:54 PM, janardhan shetty <ja...@gmail.com> wrote:
> 
> Seems like parquet format is better comparatively to orc when the dataset is log data without nested structures? Is this fair understanding ?
> 
>> On Jul 27, 2016 1:30 PM, "Jörn Franke" <jo...@gmail.com> wrote:
>> Kudu has been from my impression be designed to offer somethings between hbase and parquet for write intensive loads - it is not faster for warehouse type of querying compared to parquet (merely slower, because that is not its use case).   I assume this is still the strategy of it.
>> 
>> For some scenarios it could make sense together with parquet and Orc. However I am not sure what the advantage towards using hbase + parquet and Orc.
>> 
>>> On 27 Jul 2016, at 11:47, "Uwe@Moosheimer.com" <Uw...@Moosheimer.com> wrote:
>>> 
>>> Hi Gourav,
>>> 
>>> Kudu (if you mean Apache Kuda, the Cloudera originated project) is a in memory db with data storage while Parquet is "only" a columnar storage format.
>>> 
>>> As I understand, Kudu is a BI db to compete with Exasol or Hana (ok ... that's more a wish :-).
>>> 
>>> Regards,
>>> Uwe
>>> 
>>> Mit freundlichen Grüßen / best regards
>>> Kay-Uwe Moosheimer
>>> 
>>>> Am 27.07.2016 um 09:15 schrieb Gourav Sengupta <go...@gmail.com>:
>>>> 
>>>> Gosh,
>>>> 
>>>> whether ORC came from this or that, it runs queries in HIVE with TEZ at a speed that is better than SPARK.
>>>> 
>>>> Has anyone heard of KUDA? Its better than Parquet. But I think that someone might just start saying that KUDA has difficult lineage as well. After all dynastic rules dictate.
>>>> 
>>>> Personally I feel that if something stores my data compressed and makes me access it faster I do not care where it comes from or how difficult the child birth was :)
>>>> 
>>>> 
>>>> Regards,
>>>> Gourav
>>>> 
>>>>> On Tue, Jul 26, 2016 at 11:19 PM, Sudhir Babu Pothineni <sb...@gmail.com> wrote:
>>>>> Just correction:
>>>>> 
>>>>> ORC Java libraries from Hive are forked into Apache ORC. Vectorization default. 
>>>>> 
>>>>> Do not know If Spark leveraging this new repo?
>>>>> 
>>>>> <dependency>
>>>>>  <groupId>org.apache.orc</groupId>
>>>>>     <artifactId>orc</artifactId>
>>>>>     <version>1.1.2</version>
>>>>>     <type>pom</type>
>>>>> </dependency>
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> Sent from my iPhone
>>>>>> On Jul 26, 2016, at 4:50 PM, Koert Kuipers <ko...@tresata.com> wrote:
>>>>>> 
>>>>> 
>>>>>> parquet was inspired by dremel but written from the ground up as a library with support for a variety of big data systems (hive, pig, impala, cascading, etc.). it is also easy to add new support, since its a proper library.
>>>>>> 
>>>>>> orc bas been enhanced while deployed at facebook in hive and at yahoo in hive. just hive. it didn't really exist by itself. it was part of the big java soup that is called hive, without an easy way to extract it. hive does not expose proper java apis. it never cared for that.
>>>>>> 
>>>>>>> On Tue, Jul 26, 2016 at 9:57 AM, Ovidiu-Cristian MARCU <ov...@inria.fr> wrote:
>>>>>>> Interesting opinion, thank you
>>>>>>> 
>>>>>>> Still, on the website parquet is basically inspired by Dremel (Google) [1] and part of orc has been enhanced while deployed for Facebook, Yahoo [2].
>>>>>>> 
>>>>>>> Other than this presentation [3], do you guys know any other benchmark?
>>>>>>> 
>>>>>>> [1]https://parquet.apache.org/documentation/latest/
>>>>>>> [2]https://orc.apache.org/docs/
>>>>>>> [3] http://www.slideshare.net/oom65/file-format-benchmarks-avro-json-orc-parquet
>>>>>>> 
>>>>>>>> On 26 Jul 2016, at 15:19, Koert Kuipers <ko...@tresata.com> wrote:
>>>>>>>> 
>>>>>>>> when parquet came out it was developed by a community of companies, and was designed as a library to be supported by multiple big data projects. nice
>>>>>>>> 
>>>>>>>> orc on the other hand initially only supported hive. it wasn't even designed as a library that can be re-used. even today it brings in the kitchen sink of transitive dependencies. yikes
>>>>>>>> 
>>>>>>>> 
>>>>>>>>> On Jul 26, 2016 5:09 AM, "Jörn Franke" <jo...@gmail.com> wrote:
>>>>>>>>> I think both are very similar, but with slightly different goals. While they work transparently for each Hadoop application you need to enable specific support in the application for predicate push down. 
>>>>>>>>> In the end you have to check which application you are using and do some tests (with correct predicate push down configuration). Keep in mind that both formats work best if they are sorted on filter columns (which is your responsibility) and if their optimatizations are correctly configured (min max index, bloom filter, compression etc) . 
>>>>>>>>> 
>>>>>>>>> If you need to ingest sensor data you may want to store it first in hbase and then batch process it in large files in Orc or parquet format.
>>>>>>>>> 
>>>>>>>>>> On 26 Jul 2016, at 04:09, janardhan shetty <ja...@gmail.com> wrote:
>>>>>>>>>> 
>>>>>>>>>> Just wondering advantages and disadvantages to convert data into ORC or Parquet. 
>>>>>>>>>> 
>>>>>>>>>> In the documentation of Spark there are numerous examples of Parquet format. 
>>>>>>>>>> 
>>>>>>>>>> Any strong reasons to chose Parquet over ORC file format ?
>>>>>>>>>> 
>>>>>>>>>> Also : current data compression is bzip2
>>>>>>>>>> 
>>>>>>>>>> http://stackoverflow.com/questions/32373460/parquet-vs-orc-vs-orc-with-snappy 
>>>>>>>>>> This seems like biased.

Re: ORC v/s Parquet for Spark 2.0

Posted by janardhan shetty <ja...@gmail.com>.
It seems like the Parquet format compares better to ORC when the dataset
is log data without nested structures? Is this a fair understanding?
On Jul 27, 2016 1:30 PM, "Jörn Franke" <jo...@gmail.com> wrote:

> Kudu has been from my impression be designed to offer somethings between
> hbase and parquet for write intensive loads - it is not faster for
> warehouse type of querying compared to parquet (merely slower, because that
> is not its use case).   I assume this is still the strategy of it.
>
> For some scenarios it could make sense together with parquet and Orc.
> However I am not sure what the advantage towards using hbase + parquet and
> Orc.
>
> On 27 Jul 2016, at 11:47, "Uwe@Moosheimer.com <Uw...@moosheimer.com>" <
> Uwe@Moosheimer.com <Uw...@moosheimer.com>> wrote:
>
> Hi Gourav,
>
> Kudu (if you mean Apache Kuda, the Cloudera originated project) is a in
> memory db with data storage while Parquet is "only" a columnar
> storage format.
>
> As I understand, Kudu is a BI db to compete with Exasol or Hana (ok ...
> that's more a wish :-).
>
> Regards,
> Uwe
>
> Mit freundlichen Grüßen / best regards
> Kay-Uwe Moosheimer
>
> Am 27.07.2016 um 09:15 schrieb Gourav Sengupta <gourav.sengupta@gmail.com
> >:
>
> Gosh,
>
> whether ORC came from this or that, it runs queries in HIVE with TEZ at a
> speed that is better than SPARK.
>
> Has anyone heard of KUDA? Its better than Parquet. But I think that
> someone might just start saying that KUDA has difficult lineage as well.
> After all dynastic rules dictate.
>
> Personally I feel that if something stores my data compressed and makes me
> access it faster I do not care where it comes from or how difficult the
> child birth was :)
>
>
> Regards,
> Gourav
>
> On Tue, Jul 26, 2016 at 11:19 PM, Sudhir Babu Pothineni <
> sbpothineni@gmail.com> wrote:
>
>> Just correction:
>>
>> ORC Java libraries from Hive are forked into Apache ORC. Vectorization
>> default.
>>
>> Do not know If Spark leveraging this new repo?
>>
>> <dependency>
>>  <groupId>org.apache.orc</groupId>
>>     <artifactId>orc</artifactId>
>>     <version>1.1.2</version>
>>     <type>pom</type>
>> </dependency>
>>
>>
>>
>>
>>
>>
>>
>>
>> Sent from my iPhone
>> On Jul 26, 2016, at 4:50 PM, Koert Kuipers <ko...@tresata.com> wrote:
>>
>> parquet was inspired by dremel but written from the ground up as a
>> library with support for a variety of big data systems (hive, pig, impala,
>> cascading, etc.). it is also easy to add new support, since its a proper
>> library.
>>
>> orc bas been enhanced while deployed at facebook in hive and at yahoo in
>> hive. just hive. it didn't really exist by itself. it was part of the big
>> java soup that is called hive, without an easy way to extract it. hive does
>> not expose proper java apis. it never cared for that.
>>
>> On Tue, Jul 26, 2016 at 9:57 AM, Ovidiu-Cristian MARCU <
>> ovidiu-cristian.marcu@inria.fr> wrote:
>>
>>> Interesting opinion, thank you
>>>
>>> Still, on the website parquet is basically inspired by Dremel (Google)
>>> [1] and part of orc has been enhanced while deployed for Facebook, Yahoo
>>> [2].
>>>
>>> Other than this presentation [3], do you guys know any other benchmark?
>>>
>>> [1]https://parquet.apache.org/documentation/latest/
>>> [2]https://orc.apache.org/docs/
>>> [3]
>>> http://www.slideshare.net/oom65/file-format-benchmarks-avro-json-orc-parquet
>>>
>>> On 26 Jul 2016, at 15:19, Koert Kuipers <ko...@tresata.com> wrote:
>>>
>>> when parquet came out it was developed by a community of companies, and
>>> was designed as a library to be supported by multiple big data projects.
>>> nice
>>>
>>> orc on the other hand initially only supported hive. it wasn't even
>>> designed as a library that can be re-used. even today it brings in the
>>> kitchen sink of transitive dependencies. yikes
>>>
>>> On Jul 26, 2016 5:09 AM, "Jörn Franke" <jo...@gmail.com> wrote:
>>>
>>>> I think both are very similar, but with slightly different goals. While
>>>> they work transparently for each Hadoop application you need to enable
>>>> specific support in the application for predicate push down.
>>>> In the end you have to check which application you are using and do
>>>> some tests (with correct predicate push down configuration). Keep in mind
>>>> that both formats work best if they are sorted on filter columns (which is
>>>> your responsibility) and if their optimatizations are correctly configured
>>>> (min max index, bloom filter, compression etc) .
>>>>
>>>> If you need to ingest sensor data you may want to store it first in
>>>> hbase and then batch process it in large files in Orc or parquet format.
>>>>
>>>> On 26 Jul 2016, at 04:09, janardhan shetty <ja...@gmail.com>
>>>> wrote:
>>>>
>>>> Just wondering advantages and disadvantages to convert data into ORC or
>>>> Parquet.
>>>>
>>>> In the documentation of Spark there are numerous examples of Parquet
>>>> format.
>>>>
>>>> Any strong reasons to chose Parquet over ORC file format ?
>>>>
>>>> Also : current data compression is bzip2
>>>>
>>>>
>>>> http://stackoverflow.com/questions/32373460/parquet-vs-orc-vs-orc-with-snappy
>>>> This seems like biased.
>>>>
>>>>
>>>
>>
>

Re: ORC v/s Parquet for Spark 2.0

Posted by Jörn Franke <jo...@gmail.com>.
Kudu has, from my impression, been designed to offer something between hbase and parquet for write-intensive loads - it is not faster for warehouse-type querying compared to parquet (merely slower, because that is not its use case). I assume this is still its strategy.

For some scenarios it could make sense together with parquet and Orc. However, I am not sure what the advantage is over using hbase + parquet and Orc.

> On 27 Jul 2016, at 11:47, "Uwe@Moosheimer.com" <Uw...@Moosheimer.com> wrote:
> 
> Hi Gourav,
> 
> Kudu (if you mean Apache Kuda, the Cloudera originated project) is a in memory db with data storage while Parquet is "only" a columnar storage format.
> 
> As I understand, Kudu is a BI db to compete with Exasol or Hana (ok ... that's more a wish :-).
> 
> Regards,
> Uwe
> 
> Mit freundlichen Grüßen / best regards
> Kay-Uwe Moosheimer
> 
>> Am 27.07.2016 um 09:15 schrieb Gourav Sengupta <go...@gmail.com>:
>> 
>> Gosh,
>> 
>> whether ORC came from this or that, it runs queries in HIVE with TEZ at a speed that is better than SPARK.
>> 
>> Has anyone heard of KUDA? Its better than Parquet. But I think that someone might just start saying that KUDA has difficult lineage as well. After all dynastic rules dictate.
>> 
>> Personally I feel that if something stores my data compressed and makes me access it faster I do not care where it comes from or how difficult the child birth was :)
>> 
>> 
>> Regards,
>> Gourav
>> 
>>> On Tue, Jul 26, 2016 at 11:19 PM, Sudhir Babu Pothineni <sb...@gmail.com> wrote:
>>> Just correction:
>>> 
>>> ORC Java libraries from Hive are forked into Apache ORC. Vectorization default. 
>>> 
>>> Do not know If Spark leveraging this new repo?
>>> 
>>> <dependency>
>>>  <groupId>org.apache.orc</groupId>
>>>     <artifactId>orc</artifactId>
>>>     <version>1.1.2</version>
>>>     <type>pom</type>
>>> </dependency>
>>> 
>>> 
>>> 
>>> 
>>> 
>>> 
>>> 
>>> 
>>> Sent from my iPhone
>>>> On Jul 26, 2016, at 4:50 PM, Koert Kuipers <ko...@tresata.com> wrote:
>>>> 
>>> 
>>>> parquet was inspired by dremel but written from the ground up as a library with support for a variety of big data systems (hive, pig, impala, cascading, etc.). it is also easy to add new support, since its a proper library.
>>>> 
>>>> orc bas been enhanced while deployed at facebook in hive and at yahoo in hive. just hive. it didn't really exist by itself. it was part of the big java soup that is called hive, without an easy way to extract it. hive does not expose proper java apis. it never cared for that.
>>>> 
>>>>> On Tue, Jul 26, 2016 at 9:57 AM, Ovidiu-Cristian MARCU <ov...@inria.fr> wrote:
>>>>> Interesting opinion, thank you
>>>>> 
>>>>> Still, on the website parquet is basically inspired by Dremel (Google) [1] and part of orc has been enhanced while deployed for Facebook, Yahoo [2].
>>>>> 
>>>>> Other than this presentation [3], do you guys know any other benchmark?
>>>>> 
>>>>> [1]https://parquet.apache.org/documentation/latest/
>>>>> [2]https://orc.apache.org/docs/
>>>>> [3] http://www.slideshare.net/oom65/file-format-benchmarks-avro-json-orc-parquet
>>>>> 
>>>>>> On 26 Jul 2016, at 15:19, Koert Kuipers <ko...@tresata.com> wrote:
>>>>>> 
>>>>>> when parquet came out it was developed by a community of companies, and was designed as a library to be supported by multiple big data projects. nice
>>>>>> 
>>>>>> orc on the other hand initially only supported hive. it wasn't even designed as a library that can be re-used. even today it brings in the kitchen sink of transitive dependencies. yikes
>>>>>> 
>>>>>> 
>>>>>>> On Jul 26, 2016 5:09 AM, "Jörn Franke" <jo...@gmail.com> wrote:
>>>>>>> I think both are very similar, but with slightly different goals. While they work transparently for each Hadoop application you need to enable specific support in the application for predicate push down. 
>>>>>>> In the end you have to check which application you are using and do some tests (with correct predicate push down configuration). Keep in mind that both formats work best if they are sorted on filter columns (which is your responsibility) and if their optimatizations are correctly configured (min max index, bloom filter, compression etc) . 
>>>>>>> 
>>>>>>> If you need to ingest sensor data you may want to store it first in hbase and then batch process it in large files in Orc or parquet format.
>>>>>>> 
>>>>>>>> On 26 Jul 2016, at 04:09, janardhan shetty <ja...@gmail.com> wrote:
>>>>>>>> 
>>>>>>>> Just wondering advantages and disadvantages to convert data into ORC or Parquet. 
>>>>>>>> 
>>>>>>>> In the documentation of Spark there are numerous examples of Parquet format. 
>>>>>>>> 
>>>>>>>> Any strong reasons to chose Parquet over ORC file format ?
>>>>>>>> 
>>>>>>>> Also : current data compression is bzip2
>>>>>>>> 
>>>>>>>> http://stackoverflow.com/questions/32373460/parquet-vs-orc-vs-orc-with-snappy 
>>>>>>>> This seems like biased.
>> 

Re: ORC v/s Parquet for Spark 2.0

Posted by "Uwe@Moosheimer.com" <Uw...@Moosheimer.com>.
Hi Gourav,

Kudu (if you mean Apache Kudu, the Cloudera-originated project) is an in-memory db with data storage, while Parquet is "only" a columnar storage format.

As I understand it, Kudu is a BI db meant to compete with Exasol or Hana (ok ... that's more a wish :-).

Regards,
Uwe

Mit freundlichen Grüßen / best regards
Kay-Uwe Moosheimer

> Am 27.07.2016 um 09:15 schrieb Gourav Sengupta <go...@gmail.com>:
> 
> Gosh,
> 
> whether ORC came from this or that, it runs queries in HIVE with TEZ at a speed that is better than SPARK.
> 
> Has anyone heard of KUDA? Its better than Parquet. But I think that someone might just start saying that KUDA has difficult lineage as well. After all dynastic rules dictate.
> 
> Personally I feel that if something stores my data compressed and makes me access it faster I do not care where it comes from or how difficult the child birth was :)
> 
> 
> Regards,
> Gourav
> 
>> On Tue, Jul 26, 2016 at 11:19 PM, Sudhir Babu Pothineni <sb...@gmail.com> wrote:
>> Just correction:
>> 
>> ORC Java libraries from Hive are forked into Apache ORC. Vectorization default. 
>> 
>> Do not know If Spark leveraging this new repo?
>> 
>> <dependency>
>>  <groupId>org.apache.orc</groupId>
>>     <artifactId>orc</artifactId>
>>     <version>1.1.2</version>
>>     <type>pom</type>
>> </dependency>
>> 
>> 
>> 
>> 
>> 
>> 
>> 
>> 
>> Sent from my iPhone
>>> On Jul 26, 2016, at 4:50 PM, Koert Kuipers <ko...@tresata.com> wrote:
>>> 
>> 
>>> parquet was inspired by dremel but written from the ground up as a library with support for a variety of big data systems (hive, pig, impala, cascading, etc.). it is also easy to add new support, since its a proper library.
>>> 
>>> orc bas been enhanced while deployed at facebook in hive and at yahoo in hive. just hive. it didn't really exist by itself. it was part of the big java soup that is called hive, without an easy way to extract it. hive does not expose proper java apis. it never cared for that.
>>> 
>>>> On Tue, Jul 26, 2016 at 9:57 AM, Ovidiu-Cristian MARCU <ov...@inria.fr> wrote:
>>>> Interesting opinion, thank you
>>>> 
>>>> Still, on the website parquet is basically inspired by Dremel (Google) [1] and part of orc has been enhanced while deployed for Facebook, Yahoo [2].
>>>> 
>>>> Other than this presentation [3], do you guys know any other benchmark?
>>>> 
>>>> [1]https://parquet.apache.org/documentation/latest/
>>>> [2]https://orc.apache.org/docs/
>>>> [3] http://www.slideshare.net/oom65/file-format-benchmarks-avro-json-orc-parquet
>>>> 
>>>>> On 26 Jul 2016, at 15:19, Koert Kuipers <ko...@tresata.com> wrote:
>>>>> 
>>>>> when parquet came out it was developed by a community of companies, and was designed as a library to be supported by multiple big data projects. nice
>>>>> 
>>>>> orc on the other hand initially only supported hive. it wasn't even designed as a library that can be re-used. even today it brings in the kitchen sink of transitive dependencies. yikes
>>>>> 
>>>>> 
>>>>>> On Jul 26, 2016 5:09 AM, "Jörn Franke" <jo...@gmail.com> wrote:
>>>>>> I think both are very similar, but with slightly different goals. While they work transparently for each Hadoop application you need to enable specific support in the application for predicate push down. 
>>>>>> In the end you have to check which application you are using and do some tests (with correct predicate push down configuration). Keep in mind that both formats work best if they are sorted on filter columns (which is your responsibility) and if their optimatizations are correctly configured (min max index, bloom filter, compression etc) . 
>>>>>> 
>>>>>> If you need to ingest sensor data you may want to store it first in hbase and then batch process it in large files in Orc or parquet format.
>>>>>> 
>>>>>>> On 26 Jul 2016, at 04:09, janardhan shetty <ja...@gmail.com> wrote:
>>>>>>> 
>>>>>>> Just wondering advantages and disadvantages to convert data into ORC or Parquet. 
>>>>>>> 
>>>>>>> In the documentation of Spark there are numerous examples of Parquet format. 
>>>>>>> 
>>>>>>> Any strong reasons to chose Parquet over ORC file format ?
>>>>>>> 
>>>>>>> Also : current data compression is bzip2
>>>>>>> 
>>>>>>> http://stackoverflow.com/questions/32373460/parquet-vs-orc-vs-orc-with-snappy 
>>>>>>> This seems like biased.
> 

Re: ORC v/s Parquet for Spark 2.0

Posted by Gourav Sengupta <go...@gmail.com>.
Gosh,

whether ORC came from this or that, it runs queries in HIVE with TEZ at a
speed that is better than SPARK.

Has anyone heard of KUDA? It's better than Parquet. But I think that someone
might just start saying that KUDA has a difficult lineage as well. After all,
dynastic rules dictate.

Personally, I feel that if something stores my data compressed and lets me
access it faster, I do not care where it comes from or how difficult the
childbirth was :)


Regards,
Gourav

On Tue, Jul 26, 2016 at 11:19 PM, Sudhir Babu Pothineni <
sbpothineni@gmail.com> wrote:

> Just correction:
>
> ORC Java libraries from Hive are forked into Apache ORC. Vectorization
> default.
>
> Do not know If Spark leveraging this new repo?
>
> <dependency>
>  <groupId>org.apache.orc</groupId>
>     <artifactId>orc</artifactId>
>     <version>1.1.2</version>
>     <type>pom</type>
> </dependency>
>
>
>
>
>
>
>
>
> Sent from my iPhone
> On Jul 26, 2016, at 4:50 PM, Koert Kuipers <ko...@tresata.com> wrote:
>
> parquet was inspired by dremel but written from the ground up as a library
> with support for a variety of big data systems (hive, pig, impala,
> cascading, etc.). it is also easy to add new support, since its a proper
> library.
>
> orc bas been enhanced while deployed at facebook in hive and at yahoo in
> hive. just hive. it didn't really exist by itself. it was part of the big
> java soup that is called hive, without an easy way to extract it. hive does
> not expose proper java apis. it never cared for that.
>
> On Tue, Jul 26, 2016 at 9:57 AM, Ovidiu-Cristian MARCU <
> ovidiu-cristian.marcu@inria.fr> wrote:
>
>> Interesting opinion, thank you
>>
>> Still, on the website parquet is basically inspired by Dremel (Google)
>> [1] and part of orc has been enhanced while deployed for Facebook, Yahoo
>> [2].
>>
>> Other than this presentation [3], do you guys know any other benchmark?
>>
>> [1]https://parquet.apache.org/documentation/latest/
>> [2]https://orc.apache.org/docs/
>> [3]
>> http://www.slideshare.net/oom65/file-format-benchmarks-avro-json-orc-parquet
>>
>> On 26 Jul 2016, at 15:19, Koert Kuipers <ko...@tresata.com> wrote:
>>
>> when parquet came out it was developed by a community of companies, and
>> was designed as a library to be supported by multiple big data projects.
>> nice
>>
>> orc on the other hand initially only supported hive. it wasn't even
>> designed as a library that can be re-used. even today it brings in the
>> kitchen sink of transitive dependencies. yikes
>>
>> On Jul 26, 2016 5:09 AM, "Jörn Franke" <jo...@gmail.com> wrote:
>>
>>> I think both are very similar, but with slightly different goals. While
>>> they work transparently for each Hadoop application you need to enable
>>> specific support in the application for predicate push down.
>>> In the end you have to check which application you are using and do some
>>> tests (with correct predicate push down configuration). Keep in mind that
>>> both formats work best if they are sorted on filter columns (which is your
>>> responsibility) and if their optimatizations are correctly configured (min
>>> max index, bloom filter, compression etc) .
>>>
>>> If you need to ingest sensor data you may want to store it first in
>>> hbase and then batch process it in large files in Orc or parquet format.
>>>
>>> On 26 Jul 2016, at 04:09, janardhan shetty <ja...@gmail.com>
>>> wrote:
>>>
>>> Just wondering advantages and disadvantages to convert data into ORC or
>>> Parquet.
>>>
>>> In the documentation of Spark there are numerous examples of Parquet
>>> format.
>>>
>>> Any strong reasons to chose Parquet over ORC file format ?
>>>
>>> Also : current data compression is bzip2
>>>
>>>
>>> http://stackoverflow.com/questions/32373460/parquet-vs-orc-vs-orc-with-snappy
>>> This seems like biased.
>>>
>>>
>>
>

Re: ORC v/s Parquet for Spark 2.0

Posted by Koert Kuipers <ko...@tresata.com>.
i don't think so, but that sounds like a good idea

On Tue, Jul 26, 2016 at 6:19 PM, Sudhir Babu Pothineni <
sbpothineni@gmail.com> wrote:

> Just correction:
>
> ORC Java libraries from Hive are forked into Apache ORC. Vectorization
> default.
>
> Do not know If Spark leveraging this new repo?
>
> <dependency>
>  <groupId>org.apache.orc</groupId>
>     <artifactId>orc</artifactId>
>     <version>1.1.2</version>
>     <type>pom</type>
> </dependency>
>
>
>
>
>
>
>
>
> Sent from my iPhone
> On Jul 26, 2016, at 4:50 PM, Koert Kuipers <ko...@tresata.com> wrote:
>
> parquet was inspired by dremel but written from the ground up as a library
> with support for a variety of big data systems (hive, pig, impala,
> cascading, etc.). it is also easy to add new support, since its a proper
> library.
>
> orc bas been enhanced while deployed at facebook in hive and at yahoo in
> hive. just hive. it didn't really exist by itself. it was part of the big
> java soup that is called hive, without an easy way to extract it. hive does
> not expose proper java apis. it never cared for that.
>
> On Tue, Jul 26, 2016 at 9:57 AM, Ovidiu-Cristian MARCU <
> ovidiu-cristian.marcu@inria.fr> wrote:
>
>> Interesting opinion, thank you
>>
>> Still, on the website parquet is basically inspired by Dremel (Google)
>> [1] and part of orc has been enhanced while deployed for Facebook, Yahoo
>> [2].
>>
>> Other than this presentation [3], do you guys know any other benchmark?
>>
>> [1]https://parquet.apache.org/documentation/latest/
>> [2]https://orc.apache.org/docs/
>> [3]
>> http://www.slideshare.net/oom65/file-format-benchmarks-avro-json-orc-parquet
>>
>> On 26 Jul 2016, at 15:19, Koert Kuipers <ko...@tresata.com> wrote:
>>
>> when parquet came out it was developed by a community of companies, and
>> was designed as a library to be supported by multiple big data projects.
>> nice
>>
>> orc on the other hand initially only supported hive. it wasn't even
>> designed as a library that can be re-used. even today it brings in the
>> kitchen sink of transitive dependencies. yikes
>>
>> On Jul 26, 2016 5:09 AM, "Jörn Franke" <jo...@gmail.com> wrote:
>>
>>> I think both are very similar, but with slightly different goals. While
>>> they work transparently for each Hadoop application you need to enable
>>> specific support in the application for predicate push down.
>>> In the end you have to check which application you are using and do some
>>> tests (with correct predicate push down configuration). Keep in mind that
>>> both formats work best if they are sorted on filter columns (which is your
>>> responsibility) and if their optimatizations are correctly configured (min
>>> max index, bloom filter, compression etc) .
>>>
>>> If you need to ingest sensor data you may want to store it first in
>>> hbase and then batch process it in large files in Orc or parquet format.
>>>
>>> On 26 Jul 2016, at 04:09, janardhan shetty <ja...@gmail.com>
>>> wrote:
>>>
>>> Just wondering advantages and disadvantages to convert data into ORC or
>>> Parquet.
>>>
>>> In the documentation of Spark there are numerous examples of Parquet
>>> format.
>>>
>>> Any strong reasons to chose Parquet over ORC file format ?
>>>
>>> Also : current data compression is bzip2
>>>
>>>
>>> http://stackoverflow.com/questions/32373460/parquet-vs-orc-vs-orc-with-snappy
>>> This seems like biased.
>>>
>>>
>>
>

Re: ORC v/s Parquet for Spark 2.0

Posted by Sudhir Babu Pothineni <sb...@gmail.com>.
Just a correction:

The ORC Java libraries from Hive have been forked into Apache ORC, with vectorization on by default.

Does anyone know if Spark is leveraging this new repo?

<dependency>
    <groupId>org.apache.orc</groupId>
    <artifactId>orc</artifactId>
    <version>1.1.2</version>
    <type>pom</type>
</dependency>
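
Regardless of which ORC implementation Spark ends up bundling, the DataFrame API already reads and writes ORC. A minimal Scala sketch, where the paths are made-up placeholders and the enableHiveSupport call reflects the assumption that Spark 2.0 still ships its ORC data source via the Hive module:

  import org.apache.spark.sql.SparkSession

  // Hive support enabled because the ORC data source lives in Spark's Hive module in 2.0
  val spark = SparkSession.builder()
    .appName("orc-roundtrip")
    .enableHiveSupport()
    .getOrCreate()

  // hypothetical input; any DataFrame works here
  val df = spark.read.json("/data/raw/events.json")

  // write as ORC, then read it back
  df.write.mode("overwrite").orc("/data/warehouse/events_orc")
  val events = spark.read.orc("/data/warehouse/events_orc")
  events.printSchema()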








Sent from my iPhone
> On Jul 26, 2016, at 4:50 PM, Koert Kuipers <ko...@tresata.com> wrote:
> 
> parquet was inspired by dremel but written from the ground up as a library with support for a variety of big data systems (hive, pig, impala, cascading, etc.). it is also easy to add new support, since its a proper library.
> 
> orc bas been enhanced while deployed at facebook in hive and at yahoo in hive. just hive. it didn't really exist by itself. it was part of the big java soup that is called hive, without an easy way to extract it. hive does not expose proper java apis. it never cared for that.
> 
>> On Tue, Jul 26, 2016 at 9:57 AM, Ovidiu-Cristian MARCU <ov...@inria.fr> wrote:
>> Interesting opinion, thank you
>> 
>> Still, on the website parquet is basically inspired by Dremel (Google) [1] and part of orc has been enhanced while deployed for Facebook, Yahoo [2].
>> 
>> Other than this presentation [3], do you guys know any other benchmark?
>> 
>> [1]https://parquet.apache.org/documentation/latest/
>> [2]https://orc.apache.org/docs/
>> [3] http://www.slideshare.net/oom65/file-format-benchmarks-avro-json-orc-parquet
>> 
>>> On 26 Jul 2016, at 15:19, Koert Kuipers <ko...@tresata.com> wrote:
>>> 
>>> when parquet came out it was developed by a community of companies, and was designed as a library to be supported by multiple big data projects. nice
>>> 
>>> orc on the other hand initially only supported hive. it wasn't even designed as a library that can be re-used. even today it brings in the kitchen sink of transitive dependencies. yikes
>>> 
>>> 
>>>> On Jul 26, 2016 5:09 AM, "Jörn Franke" <jo...@gmail.com> wrote:
>>>> I think both are very similar, but with slightly different goals. While they work transparently for each Hadoop application you need to enable specific support in the application for predicate push down. 
>>>> In the end you have to check which application you are using and do some tests (with correct predicate push down configuration). Keep in mind that both formats work best if they are sorted on filter columns (which is your responsibility) and if their optimatizations are correctly configured (min max index, bloom filter, compression etc) . 
>>>> 
>>>> If you need to ingest sensor data you may want to store it first in hbase and then batch process it in large files in Orc or parquet format.
>>>> 
>>>>> On 26 Jul 2016, at 04:09, janardhan shetty <ja...@gmail.com> wrote:
>>>>> 
>>>>> Just wondering advantages and disadvantages to convert data into ORC or Parquet. 
>>>>> 
>>>>> In the documentation of Spark there are numerous examples of Parquet format. 
>>>>> 
>>>>> Any strong reasons to chose Parquet over ORC file format ?
>>>>> 
>>>>> Also : current data compression is bzip2
>>>>> 
>>>>> http://stackoverflow.com/questions/32373460/parquet-vs-orc-vs-orc-with-snappy 
>>>>> This seems like biased.
> 

Re: ORC v/s Parquet for Spark 2.0

Posted by Koert Kuipers <ko...@tresata.com>.
parquet was inspired by dremel but written from the ground up as a library
with support for a variety of big data systems (hive, pig, impala,
cascading, etc.). it is also easy to add new support, since it's a proper
library.

orc has been enhanced while deployed at facebook in hive and at yahoo in
hive. just hive. it didn't really exist by itself. it was part of the big
java soup that is called hive, without an easy way to extract it. hive does
not expose proper java apis. it never cared for that.

On Tue, Jul 26, 2016 at 9:57 AM, Ovidiu-Cristian MARCU <
ovidiu-cristian.marcu@inria.fr> wrote:

> Interesting opinion, thank you
>
> Still, on the website parquet is basically inspired by Dremel (Google) [1]
> and part of orc has been enhanced while deployed for Facebook, Yahoo [2].
>
> Other than this presentation [3], do you guys know any other benchmark?
>
> [1]https://parquet.apache.org/documentation/latest/
> [2]https://orc.apache.org/docs/
> [3]
> http://www.slideshare.net/oom65/file-format-benchmarks-avro-json-orc-parquet
>
> On 26 Jul 2016, at 15:19, Koert Kuipers <ko...@tresata.com> wrote:
>
> when parquet came out it was developed by a community of companies, and
> was designed as a library to be supported by multiple big data projects.
> nice
>
> orc on the other hand initially only supported hive. it wasn't even
> designed as a library that can be re-used. even today it brings in the
> kitchen sink of transitive dependencies. yikes
>
> On Jul 26, 2016 5:09 AM, "Jörn Franke" <jo...@gmail.com> wrote:
>
>> I think both are very similar, but with slightly different goals. While
>> they work transparently for each Hadoop application you need to enable
>> specific support in the application for predicate push down.
>> In the end you have to check which application you are using and do some
>> tests (with correct predicate push down configuration). Keep in mind that
>> both formats work best if they are sorted on filter columns (which is your
>> responsibility) and if their optimatizations are correctly configured (min
>> max index, bloom filter, compression etc) .
>>
>> If you need to ingest sensor data you may want to store it first in hbase
>> and then batch process it in large files in Orc or parquet format.
>>
>> On 26 Jul 2016, at 04:09, janardhan shetty <ja...@gmail.com>
>> wrote:
>>
>> Just wondering advantages and disadvantages to convert data into ORC or
>> Parquet.
>>
>> In the documentation of Spark there are numerous examples of Parquet
>> format.
>>
>> Any strong reasons to chose Parquet over ORC file format ?
>>
>> Also : current data compression is bzip2
>>
>>
>> http://stackoverflow.com/questions/32373460/parquet-vs-orc-vs-orc-with-snappy
>> This seems like biased.
>>
>>
>

Re: ORC v/s Parquet for Spark 2.0

Posted by Ovidiu-Cristian MARCU <ov...@inria.fr>.
Interesting opinion, thank you

Still, according to the websites, parquet is basically inspired by Dremel (Google) [1] and part of orc has been enhanced while deployed at Facebook and Yahoo [2].

Other than this presentation [3], do you guys know any other benchmark?

[1] https://parquet.apache.org/documentation/latest/
[2] https://orc.apache.org/docs/
[3] http://www.slideshare.net/oom65/file-format-benchmarks-avro-json-orc-parquet

> On 26 Jul 2016, at 15:19, Koert Kuipers <ko...@tresata.com> wrote:
> 
> when parquet came out it was developed by a community of companies, and was designed as a library to be supported by multiple big data projects. nice
> 
> orc on the other hand initially only supported hive. it wasn't even designed as a library that can be re-used. even today it brings in the kitchen sink of transitive dependencies. yikes
> 
> 
> On Jul 26, 2016 5:09 AM, "Jörn Franke" <jornfranke@gmail.com <ma...@gmail.com>> wrote:
> I think both are very similar, but with slightly different goals. While they work transparently for each Hadoop application you need to enable specific support in the application for predicate push down. 
> In the end you have to check which application you are using and do some tests (with correct predicate push down configuration). Keep in mind that both formats work best if they are sorted on filter columns (which is your responsibility) and if their optimatizations are correctly configured (min max index, bloom filter, compression etc) . 
> 
> If you need to ingest sensor data you may want to store it first in hbase and then batch process it in large files in Orc or parquet format.
> 
> On 26 Jul 2016, at 04:09, janardhan shetty <janardhanp22@gmail.com <ma...@gmail.com>> wrote:
> 
>> Just wondering advantages and disadvantages to convert data into ORC or Parquet. 
>> 
>> In the documentation of Spark there are numerous examples of Parquet format. 
>> 
>> Any strong reasons to chose Parquet over ORC file format ?
>> 
>> Also : current data compression is bzip2
>> 
>> http://stackoverflow.com/questions/32373460/parquet-vs-orc-vs-orc-with-snappy <http://stackoverflow.com/questions/32373460/parquet-vs-orc-vs-orc-with-snappy> 
>> This seems like biased.


Re: ORC v/s Parquet for Spark 2.0

Posted by Koert Kuipers <ko...@tresata.com>.
when parquet came out it was developed by a community of companies, and was
designed as a library to be supported by multiple big data projects. nice

orc on the other hand initially only supported hive. it wasn't even
designed as a library that can be re-used. even today it brings in the
kitchen sink of transitive dependencies. yikes

On Jul 26, 2016 5:09 AM, "Jörn Franke" <jo...@gmail.com> wrote:

> I think both are very similar, but with slightly different goals. While
> they work transparently for each Hadoop application you need to enable
> specific support in the application for predicate push down.
> In the end you have to check which application you are using and do some
> tests (with correct predicate push down configuration). Keep in mind that
> both formats work best if they are sorted on filter columns (which is your
> responsibility) and if their optimatizations are correctly configured (min
> max index, bloom filter, compression etc) .
>
> If you need to ingest sensor data you may want to store it first in hbase
> and then batch process it in large files in Orc or parquet format.
>
> On 26 Jul 2016, at 04:09, janardhan shetty <ja...@gmail.com> wrote:
>
> Just wondering advantages and disadvantages to convert data into ORC or
> Parquet.
>
> In the documentation of Spark there are numerous examples of Parquet
> format.
>
> Any strong reasons to chose Parquet over ORC file format ?
>
> Also : current data compression is bzip2
>
>
> http://stackoverflow.com/questions/32373460/parquet-vs-orc-vs-orc-with-snappy
> This seems like biased.
>
>

Re: ORC v/s Parquet for Spark 2.0

Posted by Ovidiu-Cristian MARCU <ov...@inria.fr>.
So did you actually try to run your use case with Spark 2.0 and ORC files?
It's hard to understand your 'apparently...'.

Best,
Ovidiu
> On 26 Jul 2016, at 13:10, Gourav Sengupta <go...@gmail.com> wrote:
> 
> If you have ever tried to use ORC via SPARK you will know that SPARK's promise of accessing ORC files is a sham. SPARK cannot access partitioned tables via HIVEcontext which are ORC, SPARK cannot stripe through ORC faster and what more, if you are using SQL and have thought of using HIVE with ORC on TEZ, then it runs way better, faster and leaner than SPARK. 
> 
> I can process almost a few billion records close to a terabyte in a cluster with around 100GB RAM and 40 cores in a few hours, and find it a challenge doing the same with SPARK. 
> 
> But apparently, everything is resolved in SPARK 2.0.
> 
> 
> Regards,
> Gourav Sengupta
> 
> On Tue, Jul 26, 2016 at 11:50 AM, Ofir Manor <ofir.manor@equalum.io <ma...@equalum.io>> wrote:
> One additional point specific to Spark 2.0 - for the alpha Structured Streaming API (only),  the file sink only supports Parquet format (I'm sure that limitation will be lifted in a future release before Structured Streaming is GA):
>      "File sink - Stores the output to a directory. As of Spark 2.0, this only supports Parquet file format, and Append output mode."
>      http://people.apache.org/~pwendell/spark-releases/spark-2.0.0-rc5-docs/structured-streaming-programming-guide.html#where-to-go-from-here <http://people.apache.org/~pwendell/spark-releases/spark-2.0.0-rc5-docs/structured-streaming-programming-guide.html#where-to-go-from-here>
> 
> ​
> 


Re: ORC v/s Parquet for Spark 2.0

Posted by Gourav Sengupta <go...@gmail.com>.
If you have ever tried to use ORC via SPARK you will know that SPARK's
promise of accessing ORC files is a sham. SPARK cannot access partitioned
ORC tables via HiveContext, SPARK cannot stripe through ORC any faster, and
what is more, if you are using SQL and have thought of using HIVE with ORC
on TEZ, then it runs way better, faster and leaner than SPARK.

I can process a few billion records, close to a terabyte, in a cluster with
around 100GB RAM and 40 cores in a few hours, and find it a challenge to do
the same with SPARK.

But apparently, everything is resolved in SPARK 2.0.


Regards,
Gourav Sengupta

On Tue, Jul 26, 2016 at 11:50 AM, Ofir Manor <of...@equalum.io> wrote:

> One additional point specific to Spark 2.0 - for the alpha Structured
> Streaming API (only), the file sink only supports Parquet format (I'm sure
> that limitation will be lifted in a future release before Structured
> Streaming is GA):
>      "File sink - Stores the output to a directory. As of Spark 2.0, this
> only supports Parquet file format, and Append output mode."
>
> http://people.apache.org/~pwendell/spark-releases/spark-2.0.0-rc5-docs/structured-streaming-programming-guide.html#where-to-go-from-here
>
>

Re: ORC v/s Parquet for Spark 2.0

Posted by Ofir Manor <of...@equalum.io>.
One additional point specific to Spark 2.0 - for the alpha Structured
Streaming API (only), the file sink only supports Parquet format (I'm sure
that limitation will be lifted in a future release before Structured
Streaming is GA):
     "File sink - Stores the output to a directory. As of Spark 2.0, this
only supports Parquet file format, and Append output mode."

http://people.apache.org/~pwendell/spark-releases/spark-2.0.0-rc5-docs/structured-streaming-programming-guide.html#where-to-go-from-here
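
To make that concrete, here is a minimal Scala sketch of writing a stream to
the Parquet file sink in Spark 2.0; the socket source, paths and checkpoint
location are placeholders picked for illustration, not part of the quoted
documentation.

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("streaming-parquet-sink-sketch").getOrCreate()

// Placeholder streaming source; any supported source is written out the same way.
val lines = spark.readStream
  .format("socket")
  .option("host", "localhost")
  .option("port", 9999)
  .load()

// In Spark 2.0 the file sink accepts only the Parquet format and Append mode.
val query = lines.writeStream
  .format("parquet")
  .option("path", "/data/streaming/out")
  .option("checkpointLocation", "/data/streaming/checkpoint")
  .outputMode("append")
  .start()

query.awaitTermination()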


Re: ORC v/s Parquet for Spark 2.0

Posted by Jörn Franke <jo...@gmail.com>.
I think both are very similar, but with slightly different goals. While they work transparently for each Hadoop application, you need to enable specific support in the application for predicate push down. 
In the end you have to check which application you are using and do some tests (with correct predicate push down configuration). Keep in mind that both formats work best if they are sorted on filter columns (which is your responsibility) and if their optimizations are correctly configured (min max index, bloom filter, compression, etc.).
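
As a rough Scala sketch of what "sorted on the filter columns, with pushdown
enabled" can look like through the Spark DataFrame API — the paths, the
event_date column and the ORC bloom-filter option are illustrative
assumptions, and the option pass-through should be verified against your
Spark/ORC versions:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("orc-parquet-pushdown-sketch").getOrCreate()
import spark.implicits._

// Hypothetical input; only the write/read pattern below matters.
val events = spark.read.json("/data/raw/events")

// Sorting on the filter column is your responsibility: it is what makes
// the min/max statistics per ORC stripe / Parquet row group selective.
events.repartition($"event_date")
  .sortWithinPartitions("event_date")
  .write
  .option("orc.bloom.filter.columns", "event_date") // passed to the ORC writer, if supported
  .orc("/data/warehouse/events_orc")

// Make sure predicate pushdown is switched on for the reader.
spark.conf.set("spark.sql.orc.filterPushdown", "true")
val recent = spark.read
  .orc("/data/warehouse/events_orc")
  .filter($"event_date" >= "2016-07-01")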

If you need to ingest sensor data you may want to store it first in HBase and then batch process it into large files in ORC or Parquet format.
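
A sketch of the batch half of that pipeline (rewriting many small staged
files into a few large ORC files per day); the staging path, partition
column and file count are made up for illustration, and it reuses the spark
session from the previous sketch:

// Read whatever the ingestion layer staged (a dump from HBase, small JSON
// files, etc.) and rewrite it as a small number of large ORC files.
val staged = spark.read.json("/data/staging/sensor/2016-07-26")

staged.coalesce(8)                  // a few large files instead of many small ones
  .write
  .mode("append")
  .partitionBy("sensor_id")         // hypothetical partition column
  .orc("/data/warehouse/sensor_orc")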

> On 26 Jul 2016, at 04:09, janardhan shetty <ja...@gmail.com> wrote:
> 
> Just wondering advantages and disadvantages to convert data into ORC or Parquet. 
> 
> In the documentation of Spark there are numerous examples of Parquet format. 
> 
> Any strong reasons to choose Parquet over ORC file format?
> 
> Also : current data compression is bzip2
> 
> http://stackoverflow.com/questions/32373460/parquet-vs-orc-vs-orc-with-snappy 
> This seems biased.

Re: ORC v/s Parquet for Spark 2.0

Posted by janardhan shetty <ja...@gmail.com>.
Thanks Timur for the explanation.
What if the data is log data which is delimited (CSV or TSV), doesn't have
too much nesting, and is already stored as files?

On Mon, Jul 25, 2016 at 7:38 PM, Timur Shenkao <ts...@timshenkao.su> wrote:

> 1) The opinions on StackOverflow are correct, not biased.
> 2) Cloudera promoted Parquet; Hortonworks promoted ORC + Tez. When it became
> obvious that just a file format is not enough and Impala sucks, Cloudera
> announced https://vision.cloudera.com/one-platform/ and focused on Spark
> 3) There is a race between ORC & Parquet: after some release ORC
> becomes better & faster, then, several months later, Parquet may outperform it.
> 4) If you use "flat" tables --> ORC is better. If you have highly nested
> data with arrays inside of dictionaries (for instance, JSON that isn't
> flattened) then maybe one should choose Parquet
> 5) AFAIK, Parquet has its metadata at the end of the file (correct me if
> something has changed). It means that a Parquet file must be completely read
> & put into RAM. If there is not enough RAM or the file is somehow corrupted -->
> problems arise
>
> On Tue, Jul 26, 2016 at 5:09 AM, janardhan shetty <ja...@gmail.com>
> wrote:
>
>> Just wondering advantages and disadvantages to convert data into ORC or
>> Parquet.
>>
>> In the documentation of Spark there are numerous examples of Parquet
>> format.
>>
>> Any strong reasons to choose Parquet over ORC file format?
>>
>> Also : current data compression is bzip2
>>
>>
>> http://stackoverflow.com/questions/32373460/parquet-vs-orc-vs-orc-with-snappy
>> This seems biased.
>>
>
>