Posted to user@spark.apache.org by weoccc <we...@gmail.com> on 2022/01/09 06:45:32 UTC

hive table with large column data size

Hi ,

I want to store binary data (such as images) in a Hive table, but the binary
data column might be much larger than the other columns per row. I'm worried
about query performance. One way I can think of is to separate the binary
data from the other columns by creating two Hive tables, running two
separate Spark queries, and joining the results later.
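For what it's worth, a minimal PySpark sketch of that two-table idea (the table names images_meta / images_blob, the image_id key and the column names are all made up for illustration):

from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# Narrow table: only the small columns, cheap to scan and filter.
meta = spark.sql(
    "SELECT image_id, label, created_at FROM images_meta WHERE label = 'cat'"
)

# Wide table: just the join key plus the large binary column.
blobs = spark.sql("SELECT image_id, image_bytes FROM images_blob")

# Join only the rows that survived the filter, so the large binary
# column is materialized for as few rows as possible.
result = meta.join(blobs, on="image_id", how="inner")
result.write.mode("overwrite").parquet("/tmp/filtered_images")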

Later, I found that Parquet supports splitting columns into different files,
as described here:
https://parquet.apache.org/documentation/latest/

I'm wondering if Spark SQL already supports that? If so, how do I use it?

Weide

Re: hive table with large column data size

Posted by Gourav Sengupta <go...@gmail.com>.
Hi,

As always, before answering the question, can I please ask what you are
trying to achieve by storing the data in a table? How are you planning to
query the binary data?

In relational theory, a table is a relation (entity) and the fields are its
attributes. You might consider an image to be an attribute of a tuple (or
record) belonging to a particular relation, but there may be more efficient
ways of storing the binary data; it all depends on what you are trying to do.

For the data types, please look here:
https://spark.apache.org/docs/latest/sql-ref-datatypes.html. Parquet is
definitely a columnar format, and, if I am not entirely wrong, Spark reads it
in a columnar fashion by default.
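A quick way to see that column pruning in practice, as a sketch only (the table name images and its columns are invented for the example):

from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# Select only the small columns; with a columnar source such as Parquet,
# Spark's column pruning avoids reading the large binary column at all.
df = spark.table("images").select("image_id", "label")

# The physical plan's ReadSchema should list only image_id and label.
df.explain()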


Regards,
Gourav Sengupta

On Sun, Jan 9, 2022 at 2:34 PM weoccc <we...@gmail.com> wrote:

> Hi ,
>
> I want to store binary data (such as images) in a Hive table, but the
> binary data column might be much larger than the other columns per row. I'm
> worried about query performance. One way I can think of is to separate the
> binary data from the other columns by creating two Hive tables, running two
> separate Spark queries, and joining the results later.
>
> Later, I found that Parquet supports splitting columns into different files,
> as described here:
> https://parquet.apache.org/documentation/latest/
>
> I'm wondering if Spark SQL already supports that? If so, how do I use it?
>
> Weide
>

Re: hive table with large column data size

Posted by Jörn Franke <jo...@gmail.com>.
It is not good practice to do this. Just store a reference to the binary data kept on HDFS.
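For example, a rough sketch of that reference pattern in PySpark (the table images_meta, the image_path column, and the filter are placeholders, not anything from the thread):

from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# The Hive table keeps only the small columns plus an HDFS path per image.
meta = spark.table("images_meta").filter("label = 'cat'")

# Collecting the paths to the driver is fine for a modest number of rows.
paths = [row.image_path for row in meta.select("image_path").collect()]

# Read the actual bytes only for the rows you need, via Spark's built-in
# binaryFile source (Spark 3.0+); it yields path, modificationTime, length, content.
blobs = spark.read.format("binaryFile").load(paths)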

> On 09.01.2022 at 15:34, weoccc <we...@gmail.com> wrote:
> 
> 
> Hi ,
> 
> I want to store binary data (such as images) in a Hive table, but the binary data column might be much larger than the other columns per row. I'm worried about query performance. One way I can think of is to separate the binary data from the other columns by creating two Hive tables, running two separate Spark queries, and joining the results later.
> 
> Later, I found that Parquet supports splitting columns into different files, as described here:
> https://parquet.apache.org/documentation/latest/
> 
> I'm wondering if Spark SQL already supports that? If so, how do I use it?
> 
> Weide