Posted to user@spark.apache.org by Vitaliy Pisarev <vi...@biocatch.com> on 2018/04/08 17:52:19 UTC

Does joining tables in Spark multiply selected columns of the smaller table?

I have two tables in Spark:

T1
|--x1
|--x2

T2
|--z1
|--z2


   - T1 is much larger than T2
   - The values in column z2 are *very large*
   - There is a many-to-one relationship between T1 and T2 (via the x2 and
   z1 columns, respectively).

I perform the following query:

select T1.x1, T2.z2 from T1
join T2 on T1.x2 = T2.z1

In the resulting data set, the same value from T2.z2 will be duplicated
across many values of T1.x1.

Since this value is very heavy, I am concerned: is the data actually
duplicated, or are there internal optimisations that maintain only
references?
P.S.
Originally posted on SO <https://stackoverflow.com/q/49716385/180650>
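The copies-versus-references concern can be sketched in plain Python (a toy
model, not Spark internals; all names are made up): within one in-memory
structure, every joined row can share a single reference to the large value,
but as soon as rows are serialized independently (as happens when rows are
shuffled or written out), the payload is copied once per row.

```python
import pickle

# Toy model: a hash join where T2's large value is held once in memory.
big_value = "X" * 1_000_000                    # stand-in for a very large z2

t2 = {"k1": big_value}                         # z1 -> z2
t1 = [("a", "k1"), ("b", "k1"), ("c", "k1")]   # (x1, x2) rows

joined = [(x1, t2[x2]) for x1, x2 in t1]

# In this single process, every joined row references the SAME object:
assert all(row[1] is big_value for row in joined)

# But serializing each row independently copies the payload per row:
serialized = [pickle.dumps(row) for row in joined]
print(sum(len(b) for b in serialized))  # roughly 3x the size of big_value
```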

Re: Does joining tables in Spark multiply selected columns of the smaller table?

Posted by Vitaliy Pisarev <vi...@biocatch.com>.
The value is already stored in Azure Blob Storage and the entities in T1
reference it. My problem is that in the computation I need to run, fetching
the referenced value incurs a very large I/O penalty.

The reason is that the fetch is done once per record in T1, which may contain
a million records.

Fortunately, I have the referenced values stored in Parquet, so I figured
I'd try a different access pattern.
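Since the relationship is many-to-one, one way to avoid paying the I/O cost
once per T1 record is to fetch each distinct reference only once and reuse
it. A minimal sketch in plain Python (`fetch_blob` is a hypothetical
stand-in for the expensive blob-store read):

```python
fetch_count = 0

def fetch_blob(ref):
    """Hypothetical stand-in for the expensive blob-store read."""
    global fetch_count
    fetch_count += 1
    return f"payload-for-{ref}"

t1_rows = [("a", "ref1"), ("b", "ref1"), ("c", "ref2")]  # (x1, x2)

# Fetch once per DISTINCT reference, not once per row.
cache = {}
for _, ref in t1_rows:
    if ref not in cache:
        cache[ref] = fetch_blob(ref)

joined = [(x1, cache[ref]) for x1, ref in t1_rows]
print(fetch_count)  # 2 fetches for 3 rows (one per distinct reference)
```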


On Sun, Apr 8, 2018, 20:58 Jörn Franke <jo...@gmail.com> wrote:

> What do you mean by the value in T2 being very large? How large is it, and
> what is it? You could put the large data in separate files on HDFS and just
> maintain a file name in the table.
>

Re: Does joining tables in Spark multiply selected columns of the smaller table?

Posted by Jörn Franke <jo...@gmail.com>.
What do you mean by the value in T2 being very large? How large is it, and what is it? You could put the large data in separate files on HDFS and just maintain a file name in the table.
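A minimal sketch of this file-reference pattern in plain Python (file layout
and helper names are hypothetical): the table keeps only a path, the join
duplicates just the short path string across rows, and the heavy bytes are
read only on demand.

```python
import os
import tempfile

tmpdir = tempfile.mkdtemp()

def store_payload(key, payload):
    """Write a large payload to its own file; return the path to store in T2."""
    path = os.path.join(tmpdir, f"{key}.bin")
    with open(path, "wb") as f:
        f.write(payload)
    return path

# T2 now carries a file path instead of the payload itself.
t2 = {"k1": store_payload("k1", b"X" * 1_000_000)}

# The join only duplicates the short path string across T1 rows...
t1 = [("a", "k1"), ("b", "k1")]
joined = [(x1, t2[x2]) for x1, x2 in t1]

# ...and the heavy bytes are read only when actually needed.
def load(path):
    with open(path, "rb") as f:
        return f.read()

payload = load(joined[0][1])
print(len(payload))  # 1000000
```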
