Posted to user@beam.apache.org by "Prabeesh K." <pr...@gmail.com> on 2017/05/03 13:30:54 UTC

Re: BigQuery join in Apache beam

Hi Dan,

Sorry for the late response.

I agree with you about the use cases that you mentioned.

Could you please advise me, and share any sample code you have, on how to
join two data sets in Beam that share some common keys?

Regards,
Prabeesh K.

On 6 February 2017 at 10:38, Dan Halperin <dh...@google.com> wrote:

> Definitely, using BigQuery for what BigQuery is really good at (big scans
> and cost-based joins) is nearly always a good idea. A strong endorsement of
> Ankur's answer.
>
> Pushing the right amount of work into a database is an art, however --
> there are some scenarios where you'd rather scan in BQ and join in Beam
> because the join result is very large and you can better filter it in Beam,
> or because you need to do some pre-join-filtering based on an external API
> call (and you don't want to load the results of that API call into
> BigQuery)...
>
> I've only seen a few, rare, cases of the latter.
>
> Thanks,
> Dan
>
> On Sun, Feb 5, 2017 at 9:19 PM, Prabeesh K. <pr...@gmail.com> wrote:
>
>> Hi Ankur,
>>
>> Thank you for your response.
>>
>> On 5 February 2017 at 23:59, Ankur Chauhan <an...@malloc64.com> wrote:
>>
>>> I have found that doing joins in BigQuery using SQL is a lot faster and
>>> easier to iterate upon.
>>>
>>>
>>> Ankur Chauhan
>>> On Sat, Feb 4, 2017 at 22:05 Prabeesh K. <ma...@prabeeshk.com> wrote:
>>>
>>>> Hi,
>>>>
>>>> Which is the better way to join two tables in Apache Beam?
>>>>
>>>> Regards,
>>>> Prabeesh K.
>>>>
>>>
>>
>

Re: BigQuery join in Apache beam

Posted by "Prabeesh K." <pr...@gmail.com>.
Hi Dan,

Thank you for your prompt reply.

Regards,
Prabeesh K.

On 3 May 2017 at 19:23, Dan Halperin <dh...@google.com> wrote:

> Hi Prabeesh,
>
> The underlying Beam primitive you use for Join is CoGroupByKey – this
> takes N different collections KV<K, V1>, KV<K, V2>, ..., KV<K, VN> and
> produces one collection KV<K, [Iterable<V1>, Iterable<V2>, ...,
> Iterable<VN>]>. This is a compressed representation of a Join result, in
> that you can expand it to a full outer join, you can implement inner join,
> and you can implement lots of other join algorithms.
>
> There is also a Join library that does this under the hood:
> https://github.com/apache/beam/tree/master/sdks/java/extensions/join-library
>
> Dan

Re: BigQuery join in Apache beam

Posted by Dan Halperin <dh...@google.com>.
Hi Prabeesh,

The underlying Beam primitive you use for Join is CoGroupByKey – this takes
N different collections KV<K, V1>, KV<K, V2>, ..., KV<K, VN> and produces
one collection KV<K, [Iterable<V1>, Iterable<V2>, ..., Iterable<VN>]>. This
is a compressed representation of a Join result, in that you can expand it
to a full outer join, you can implement inner join, and you can implement
lots of other join algorithms.
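
To make the shape of that result concrete, here is a minimal sketch in plain
Python. It does not call Beam; it only models the grouping described above.
The function name co_group_by_key and the sample data are hypothetical, chosen
for illustration.

```python
from collections import defaultdict


def co_group_by_key(*collections):
    """Group N keyed collections into {key: (values_1, ..., values_N)}.

    A plain-Python model of the result shape described above,
    not the Beam API.
    """
    n = len(collections)
    # Each key maps to a tuple of N lists, one list per input collection.
    grouped = defaultdict(lambda: tuple([] for _ in range(n)))
    for index, collection in enumerate(collections):
        for key, value in collection:
            grouped[key][index].append(value)
    return dict(grouped)


# Hypothetical sample data: two collections that share some keys.
emails = [("amy", "amy@example.com"), ("carl", "carl@example.com")]
phones = [("amy", "111-222"), ("amy", "333-444"), ("james", "555-666")]

grouped = co_group_by_key(emails, phones)
for key in sorted(grouped):
    print(key, grouped[key])
```

Every key that appears in any input shows up exactly once, and a collection
that has no value for a key contributes an empty list. That is why the result
carries enough information to derive inner, outer, and other joins.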

There is also a Join library that does this under the hood:
https://github.com/apache/beam/tree/master/sdks/java/extensions/join-library
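
As a rough illustration of how a grouped result can be expanded into
different joins, here is a plain-Python sketch for the two-collection case.
It does not use the Join library's API (that library is Java); the function
names, the None placeholder for a missing side, and the sample data are all
assumptions made for this example.

```python
from itertools import product


def inner_join(grouped):
    """Emit (key, left, right) only for keys present on both sides."""
    rows = []
    for key, (left, right) in sorted(grouped.items()):
        # product() yields nothing when either side is empty.
        rows.extend((key, l, r) for l, r in product(left, right))
    return rows


def full_outer_join(grouped, null=None):
    """Emit (key, left, right), padding a missing side with `null`."""
    rows = []
    for key, (left, right) in sorted(grouped.items()):
        # Substitute a single placeholder for an empty side.
        rows.extend(
            (key, l, r) for l, r in product(left or [null], right or [null])
        )
    return rows


# Hypothetical grouped input: key -> (values from left, values from right).
grouped = {
    "amy": (["amy@example.com"], ["111-222", "333-444"]),
    "carl": (["carl@example.com"], []),
    "james": ([], ["555-666"]),
}

print(inner_join(grouped))
print(full_outer_join(grouped))
```

The only difference between the two functions is how an empty side is
handled, which is the sense in which the grouped form is a compressed
representation of several join results.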


Dan
