Posted to dev@beam.apache.org by Chaim Turkel <ch...@behalf.com> on 2018/10/21 09:17:52 UTC

2 tier input

hi,
  I have the following flow I need to implement:
from BigQuery I run a query and get a list of IDs, then I need to
load from Mongo all the documents matching those IDs and export them
as an XML file.
How do you suggest I go about doing this?

chaim

-- 

Loans are funded by FinWise Bank, a Utah-chartered bank located in Sandy, Utah, member FDIC, Equal Opportunity Lender. Merchant Cash Advances are made by Behalf. For more information on ECOA, click here <https://www.behalf.com/legal/ecoa/>. For important information about opening a new account, review Patriot Act procedures here <https://www.behalf.com/legal/patriot/>. Visit Legal <https://www.behalf.com/legal/> to review our comprehensive program terms, conditions, and disclosures.

Re: 2 tier input

Posted by Lukasz Cwik <lc...@google.com>.
Yes, this will change. Apache Beam has been working toward a general
solution to make all IO connectors modular[1]. This would allow you
to read from an arbitrary number of sources, chaining the output from one to
the next.

1: https://beam.apache.org/blog/2017/08/16/splittable-do-fn.html


Re: 2 tier input

Posted by Chaim Turkel <ch...@behalf.com>.
Both solutions mean that I cannot use the Beam IO classes that would
give me the distribution; I would have to fetch the data myself in a
ParDo. Is this something that will change in the future? I understand
that Spark has a push-down mechanism that passes the filter down to
the next level of queries.
chaim


Re: 2 tier input

Posted by Jeff Klukas <jk...@mozilla.com>.
Chaim - If the full list of IDs fits comfortably in memory and the
Mongo collection is small enough that you can read the whole
collection, you may want to fetch the IDs into a Java collection using the
BigQuery API directly, then turn them into a Beam PCollection using
Create.of(collection_of_ids). You could then use MongoDbIO.read() to read
the entire collection, discarding rows based on the side input of IDs.

If the list of IDs is particularly small, you could fetch it into
memory and format it into a filter string that you pass to
MongoDbIO.read() to specify which documents to fetch, avoiding the need
for a side input.
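
To make the second option concrete, here is a minimal sketch in plain Java of how one might build such a filter string. The helper name `inFilter`, the use of string IDs, and the exact `{"_id": {"$in": [...]}}` shape are illustrative assumptions, not Beam or Mongo API; the resulting string would be handed to MongoDbIO.read()'s filter option.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

// Sketch: turn a small in-memory list of IDs into a Mongo filter string.
// Helper name and JSON shape are assumptions for illustration only.
public class MongoFilterSketch {

    // Builds a {"_id": {"$in": [...]}} filter over string IDs.
    static String inFilter(List<String> ids) {
        String quoted = ids.stream()
                .map(id -> "\"" + id + "\"")
                .collect(Collectors.joining(", "));
        return "{ \"_id\": { \"$in\": [" + quoted + "] } }";
    }

    public static void main(String[] args) {
        // In the real pipeline these IDs would come from the BigQuery API.
        List<String> ids = Arrays.asList("a1", "a2", "a3");
        System.out.println(inFilter(ids));
    }
}
```

Note that this only works while the ID list stays small enough that the resulting query document remains within Mongo's limits.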

Otherwise, if it's a large number of IDs, you may need to use Beam's
BigQueryIO to create a PCollection of the IDs, and then pass that into a
ParDo with a custom DoFn that issues Mongo queries for batches of IDs. I'm
not very familiar with Mongo APIs, but you'd need to give the DoFn a
serializable connection to Mongo. You could likely look at the
implementation of MongoDbIO for inspiration there.
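
The core of what such a DoFn would do is buffer incoming IDs and issue one Mongo query per batch rather than one per ID. Only the chunking logic is sketched below in plain Java; the DoFn wiring and the Mongo client are omitted, and the class name, method name, and batch size of 100 are all arbitrary assumptions for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Sketch of the batching a custom DoFn could perform: split the full ID
// list into fixed-size chunks, then query Mongo once per chunk.
public class IdBatcher {

    // Partitions items into consecutive sublists of at most `size` elements.
    static <T> List<List<T>> batches(List<T> items, int size) {
        List<List<T>> out = new ArrayList<>();
        for (int i = 0; i < items.size(); i += size) {
            out.add(items.subList(i, Math.min(i + size, items.size())));
        }
        return out;
    }

    public static void main(String[] args) {
        List<Integer> ids = new ArrayList<>();
        for (int i = 0; i < 250; i++) ids.add(i);
        // 250 IDs in batches of 100 yields 3 queries instead of 250.
        System.out.println(batches(ids, 100).size()); // prints 3
    }
}
```

Inside a real DoFn one would typically accumulate IDs per bundle and flush a batch query in finishBundle, keeping the non-serializable Mongo client out of the serialized DoFn state by creating it in setup.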
