Posted to user@beam.apache.org by Jordan Thomas-Green <jj...@gmail.com> on 2020/04/20 23:11:20 UTC

Large public Beam projects?

Does anyone have any public repos/examples of larger Beam
projects/implementations that they've seen?

Re: Large public Beam projects?

Posted by Tim Robertson <ti...@gmail.com>.
My apologies, I missed the link:

[1] https://github.com/gbif/pipelines

On Tue, Apr 21, 2020 at 5:58 PM Tim Robertson <ti...@gmail.com>
wrote:

> Hi Jordan
>
> I don't know if we qualify as a large Beam project but at GBIF.org we
> bring together datasets from 1600+ institutions documenting 1.4B
> observations of species (museum data, citizen science, environmental
> reports, etc.).
> As far as Beam goes, though, we aren't using the most advanced
> features. It's batch processing of data into Avro files stored on HDFS, then
> into HBase / Elasticsearch.
>
> All our data and code [1] are open and I'm happy to discuss any aspect of
> it if it is helpful to you.
>
> Best wishes,
> Tim

Re: Large public Beam projects?

Posted by Tim Robertson <ti...@gmail.com>.
Hi Jordan

I don't know if we qualify as a large Beam project but at GBIF.org we bring
together datasets from 1600+ institutions documenting 1.4B observations of
species (museum data, citizen science, environmental reports, etc.).
As far as Beam goes, though, we aren't using the most advanced
features. It's batch processing of data into Avro files stored on HDFS, then
into HBase / Elasticsearch.

All our data and code [1] are open and I'm happy to discuss any aspect of
it if it is helpful to you.

Best wishes,
Tim

On Tue, Apr 21, 2020 at 3:48 PM Jeff Klukas <jk...@mozilla.com> wrote:

> Mozilla hosts the code for our data ingestion system publicly on GitHub. A
> good chunk of that architecture consists of Beam pipelines running on
> Dataflow.
>
> See:
>
> https://github.com/mozilla/gcp-ingestion/tree/master/ingestion-beam
>
> and rendered usage documentation at:
>
> https://mozilla.github.io/gcp-ingestion/ingestion-beam/
>
> On Mon, Apr 20, 2020 at 7:11 PM Jordan Thomas-Green <jj...@gmail.com>
> wrote:
>
>> Does anyone have any public repos/examples of larger Beam
>> projects/implementations that they've seen?
>>
>

Re: Large public Beam projects?

Posted by Jeff Klukas <jk...@mozilla.com>.
Mozilla hosts the code for our data ingestion system publicly on GitHub. A
good chunk of that architecture consists of Beam pipelines running on
Dataflow.

See:

https://github.com/mozilla/gcp-ingestion/tree/master/ingestion-beam

and rendered usage documentation at:

https://mozilla.github.io/gcp-ingestion/ingestion-beam/

On Mon, Apr 20, 2020 at 7:11 PM Jordan Thomas-Green <jj...@gmail.com>
wrote:

> Does anyone have any public repos/examples of larger Beam
> projects/implementations that they've seen?
>