Posted to commits@arrow.apache.org by "yevgenypats (via GitHub)" <gi...@apache.org> on 2023/04/26 11:59:39 UTC

[GitHub] [arrow-site] yevgenypats opened a new pull request, #348: [Website]: Adopting Apache Arrow at CloudQuery

yevgenypats opened a new pull request, #348:
URL: https://github.com/apache/arrow-site/pull/348

   Hi Team, we wrote a case study about our journey and decision to adopt Arrow at [CloudQuery](https://github.com/cloudquery/cloudquery). Once reviewed and merged we would also like to cross-post it on our website, similar to the [DuckDB post](https://arrow.apache.org/blog/2021/12/03/arrow-duckdb/).
   
   Thank you! cc @zeroshade 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-site] ianmcook commented on a diff in pull request #348: [Website]: Adopting Apache Arrow at CloudQuery

Posted by "ianmcook (via GitHub)" <gi...@apache.org>.
ianmcook commented on code in PR #348:
URL: https://github.com/apache/arrow-site/pull/348#discussion_r1180827378


##########
_posts/2023-04-24-adopting-apache-arrow-at-cloudquery.md:
##########
@@ -0,0 +1,65 @@
+---
+layout: post
+title: "Adopting Apache Arrow at CloudQuery"
+date: "2023-04-26 00:00:00"
+author: Yevgeny Pats
+categories: [application]
+---
+<!--
+{% comment %}
+Licensed to the Apache Software Foundation (ASF) under one or more
+contributor license agreements.  See the NOTICE file distributed with
+this work for additional information regarding copyright ownership.
+The ASF licenses this file to you under the Apache License, Version 2.0
+(the "License"); you may not use this file except in compliance with
+the License.  You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+{% endcomment %}
+-->
+
+This post is a collaboration with CloudQuery and cross-posted on the CloudQuery [blog](https://cloudquery.io/blog/adopting-apache-arrow-at-cloudquery).
+
+[CloudQuery](https://github.com/cloudquery/cloudquery) is an open source high performance ELT framework written in Go. We [previously](https://www.cloudquery.io/blog/building-cloudquery) discussed some of the architecture and design decisions that we took to build a performant ELT framework. A type system is a key component for creating a performant and scalable ELT framework where sources and destinations are decoupled. In this blog post we will go through why we decided to adopt Apache Arrow as our type system and replace our in-house implementation.
+
+# What is a Type System?
+
+Let’s quickly [recap](https://www.cloudquery.io/blog/building-cloudquery#type-system) what a type system is and why an ELT framework needs one. At a very high level, an ELT framework extracts data from some source and moves it to some destination with a specific schema.
+
+```text
+API ---> [Source Plugin]  ----->    [Destination Plugin]
+                          ----->    [Destination Plugin]
+                           gRPC
+```
+
+
+Sources and destinations are decoupled and communicate via gRPC. This is crucial to allowing the addition of new destinations and updating old destinations without requiring updates to source plugin code (which otherwise would introduce an unmaintainable architecture).
+
+This is where a type system comes in. Source plugins extract information from APIs in the most performant way possible, defining a schema and then transforming the result from the API (JSON or any other format) to a well-defined type system. The destination plugin can then easily create the schema for its database and transform the incoming data to the destination types. So to recap, the source plugin sends mainly two things to a destination: 1) schema 2) data that fits the defined schema.
+
+# Why Arrow?
+
+Before Arrow, we used our own type system that supported more than 14 types. This served us well, but we started to hit limitations in various use-cases. For example, in database to database replication, we needed to support many more types, including nested types. Also, performance-wise, lots of the time spent in an ELT process is around converting data from one format to another, so we wanted to take a step back and see if we can avoid this [famous XKCD](https://xkcd.com/927/) (by building yet another format):
+
+<figure style="text-align: center;">
+  <img src="https://imgs.xkcd.com/comics/standards.png" width="100%" class="img-responsive" alt="Yet another standard XKCD">
+</figure>
+
+
+This is where Arrow comes in. Apache Arrow defines a language-independent columnar format for flat and hierarchical data, and brings the following advantages:
+
+1. Cross-language with extensive libraries for different languages - The [format](https://arrow.apache.org/docs/format/Columnar.html) is defined in such way that you can parse it in any language and already has extensive support in C/C++, C#, Go, Java, JavaScript, Julia, Matlab, Python, R, Ruby and Rust (at the time of writing). For CloudQuery this is important as it makes it much easier to develop source or destination plugins in different languages.
+2. Performance: Arrow adoption is rising especially in columnar based databases ([DuckDB](https://duckdb.org/2021/12/03/duck-arrow.html), [ClickHouse](https://clickhouse.com/docs/en/integrations/data-formats/arrow-avro-orc), [BigQuery](https://cloud.google.com/bigquery/docs/samples/bigquerystorage-arrow-quickstart)) and file formats ([Parquet](https://arrow.apache.org/docs/python/parquet.html)) which makes it easier to write CloudQuery destination or source plugins for databases that already support arrow as well as much more efficient as we remove the need for additional serialization and transformation step. Moreover, just the performance of sending arrow format from source plugin to destination is already more performant and memory efficient, given its “zero-copy” nature and not needing serialization/deserialization.

Review Comment:
   ```suggestion
   2. Performance: Arrow adoption is rising especially in columnar based databases ([DuckDB](https://duckdb.org/2021/12/03/duck-arrow.html), [ClickHouse](https://clickhouse.com/docs/en/integrations/data-formats/arrow-avro-orc), [BigQuery](https://cloud.google.com/bigquery/docs/samples/bigquerystorage-arrow-quickstart)) and file formats ([Parquet](https://arrow.apache.org/docs/python/parquet.html)) which makes it easier to write CloudQuery destination or source plugins for databases that already support Arrow as well as much more efficient as we remove the need for additional serialization and transformation step. Moreover, just the performance of sending Arrow format from source plugin to destination is already more performant and memory efficient, given its “zero-copy” nature and not needing serialization/deserialization. 
   ```





[GitHub] [arrow-site] yevgenypats commented on a diff in pull request #348: [Website]: Adopting Apache Arrow at CloudQuery

Posted by "yevgenypats (via GitHub)" <gi...@apache.org>.
yevgenypats commented on code in PR #348:
URL: https://github.com/apache/arrow-site/pull/348#discussion_r1181231897


##########
_posts/2023-04-24-adopting-apache-arrow-at-cloudquery.md:
##########
@@ -0,0 +1,65 @@
+---
+layout: post
+title: "Adopting Apache Arrow at CloudQuery"
+date: "2023-04-26 00:00:00"

Review Comment:
   I don't have merge permissions but updated it for today. Mind merging it in? :)





[GitHub] [arrow-site] yevgenypats commented on a diff in pull request #348: [Website]: Adopting Apache Arrow at CloudQuery

Posted by "yevgenypats (via GitHub)" <gi...@apache.org>.
yevgenypats commented on code in PR #348:
URL: https://github.com/apache/arrow-site/pull/348#discussion_r1181013411


##########
_posts/2023-04-24-adopting-apache-arrow-at-cloudquery.md:
##########
@@ -0,0 +1,65 @@
+2. Performance: Arrow adoption is rising especially in columnar based databases ([DuckDB](https://duckdb.org/2021/12/03/duck-arrow.html), [ClickHouse](https://clickhouse.com/docs/en/integrations/data-formats/arrow-avro-orc), [BigQuery](https://cloud.google.com/bigquery/docs/samples/bigquerystorage-arrow-quickstart)) and file formats ([Parquet](https://arrow.apache.org/docs/python/parquet.html)) which makes it easier to write CloudQuery destination or source plugins for databases that already support arrow as well as much more efficient as we remove the need for additional serialization and transformation step. Moreover, just the performance of sending arrow format from source plugin to destination is already more performant and memory efficient, given its “zero-copy” nature and not needing serialization/deserialization.

Review Comment:
   I can add Flatbuffers. Yes, we use both Arrow arrays and types for data exchange.





[GitHub] [arrow-site] yevgenypats commented on a diff in pull request #348: [Website]: Adopting Apache Arrow at CloudQuery

Posted by "yevgenypats (via GitHub)" <gi...@apache.org>.
yevgenypats commented on code in PR #348:
URL: https://github.com/apache/arrow-site/pull/348#discussion_r1181013544


##########
_posts/2023-04-24-adopting-apache-arrow-at-cloudquery.md:
##########
@@ -0,0 +1,65 @@
+Before Arrow, we used our own type system that supported more than 14 types. This served us well, but we started to hit limitations in various use-cases. For example, in database to database replication, we needed to support many more types, including nested types. Also, performance-wise, lots of the time spent in an ELT process is around converting data from one format to another, so we wanted to take a step back and see if we can avoid this [famous XKCD](https://xkcd.com/927/) (by building yet another format):

Review Comment:
   Both; we send Arrow records over the wire.





[GitHub] [arrow-site] alamb commented on a diff in pull request #348: [Website]: Adopting Apache Arrow at CloudQuery

Posted by "alamb (via GitHub)" <gi...@apache.org>.
alamb commented on code in PR #348:
URL: https://github.com/apache/arrow-site/pull/348#discussion_r1181239838


##########
_posts/2023-04-24-adopting-apache-arrow-at-cloudquery.md:
##########
@@ -0,0 +1,65 @@
+---
+layout: post
+title: "Adopting Apache Arrow at CloudQuery"
+date: "2023-04-26 00:00:00"

Review Comment:
   Hi @yevgenypats  -- thank you for the update. Given the discussion on the mailing list is only a few days old I recommend we wait a few more days to see if there are additional comments.
   
   https://lists.apache.org/thread/mbbx6xzhp9zynkvncpw68j9f71p6gsbp
   
   I can handle merging this (and updating the date) when it is ready, perhaps sometime mid week





[GitHub] [arrow-site] alamb commented on a diff in pull request #348: [Website]: Adopting Apache Arrow at CloudQuery

Posted by "alamb (via GitHub)" <gi...@apache.org>.
alamb commented on code in PR #348:
URL: https://github.com/apache/arrow-site/pull/348#discussion_r1181211671


##########
_posts/2023-04-24-adopting-apache-arrow-at-cloudquery.md:
##########
@@ -0,0 +1,65 @@
+---
+layout: post
+title: "Adopting Apache Arrow at CloudQuery"
+date: "2023-04-26 00:00:00"

Review Comment:
   Before this PR is merged, please update this date to the actual publish date and rename the file accordingly.
   
   





[GitHub] [arrow-site] yevgenypats commented on a diff in pull request #348: [Website]: Adopting Apache Arrow at CloudQuery

Posted by "yevgenypats (via GitHub)" <gi...@apache.org>.
yevgenypats commented on code in PR #348:
URL: https://github.com/apache/arrow-site/pull/348#discussion_r1181231814


##########
_posts/2023-04-24-adopting-apache-arrow-at-cloudquery.md:
##########
@@ -0,0 +1,65 @@
+3. Rich Data Types: Arrow supports more than [35 types](https://arrow.apache.org/docs/python/api/datatypes.html) including composite types (i.e. lists, structs and maps of all the available types) and ability to extend the type system with custom types.

Review Comment:
   Nice addition!





[GitHub] [arrow-site] alamb commented on pull request #348: [Website]: Adopting Apache Arrow at CloudQuery

Posted by "alamb (via GitHub)" <gi...@apache.org>.
alamb commented on PR #348:
URL: https://github.com/apache/arrow-site/pull/348#issuecomment-1533303822

   Unless anyone else would like to comment on this PR I plan to merge it tomorrow




[GitHub] [arrow-site] yevgenypats commented on pull request #348: [Website]: Adopting Apache Arrow at CloudQuery

Posted by "yevgenypats (via GitHub)" <gi...@apache.org>.
yevgenypats commented on PR #348:
URL: https://github.com/apache/arrow-site/pull/348#issuecomment-1534599576

   Thanks! Published on our end and backlinked to this post as well. Exciting stuff! https://www.cloudquery.io/blog/adopting-apache-arrow-at-cloudquery 




[GitHub] [arrow-site] yevgenypats commented on a diff in pull request #348: [Website]: Adopting Apache Arrow at CloudQuery

Posted by "yevgenypats (via GitHub)" <gi...@apache.org>.
yevgenypats commented on code in PR #348:
URL: https://github.com/apache/arrow-site/pull/348#discussion_r1180719712


##########
_posts/2023-04-24-adopting-apache-arrow-at-cloudquery.md:
##########
@@ -0,0 +1,65 @@
+Before Arrow, we used our own type system that supported more than 14 types. This served us well, but we started to hit limitations in various use-cases. For example, in database to database replication, we needed to support many more types, including nested types. Also, performance-wise, lots of the time spent in an ELT process is around converting data from one format to another, so we wanted to take a step back and see if we can avoid this famous XKCD (by building yet another format):
+
+<figure style="text-align: center;">
+  <img src="https://imgs.xkcd.com/comics/standards.png" width="100%" class="img-responsive" alt="Yet another standard XKCD">

Review Comment:
   Yeah that's why I just pointed to the original hotlink. Should I change it?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-site] yevgenypats commented on a diff in pull request #348: [Website]: Adopting Apache Arrow at CloudQuery

Posted by "yevgenypats (via GitHub)" <gi...@apache.org>.
yevgenypats commented on code in PR #348:
URL: https://github.com/apache/arrow-site/pull/348#discussion_r1181013512


##########
_posts/2023-04-24-adopting-apache-arrow-at-cloudquery.md:
##########
@@ -0,0 +1,65 @@
+---
+layout: post
+title: "Adopting Apache Arrow at CloudQuery"
+date: "2023-04-26 00:00:00"
+author: Yevgeny Pats
+categories: [application]
+---
+<!--
+{% comment %}
+Licensed to the Apache Software Foundation (ASF) under one or more
+contributor license agreements.  See the NOTICE file distributed with
+this work for additional information regarding copyright ownership.
+The ASF licenses this file to you under the Apache License, Version 2.0
+(the "License"); you may not use this file except in compliance with
+the License.  You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+{% endcomment %}
+-->
+
+This post is a collaboration with CloudQuery and cross-posted on the CloudQuery [blog](https://cloudquery.io/blog/adopting-apache-arrow-at-cloudquery).
+
+[CloudQuery](https://github.com/cloudquery/cloudquery) is an open source high performance ELT framework written in Go. We [previously](https://www.cloudquery.io/blog/building-cloudquery) discussed some of the architecture and design decisions that we took to build a performant ELT framework. A type system is a key component for creating a performant and scalable ELT framework where sources and destinations are decoupled. In this blog post we will go through why we decided to adopt Apache Arrow as our type system and replace our in-house implementation.
+
+# What is a Type System?
+
+Let’s quickly [recap](https://www.cloudquery.io/blog/building-cloudquery#type-system) what a type system is and why an ELT framework needs one. At a very high level, an ELT framework extracts data from some source and moves it to some destination with a specific schema.
+
+```text
+API ---> [Source Plugin]  ----->    [Destination Plugin]
+                          ----->    [Destination Plugin]
+                           gRPC
+```
+
+
+Sources and destinations are decoupled and communicate via gRPC. This is crucial to allowing the addition of new destinations and updating old destinations without requiring updates to source plugin code (which otherwise would introduce an unmaintainable architecture).

Review Comment:
   Yes. I also pointed to our architecture docs for more info - https://www.cloudquery.io/docs/developers/architecture
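
The decoupling described above can be sketched as two interfaces that share only an intermediate schema/record representation. This is a simplified, hypothetical model in Python (CloudQuery's real SDK is written in Go and uses gRPC and Arrow; all names here are illustrative only):

```python
from dataclasses import dataclass
from typing import Iterator, Protocol

# Hypothetical intermediate representation: in CloudQuery's case these
# roles are played by Arrow schemas and Arrow record batches.
@dataclass(frozen=True)
class Schema:
    name: str
    columns: tuple[tuple[str, str], ...]  # (column name, type name)

@dataclass(frozen=True)
class Record:
    schema: Schema
    rows: tuple[tuple, ...]

class SourcePlugin(Protocol):
    def sync(self) -> Iterator[Record]: ...

class DestinationPlugin(Protocol):
    def write(self, records: Iterator[Record]) -> int: ...

# Both sides depend only on Schema/Record (the wire format), so a new
# destination needs no changes to any source, and vice versa.
class MemorySource:
    def sync(self):
        s = Schema("aws_ec2_instances", (("id", "string"), ("cpu_count", "int64")))
        yield Record(s, (("i-1", 2), ("i-2", 8)))

class PrintDestination:
    def write(self, records):
        written = 0
        for rec in records:
            written += len(rec.rows)
        return written

print(PrintDestination().write(MemorySource().sync()))  # 2
```

Because `MemorySource` and `PrintDestination` depend only on `Schema`/`Record`, either side can be added or changed without touching the other, which is the property the gRPC boundary preserves.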





[GitHub] [arrow-site] yevgenypats commented on pull request #348: [Website]: Adopting Apache Arrow at CloudQuery

Posted by "yevgenypats (via GitHub)" <gi...@apache.org>.
yevgenypats commented on PR #348:
URL: https://github.com/apache/arrow-site/pull/348#issuecomment-1527023157

   @zeroshade thx for the review! Can you please take another look at this one (applied all the fixes)? 




[GitHub] [arrow-site] ianmcook commented on a diff in pull request #348: [Website]: Adopting Apache Arrow at CloudQuery

Posted by "ianmcook (via GitHub)" <gi...@apache.org>.
ianmcook commented on code in PR #348:
URL: https://github.com/apache/arrow-site/pull/348#discussion_r1180722088


##########
_posts/2023-04-24-adopting-apache-arrow-at-cloudquery.md:
##########
@@ -0,0 +1,65 @@
+---
+layout: post
+title: "Adopting Apache Arrow at CloudQuery"
+date: "2023-04-26 00:00:00"
+author: Yevgeny Pats
+categories: [application]
+---
+
+This post is a collaboration with CloudQuery and cross-posted on the CloudQuery [blog](https://cloudquery.io/blog/adopting-apache-arrow-at-cloudquery).
+
+[CloudQuery](https://github.com/cloudquery/cloudquery) is an open source high performance ELT framework written in Go. We [previously](https://www.cloudquery.io/blog/building-cloudquery) discussed some of the architecture and design decisions that we took to build a performant ELT framework. A type system is a key component for creating a performant and scalable ELT framework where sources and destinations are decoupled. In this blog post we will go through why we decided to adopt Apache Arrow as our type system and replace our in-house implementation.
+
+# What is a Type System?
+
+Let’s quickly [recap](https://www.cloudquery.io/blog/building-cloudquery#type-system) what a type system is and why an ELT framework needs one. At a very high level, an ELT framework extracts data from some source and moves it to some destination with a specific schema.
+
+```text
+API ---> [Source Plugin]  ----->    [Destination Plugin]
+                          ----->    [Destination Plugin]
+                           gRPC
+```
+
+
+Sources and destinations are decoupled and communicate via gRPC. This is crucial to allowing the addition of new destinations and updating old destinations without requiring updates to source plugin code (which otherwise would introduce an unmaintainable architecture).
+
+This is where a type system comes in. Source plugins extract information from APIs in the most performant way possible, defining a schema and then transforming the result from the API (JSON or any other format) to a well-defined type system. The destination plugin can then easily create the schema for its database and transform the incoming data to the destination types. So to recap, the source plugin sends mainly two things to a destination: 1) schema 2) data that fits the defined schema.
+
+# Why Arrow?
+
+Before Arrow, we used our own type system that supported more than 14 types. This served us well, but we started to hit limitations in various use-cases. For example, in database to database replication, we needed to support many more types, including nested types. Also, performance-wise, lots of the time spent in an ELT process is around converting data from one format to another, so we wanted to take a step back and see if we can avoid this famous XKCD (by building yet another format):
+
+<figure style="text-align: center;">
+  <img src="https://imgs.xkcd.com/comics/standards.png" width="100%" class="img-responsive" alt="Yet another standard XKCD">

Review Comment:
   Cool, thanks. I think we should just hyperlink to https://xkcd.com/927/ as I suggested above.





[GitHub] [arrow-site] wjones127 commented on a diff in pull request #348: [Website]: Adopting Apache Arrow at CloudQuery

Posted by "wjones127 (via GitHub)" <gi...@apache.org>.
wjones127 commented on code in PR #348:
URL: https://github.com/apache/arrow-site/pull/348#discussion_r1180858119


##########
_posts/2023-04-24-adopting-apache-arrow-at-cloudquery.md:
##########
@@ -0,0 +1,65 @@
+---
+layout: post
+title: "Adopting Apache Arrow at CloudQuery"
+date: "2023-04-26 00:00:00"
+author: Yevgeny Pats
+categories: [application]
+---
+
+This post is a collaboration with CloudQuery and cross-posted on the CloudQuery [blog](https://cloudquery.io/blog/adopting-apache-arrow-at-cloudquery).
+
+[CloudQuery](https://github.com/cloudquery/cloudquery) is an open source high performance ELT framework written in Go. We [previously](https://www.cloudquery.io/blog/building-cloudquery) discussed some of the architecture and design decisions that we took to build a performant ELT framework. A type system is a key component for creating a performant and scalable ELT framework where sources and destinations are decoupled. In this blog post we will go through why we decided to adopt Apache Arrow as our type system and replace our in-house implementation.
+
+# What is a Type System?
+
+Let’s quickly [recap](https://www.cloudquery.io/blog/building-cloudquery#type-system) what a type system is and why an ELT framework needs one. At a very high level, an ELT framework extracts data from some source and moves it to some destination with a specific schema.
+
+```text
+API ---> [Source Plugin]  ----->    [Destination Plugin]
+                          ----->    [Destination Plugin]
+                           gRPC
+```
+
+
+Sources and destinations are decoupled and communicate via gRPC. This is crucial to allowing the addition of new destinations and updating old destinations without requiring updates to source plugin code (which otherwise would introduce an unmaintainable architecture).

Review Comment:
   So, if I understand correctly, another way of saying this is:
   
   > Sources and destinations each have their own way to represent the schema of some dataset. In order to avoid having to write a conversion between every pair of source and destination (which would be difficult to maintain), we choose an intermediate schema representation that all sources and destinations can be converted to and from.
   
   



##########
_posts/2023-04-24-adopting-apache-arrow-at-cloudquery.md:
##########
@@ -0,0 +1,65 @@
+---
+layout: post
+title: "Adopting Apache Arrow at CloudQuery"
+date: "2023-04-26 00:00:00"
+author: Yevgeny Pats
+categories: [application]
+---
+
+This post is a collaboration with CloudQuery and cross-posted on the CloudQuery [blog](https://cloudquery.io/blog/adopting-apache-arrow-at-cloudquery).
+
+[CloudQuery](https://github.com/cloudquery/cloudquery) is an open source high performance ELT framework written in Go. We [previously](https://www.cloudquery.io/blog/building-cloudquery) discussed some of the architecture and design decisions that we took to build a performant ELT framework. A type system is a key component for creating a performant and scalable ELT framework where sources and destinations are decoupled. In this blog post we will go through why we decided to adopt Apache Arrow as our type system and replace our in-house implementation.
+
+# What is a Type System?
+
+Let’s quickly [recap](https://www.cloudquery.io/blog/building-cloudquery#type-system) what a type system is and why an ELT framework needs one. At a very high level, an ELT framework extracts data from some source and moves it to some destination with a specific schema.
+
+```text
+API ---> [Source Plugin]  ----->    [Destination Plugin]
+                          ----->    [Destination Plugin]
+                           gRPC
+```

Review Comment:
   It would be nice to have a diagram showing where the schemas fit in the flow of things. We have an `img` folder where diagrams can be added if need be.



##########
_posts/2023-04-24-adopting-apache-arrow-at-cloudquery.md:
##########
@@ -0,0 +1,65 @@
+---
+layout: post
+title: "Adopting Apache Arrow at CloudQuery"
+date: "2023-04-26 00:00:00"
+author: Yevgeny Pats
+categories: [application]
+---
+
+This post is a collaboration with CloudQuery and cross-posted on the CloudQuery [blog](https://cloudquery.io/blog/adopting-apache-arrow-at-cloudquery).
+
+[CloudQuery](https://github.com/cloudquery/cloudquery) is an open source high performance ELT framework written in Go. We [previously](https://www.cloudquery.io/blog/building-cloudquery) discussed some of the architecture and design decisions that we took to build a performant ELT framework. A type system is a key component for creating a performant and scalable ELT framework where sources and destinations are decoupled. In this blog post we will go through why we decided to adopt Apache Arrow as our type system and replace our in-house implementation.
+
+# What is a Type System?
+
+Let’s quickly [recap](https://www.cloudquery.io/blog/building-cloudquery#type-system) what a type system is and why an ELT framework needs one. At a very high level, an ELT framework extracts data from some source and moves it to some destination with a specific schema.
+
+```text
+API ---> [Source Plugin]  ----->    [Destination Plugin]
+                          ----->    [Destination Plugin]
+                           gRPC
+```
+
+
+Sources and destinations are decoupled and communicate via gRPC. This is crucial to allowing the addition of new destinations and updating old destinations without requiring updates to source plugin code (which otherwise would introduce an unmaintainable architecture).
+
+This is where a type system comes in. Source plugins extract information from APIs in the most performant way possible, defining a schema and then transforming the result from the API (JSON or any other format) to a well-defined type system. The destination plugin can then easily create the schema for its database and transform the incoming data to the destination types. So to recap, the source plugin sends mainly two things to a destination: 1) schema 2) data that fits the defined schema.
+
+# Why Arrow?
+
+Before Arrow, we used our own type system that supported more than 14 types. This served us well, but we started to hit limitations in various use-cases. For example, in database to database replication, we needed to support many more types, including nested types. Also, performance-wise, lots of the time spent in an ELT process is around converting data from one format to another, so we wanted to take a step back and see if we can avoid this [famous XKCD](https://xkcd.com/927/) (by building yet another format):

Review Comment:
   > time spent in an ELT process is around converting data from one format to another
   
   again, the schemas? Or the actual arrays / data?



##########
_posts/2023-04-24-adopting-apache-arrow-at-cloudquery.md:
##########
@@ -0,0 +1,65 @@
+---
+layout: post
+title: "Adopting Apache Arrow at CloudQuery"
+date: "2023-04-26 00:00:00"
+author: Yevgeny Pats
+categories: [application]
+---
+
+This post is a collaboration with CloudQuery and cross-posted on the CloudQuery [blog](https://cloudquery.io/blog/adopting-apache-arrow-at-cloudquery).
+
+[CloudQuery](https://github.com/cloudquery/cloudquery) is an open source high performance ELT framework written in Go. We [previously](https://www.cloudquery.io/blog/building-cloudquery) discussed some of the architecture and design decisions that we took to build a performant ELT framework. A type system is a key component for creating a performant and scalable ELT framework where sources and destinations are decoupled. In this blog post we will go through why we decided to adopt Apache Arrow as our type system and replace our in-house implementation.
+
+# What is a Type System?
+
+Let’s quickly [recap](https://www.cloudquery.io/blog/building-cloudquery#type-system) what a type system is and why an ELT framework needs one. At a very high level, an ELT framework extracts data from some source and moves it to some destination with a specific schema.
+
+```text
+API ---> [Source Plugin]  ----->    [Destination Plugin]
+                          ----->    [Destination Plugin]
+                           gRPC
+```
+
+
+Sources and destinations are decoupled and communicate via gRPC. This is crucial to allowing the addition of new destinations and updating old destinations without requiring updates to source plugin code (which otherwise would introduce an unmaintainable architecture).
+
+This is where a type system comes in. Source plugins extract information from APIs in the most performant way possible, defining a schema and then transforming the result from the API (JSON or any other format) to a well-defined type system. The destination plugin can then easily create the schema for its database and transform the incoming data to the destination types. So to recap, the source plugin sends mainly two things to a destination: 1) schema 2) data that fits the defined schema.
+
+# Why Arrow?
+
+Before Arrow, we used our own type system that supported more than 14 types. This served us well, but we started to hit limitations in various use-cases. For example, in database to database replication, we needed to support many more types, including nested types. Also, performance-wise, lots of the time spent in an ELT process is around converting data from one format to another, so we wanted to take a step back and see if we can avoid this [famous XKCD](https://xkcd.com/927/) (by building yet another format):
+
+<figure style="text-align: center;">
+  <img src="https://imgs.xkcd.com/comics/standards.png" width="100%" class="img-responsive" alt="Yet another standard XKCD">
+</figure>
+
+
+This is where Arrow comes in. Apache Arrow defines a language-independent columnar format for flat and hierarchical data, and brings the following advantages:
+
+1. Cross-language support, with extensive libraries: the [format](https://arrow.apache.org/docs/format/Columnar.html) is defined in such a way that you can parse it in any language, and it already has extensive support in C/C++, C#, Go, Java, JavaScript, Julia, Matlab, Python, R, Ruby and Rust (at the time of writing). For CloudQuery this is important, as it makes it much easier to develop source or destination plugins in different languages.
+2. Performance: Arrow adoption is rising, especially in columnar databases ([DuckDB](https://duckdb.org/2021/12/03/duck-arrow.html), [ClickHouse](https://clickhouse.com/docs/en/integrations/data-formats/arrow-avro-orc), [BigQuery](https://cloud.google.com/bigquery/docs/samples/bigquerystorage-arrow-quickstart)) and file formats ([Parquet](https://arrow.apache.org/docs/python/parquet.html)). This makes it easier to write CloudQuery source or destination plugins for databases that already support Arrow, and much more efficient, as it removes an additional serialization and transformation step. Moreover, sending the Arrow format from a source plugin to a destination is itself faster and more memory efficient, given its “zero-copy” nature and the absence of a serialization/deserialization step.

Review Comment:
   It's a little unclear to me: do you use Arrow arrays at all, or just the types? And if it's just the data types, then maybe mention that Arrow uses FlatBuffers to define its type system, which is part of why exchanging them is performant.





[GitHub] [arrow-site] zeroshade commented on a diff in pull request #348: [Website]: Adopting Apache Arrow at CloudQuery

Posted by "zeroshade (via GitHub)" <gi...@apache.org>.
zeroshade commented on code in PR #348:
URL: https://github.com/apache/arrow-site/pull/348#discussion_r1178152277


##########
_posts/2023-04-24-adopting-apache-arrow-at-cloudquery.md:
##########
@@ -0,0 +1,63 @@
+---
+layout: post
+title: "Adopting Apache Arrow at CloudQuery"
+date: "2023-04-26 00:00:00"
+author: Yevgeny Pats
+categories: [application]
+---
+
+[CloudQuery](https://github.com/cloudquery/cloudquery) is an open source high performance ELT framework written in Go. We [previously](https://www.cloudquery.io/blog/building-cloudquery) discussed some of the architecture and design decisions that we took to build a performant ELT framework. A type system is one of the key components of a performant and scalable ELT framework where sources and destinations are decoupled. In this blog we will go through why we decided to adopt Apache Arrow as our type system and replace our in-house implementation.

Review Comment:
   ```suggestion
   [CloudQuery](https://github.com/cloudquery/cloudquery) is an open source high performance ELT framework written in Go. We [previously](https://www.cloudquery.io/blog/building-cloudquery) discussed some of the architecture and design decisions that we took to build a performant ELT framework. A type system is a key component for creating a performant and scalable ELT framework where sources and destinations are decoupled. In this blog post we will go through why we decided to adopt Apache Arrow as our type system and replace our in-house implementation.
   ```



##########
_posts/2023-04-24-adopting-apache-arrow-at-cloudquery.md:
##########
@@ -0,0 +1,63 @@
+---
+layout: post
+title: "Adopting Apache Arrow at CloudQuery"
+date: "2023-04-26 00:00:00"
+author: Yevgeny Pats
+categories: [application]
+---
+
+[CloudQuery](https://github.com/cloudquery/cloudquery) is an open source high performance ELT framework written in Go. We [previously](https://www.cloudquery.io/blog/building-cloudquery) discussed some of the architecture and design decisions that we took to build a performant ELT framework. A type system is one of the key components of a performant and scalable ELT framework where sources and destinations are decoupled. In this blog we will go through why we decided to adopt Apache Arrow as our type system and replace our in-house implementation.
+
+# What is a Type System?
+
+Let’s quickly [recap](https://www.cloudquery.io/blog/building-cloudquery#type-system) what type system is and why it is needed in an ELT framework. At a very high level ELT framework extracts data from a source and moves it to a destination with a specific schema.

Review Comment:
   ```suggestion
   Let’s quickly [recap](https://www.cloudquery.io/blog/building-cloudquery#type-system) what a type system is and why an ELT framework needs one. At a very high level, an ELT framework extracts data from some source and moves it to some destination with a specific schema.
   ```



##########
_posts/2023-04-24-adopting-apache-arrow-at-cloudquery.md:
##########
@@ -0,0 +1,63 @@
+---
+layout: post
+title: "Adopting Apache Arrow at CloudQuery"
+date: "2023-04-26 00:00:00"
+author: Yevgeny Pats
+categories: [application]
+---
+
+[CloudQuery](https://github.com/cloudquery/cloudquery) is an open source high performance ELT framework written in Go. We [previously](https://www.cloudquery.io/blog/building-cloudquery) discussed some of the architecture and design decisions that we took to build a performant ELT framework. A type system is one of the key components of a performant and scalable ELT framework where sources and destinations are decoupled. In this blog we will go through why we decided to adopt Apache Arrow as our type system and replace our in-house implementation.
+
+# What is a Type System?
+
+Let’s quickly [recap](https://www.cloudquery.io/blog/building-cloudquery#type-system) what type system is and why it is needed in an ELT framework. At a very high level ELT framework extracts data from a source and moves it to a destination with a specific schema.
+
+```text
+API ---> [Source Plugin]  ----->    [Destination Plugin]
+                          ----->    [Destination Plugin]
+                           gRPC
+```
+
+
+Sources and destinations are decoupled and communicate via gRPC. This is crucial because this way we can add new destinations and update old destinations without updating source plugins code(which otherwise would introduce an unmaintainable architecture).

Review Comment:
   ```suggestion
   Sources and destinations are decoupled and communicate via gRPC. This is crucial to allowing the addition of new destinations and updating old destinations without requiring updates to source plugin code (which otherwise would introduce an unmaintainable architecture).
   ```



##########
_posts/2023-04-24-adopting-apache-arrow-at-cloudquery.md:
##########
@@ -0,0 +1,63 @@
+---
+layout: post
+title: "Adopting Apache Arrow at CloudQuery"
+date: "2023-04-26 00:00:00"
+author: Yevgeny Pats
+categories: [application]
+---
+
+[CloudQuery](https://github.com/cloudquery/cloudquery) is an open source high performance ELT framework written in Go. We [previously](https://www.cloudquery.io/blog/building-cloudquery) discussed some of the architecture and design decisions that we took to build a performant ELT framework. A type system is one of the key components of a performant and scalable ELT framework where sources and destinations are decoupled. In this blog we will go through why we decided to adopt Apache Arrow as our type system and replace our in-house implementation.
+
+# What is a Type System?
+
+Let’s quickly [recap](https://www.cloudquery.io/blog/building-cloudquery#type-system) what type system is and why it is needed in an ELT framework. At a very high level ELT framework extracts data from a source and moves it to a destination with a specific schema.
+
+```text
+API ---> [Source Plugin]  ----->    [Destination Plugin]
+                          ----->    [Destination Plugin]
+                           gRPC
+```
+
+
+Sources and destinations are decoupled and communicate via gRPC. This is crucial because this way we can add new destinations and update old destinations without updating source plugins code(which otherwise would introduce an unmaintainable architecture).
+
+This is where the type system comes in. Source plugin extracts information from APIs in the most performant way, defines a schema and then transforms the result from the API (JSON or any other format) to a well-defined type system so the destination plugin will be able to easily create the schema for its database and transform the data to the destination type. So to recap, the source plugin sends mainly two things to a destination: 1) schema 2) data that fits the defined schema.

Review Comment:
   ```suggestion
   This is where a type system comes in. Source plugins extract information from APIs in the most performant way possible, defining a schema and then transforming the result from the API (JSON or any other format) to a well-defined type system. The destination plugin can then easily create the schema for its database and transform the incoming data to the destination types. So to recap, the source plugin sends mainly two things to a destination: 1) schema 2) data that fits the defined schema.
   ```



##########
_posts/2023-04-24-adopting-apache-arrow-at-cloudquery.md:
##########
@@ -0,0 +1,63 @@
+---
+layout: post
+title: "Adopting Apache Arrow at CloudQuery"
+date: "2023-04-26 00:00:00"
+author: Yevgeny Pats
+categories: [application]
+---
+<!--
+{% comment %}
+Licensed to the Apache Software Foundation (ASF) under one or more
+contributor license agreements.  See the NOTICE file distributed with
+this work for additional information regarding copyright ownership.
+The ASF licenses this file to you under the Apache License, Version 2.0
+(the "License"); you may not use this file except in compliance with
+the License.  You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+{% endcomment %}
+-->
+
+[CloudQuery](https://github.com/cloudquery/cloudquery) is an open source high performance ELT framework written in Go. We [previously](https://www.cloudquery.io/blog/building-cloudquery) discussed some of the architecture and design decisions that we took to build a performant ELT framework. A type system is one of the key components of a performant and scalable ELT framework where sources and destinations are decoupled. In this blog we will go through why we decided to adopt Apache Arrow as our type system and replace our in-house implementation.
+
+# What is a Type System?
+
+Let’s quickly [recap](https://www.cloudquery.io/blog/building-cloudquery#type-system) what type system is and why it is needed in an ELT framework. At a very high level ELT framework extracts data from a source and moves it to a destination with a specific schema.
+
+```text
+API ---> [Source Plugin]  ----->    [Destination Plugin]
+                          ----->    [Destination Plugin]
+                           gRPC
+```
+
+
+Sources and destinations are decoupled and communicate via gRPC. This is crucial because this way we can add new destinations and update old destinations without updating source plugins code(which otherwise would introduce an unmaintainable architecture).
+
+This is where the type system comes in. Source plugin extracts information from APIs in the most performant way, defines a schema and then transforms the result from the API (JSON or any other format) to a well-defined type system so the destination plugin will be able to easily create the schema for its database and transform the data to the destination type. So to recap, the source plugin sends mainly two things to a destination: 1) schema 2) data that fits the defined schema.
+
+# Why Arrow?
+
+Before Arrow, we used our own type system that supported more than 14 types. This served us well, but we started to hit limitations in various use-cases. For example, in database to database replication, we needed to support many more types, including nested types. Also, performance-wise, lots of the time spent in an ELT process is around converting data from one format to another, so we wanted to take a step back and see if we can avoid this famous XKCD (by building yet another format):
+
+<figure style="text-align: center;">
+  <img src="https://imgs.xkcd.com/comics/standards.png" width="100%" class="img-responsive" alt="Yet another standard XKCD">
+</figure>
+
+
+This is where Arrow comes in. Apache arrow defines a cross language columnar format for flat and hierarchical data, and brings the following advantages:

Review Comment:
   ```suggestion
   This is where Arrow comes in. Apache Arrow defines a language-independent columnar format for flat and hierarchical data, and brings the following advantages:
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-site] ianmcook commented on a diff in pull request #348: [Website]: Adopting Apache Arrow at CloudQuery

Posted by "ianmcook (via GitHub)" <gi...@apache.org>.
ianmcook commented on code in PR #348:
URL: https://github.com/apache/arrow-site/pull/348#discussion_r1180612260


##########
_posts/2023-04-24-adopting-apache-arrow-at-cloudquery.md:
##########
@@ -0,0 +1,65 @@
+---
+layout: post
+title: "Adopting Apache Arrow at CloudQuery"
+date: "2023-04-26 00:00:00"
+author: Yevgeny Pats
+categories: [application]
+---
+<!--
+{% comment %}
+Licensed to the Apache Software Foundation (ASF) under one or more
+contributor license agreements.  See the NOTICE file distributed with
+this work for additional information regarding copyright ownership.
+The ASF licenses this file to you under the Apache License, Version 2.0
+(the "License"); you may not use this file except in compliance with
+the License.  You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+{% endcomment %}
+-->
+
+This post is a collaboration with CloudQuery and cross-posted on the CloudQuery [blog](https://cloudquery.io/blog/adopting-apache-arrow-at-cloudquery).
+
+[CloudQuery](https://github.com/cloudquery/cloudquery) is an open source high performance ELT framework written in Go. We [previously](https://www.cloudquery.io/blog/building-cloudquery) discussed some of the architecture and design decisions that we took to build a performant ELT framework. A type system is a key component for creating a performant and scalable ELT framework where sources and destinations are decoupled. In this blog post we will go through why we decided to adopt Apache Arrow as our type system and replace our in-house implementation.
+
+# What is a Type System?
+
+Let’s quickly [recap](https://www.cloudquery.io/blog/building-cloudquery#type-system) what a type system is and why an ELT framework needs one. At a very high level, an ELT framework extracts data from some source and moves it to some destination with a specific schema.
+
+```text
+API ---> [Source Plugin]  ----->    [Destination Plugin]
+                          ----->    [Destination Plugin]
+                           gRPC
+```
+
+
+Sources and destinations are decoupled and communicate via gRPC. This is crucial to allowing the addition of new destinations and updating old destinations without requiring updates to source plugin code (which otherwise would introduce an unmaintainable architecture).
+
+This is where a type system comes in. Source plugins extract information from APIs in the most performant way possible, defining a schema and then transforming the result from the API (JSON or any other format) to a well-defined type system. The destination plugin can then easily create the schema for its database and transform the incoming data to the destination types. So to recap, the source plugin sends mainly two things to a destination: 1) schema 2) data that fits the defined schema.
+
+# Why Arrow?
+
+Before Arrow, we used our own type system that supported more than 14 types. This served us well, but we started to hit limitations in various use-cases. For example, in database to database replication, we needed to support many more types, including nested types. Also, performance-wise, lots of the time spent in an ELT process is around converting data from one format to another, so we wanted to take a step back and see if we can avoid this famous XKCD (by building yet another format):
+
+<figure style="text-align: center;">
+  <img src="https://imgs.xkcd.com/comics/standards.png" width="100%" class="img-responsive" alt="Yet another standard XKCD">

Review Comment:
   Actually xkcd says [hotlinking is fine](https://xkcd.com/license.html) but they prefer that folks include a link to the original comic (https://xkcd.com/927/).





[GitHub] [arrow-site] alamb commented on a diff in pull request #348: [Website]: Adopting Apache Arrow at CloudQuery

Posted by "alamb (via GitHub)" <gi...@apache.org>.
alamb commented on code in PR #348:
URL: https://github.com/apache/arrow-site/pull/348#discussion_r1181211492


##########
_posts/2023-04-24-adopting-apache-arrow-at-cloudquery.md:
##########
@@ -0,0 +1,65 @@
+---
+layout: post
+title: "Adopting Apache Arrow at CloudQuery"
+date: "2023-04-26 00:00:00"
+author: Yevgeny Pats
+categories: [application]
+---
+<!--
+{% comment %}
+Licensed to the Apache Software Foundation (ASF) under one or more
+contributor license agreements.  See the NOTICE file distributed with
+this work for additional information regarding copyright ownership.
+The ASF licenses this file to you under the Apache License, Version 2.0
+(the "License"); you may not use this file except in compliance with
+the License.  You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+{% endcomment %}
+-->
+
+This post is a collaboration with CloudQuery and cross-posted on the CloudQuery [blog](https://cloudquery.io/blog/adopting-apache-arrow-at-cloudquery).
+
+[CloudQuery](https://github.com/cloudquery/cloudquery) is an open source high performance ELT framework written in Go. We [previously](https://www.cloudquery.io/blog/building-cloudquery) discussed some of the [architecture](https://www.cloudquery.io/docs/developers/architecture) and design decisions that we took to build a performant ELT framework. A type system is a key component for creating a performant and scalable ELT framework where sources and destinations are decoupled. In this blog post we will go through why we decided to adopt Apache Arrow as our type system and replace our in-house implementation.
+
+# What is a Type System?
+
+Let’s quickly [recap](https://www.cloudquery.io/blog/building-cloudquery#type-system) what a type system is and why an ELT framework needs one. At a very high level, an ELT framework extracts data from some source and moves it to some destination with a specific schema.
+
+```text
+API ---> [Source Plugin]  ----->    [Destination Plugin]
+                          ----->    [Destination Plugin]
+                           gRPC
+```
+
+
+Sources and destinations are decoupled and communicate via gRPC. This is crucial to allowing the addition of new destinations and updating old destinations without requiring updates to source plugin code (which otherwise would introduce an unmaintainable architecture).
+
+This is where a type system comes in. Source plugins extract information from APIs in the most performant way possible, defining a schema and then transforming the result from the API (JSON or any other format) to a well-defined type system. The destination plugin can then easily create the schema for its database and transform the incoming data to the destination types. So to recap, the source plugin sends mainly two things to a destination: 1) Schema 2) Data that fits the defined schema (Arrow Arrays).
+
+# Why Arrow?
+
+Before Arrow, we used our own type system that supported more than 14 types. This served us well, but we started to hit limitations in various use-cases. For example, in database to database replication, we needed to support many more types, including nested types. Also, performance-wise, lots of the time spent in an ELT process is around converting data from one format to another, so we wanted to take a step back and see if we can avoid this [famous XKCD](https://xkcd.com/927/) (by building yet another format):
+
+<figure style="text-align: center;">
+  <img src="https://imgs.xkcd.com/comics/standards.png" width="100%" class="img-responsive" alt="Yet another standard XKCD">
+</figure>
+
+
+This is where Arrow comes in. Apache Arrow defines a language-independent columnar format for flat and hierarchical data, and brings the following advantages:
+
+1. Cross-language with extensive libraries for different languages - The [format](https://arrow.apache.org/docs/format/Columnar.html) is defined via flatbuffers in such way that you can parse it in any language and already has extensive support in C/C++, C#, Go, Java, JavaScript, Julia, Matlab, Python, R, Ruby and Rust (at the time of writing). For CloudQuery this is important as it makes it much easier to develop source or destination plugins in different languages.
+2. Performance: Arrow adoption is rising especially in columnar based databases ([DuckDB](https://duckdb.org/2021/12/03/duck-arrow.html), [ClickHouse](https://clickhouse.com/docs/en/integrations/data-formats/arrow-avro-orc), [BigQuery](https://cloud.google.com/bigquery/docs/samples/bigquerystorage-arrow-quickstart)) and file formats ([Parquet](https://arrow.apache.org/docs/python/parquet.html)) which makes it easier to write CloudQuery destination or source plugins for databases that already support Arrow as well as much more efficient as we remove the need for additional serialization and transformation step. Moreover, just the performance of sending Arrow format from source plugin to destination is already more performant and memory efficient, given its “zero-copy” nature and not needing serialization/deserialization. 
+3. Rich Data Types: Arrow supports more than [35 types](https://arrow.apache.org/docs/python/api/datatypes.html) including composite types (i.e. lists, structs and maps of all the available types) and ability to extend the type system with custom types.

Review Comment:
   A somewhat shameless self-promotion: another advantage is that the mapping to/from the Arrow type system and the Parquet type system (including nested types) is supported by many of the Arrow libraries, as explained in
   
   https://arrow.apache.org/blog/2022/10/08/arrow-parquet-encoding-part-2/
   





[GitHub] [arrow-site] alamb merged pull request #348: [Website]: Adopting Apache Arrow at CloudQuery

Posted by "alamb (via GitHub)" <gi...@apache.org>.
alamb merged PR #348:
URL: https://github.com/apache/arrow-site/pull/348




[GitHub] [arrow-site] ianmcook commented on a diff in pull request #348: [Website]: Adopting Apache Arrow at CloudQuery

Posted by "ianmcook (via GitHub)" <gi...@apache.org>.
ianmcook commented on code in PR #348:
URL: https://github.com/apache/arrow-site/pull/348#discussion_r1180609629


##########
_posts/2023-04-24-adopting-apache-arrow-at-cloudquery.md:
##########
@@ -0,0 +1,65 @@
+---
+layout: post
+title: "Adopting Apache Arrow at CloudQuery"
+date: "2023-04-26 00:00:00"
+author: Yevgeny Pats
+categories: [application]
+---
+<!--
+{% comment %}
+Licensed to the Apache Software Foundation (ASF) under one or more
+contributor license agreements.  See the NOTICE file distributed with
+this work for additional information regarding copyright ownership.
+The ASF licenses this file to you under the Apache License, Version 2.0
+(the "License"); you may not use this file except in compliance with
+the License.  You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+{% endcomment %}
+-->
+
+This post is a collaboration with CloudQuery and cross-posted on the CloudQuery [blog](https://cloudquery.io/blog/adopting-apache-arrow-at-cloudquery).
+
+[CloudQuery](https://github.com/cloudquery/cloudquery) is an open source high performance ELT framework written in Go. We [previously](https://www.cloudquery.io/blog/building-cloudquery) discussed some of the architecture and design decisions that we took to build a performant ELT framework. A type system is a key component for creating a performant and scalable ELT framework where sources and destinations are decoupled. In this blog post we will go through why we decided to adopt Apache Arrow as our type system and replace our in-house implementation.
+
+# What is a Type System?
+
+Let’s quickly [recap](https://www.cloudquery.io/blog/building-cloudquery#type-system) what a type system is and why an ELT framework needs one. At a very high level, an ELT framework extracts data from some source and moves it to some destination with a specific schema.
+
+```text
+API ---> [Source Plugin]  ----->    [Destination Plugin]
+                          ----->    [Destination Plugin]
+                           gRPC
+```
+
+
+Sources and destinations are decoupled and communicate via gRPC. This is crucial to allowing the addition of new destinations and updating old destinations without requiring updates to source plugin code (which otherwise would introduce an unmaintainable architecture).
+
+This is where a type system comes in. Source plugins extract information from APIs in the most performant way possible, defining a schema and then transforming the result from the API (JSON or any other format) to a well-defined type system. The destination plugin can then easily create the schema for its database and transform the incoming data to the destination types. So to recap, the source plugin sends mainly two things to a destination: 1) schema 2) data that fits the defined schema.
+
+# Why Arrow?
+
+Before Arrow, we used our own type system that supported more than 14 types. This served us well, but we started to hit limitations in various use-cases. For example, in database to database replication, we needed to support many more types, including nested types. Also, performance-wise, lots of the time spent in an ELT process is around converting data from one format to another, so we wanted to take a step back and see if we can avoid this famous XKCD (by building yet another format):
+
+<figure style="text-align: center;">
+  <img src="https://imgs.xkcd.com/comics/standards.png" width="100%" class="img-responsive" alt="Yet another standard XKCD">

Review Comment:
   I think it is preferable to add this PNG file into the PR at the path `img/20230424-xkcd-standards.png` and change the `src` here to point to that.





[GitHub] [arrow-site] ianmcook commented on a diff in pull request #348: [Website]: Adopting Apache Arrow at CloudQuery

Posted by "ianmcook (via GitHub)" <gi...@apache.org>.
ianmcook commented on code in PR #348:
URL: https://github.com/apache/arrow-site/pull/348#discussion_r1180831564


##########
_posts/2023-04-24-adopting-apache-arrow-at-cloudquery.md:
##########
@@ -0,0 +1,65 @@
+---
+layout: post
+title: "Adopting Apache Arrow at CloudQuery"
+date: "2023-04-26 00:00:00"
+author: Yevgeny Pats
+categories: [application]
+---
+<!--
+{% comment %}
+Licensed to the Apache Software Foundation (ASF) under one or more
+contributor license agreements.  See the NOTICE file distributed with
+this work for additional information regarding copyright ownership.
+The ASF licenses this file to you under the Apache License, Version 2.0
+(the "License"); you may not use this file except in compliance with
+the License.  You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+{% endcomment %}
+-->
+
+This post is a collaboration with CloudQuery and cross-posted on the CloudQuery [blog](https://cloudquery.io/blog/adopting-apache-arrow-at-cloudquery).
+
+[CloudQuery](https://github.com/cloudquery/cloudquery) is an open source high performance ELT framework written in Go. We [previously](https://www.cloudquery.io/blog/building-cloudquery) discussed some of the architecture and design decisions that we took to build a performant ELT framework. A type system is a key component for creating a performant and scalable ELT framework where sources and destinations are decoupled. In this blog post we will go through why we decided to adopt Apache Arrow as our type system and replace our in-house implementation.
+
+# What is a Type System?
+
+Let’s quickly [recap](https://www.cloudquery.io/blog/building-cloudquery#type-system) what a type system is and why an ELT framework needs one. At a very high level, an ELT framework extracts data from some source and moves it to some destination with a specific schema.
+
+```text
+API ---> [Source Plugin]  ----->    [Destination Plugin]
+                          ----->    [Destination Plugin]
+                           gRPC
+```
+
+
+Sources and destinations are decoupled and communicate via gRPC. This is crucial to allowing the addition of new destinations and updating old destinations without requiring updates to source plugin code (which otherwise would introduce an unmaintainable architecture).
+
+This is where a type system comes in. Source plugins extract information from APIs in the most performant way possible, defining a schema and then transforming the result from the API (JSON or any other format) to a well-defined type system. The destination plugin can then easily create the schema for its database and transform the incoming data to the destination types. So to recap, the source plugin sends mainly two things to a destination: 1) schema 2) data that fits the defined schema.
+
+# Why Arrow?
+
+Before Arrow, we used our own type system that supported more than 14 types. This served us well, but we started to hit limitations in various use-cases. For example, in database to database replication, we needed to support many more types, including nested types. Also, performance-wise, lots of the time spent in an ELT process is around converting data from one format to another, so we wanted to take a step back and see if we can avoid this [famous XKCD](https://xkcd.com/927/) (by building yet another format):
+
+<figure style="text-align: center;">
+  <img src="https://imgs.xkcd.com/comics/standards.png" width="100%" class="img-responsive" alt="Yet another standard XKCD">
+</figure>
+
+
+This is where Arrow comes in. Apache Arrow defines a language-independent columnar format for flat and hierarchical data, and brings the following advantages:
+
+1. Cross-language with extensive libraries for different languages - The [format](https://arrow.apache.org/docs/format/Columnar.html) is defined in such way that you can parse it in any language and already has extensive support in C/C++, C#, Go, Java, JavaScript, Julia, Matlab, Python, R, Ruby and Rust (at the time of writing). For CloudQuery this is important as it makes it much easier to develop source or destination plugins in different languages.
+2. Performance: Arrow adoption is rising especially in columnar based databases ([DuckDB](https://duckdb.org/2021/12/03/duck-arrow.html), [ClickHouse](https://clickhouse.com/docs/en/integrations/data-formats/arrow-avro-orc), [BigQuery](https://cloud.google.com/bigquery/docs/samples/bigquerystorage-arrow-quickstart)) and file formats ([Parquet](https://arrow.apache.org/docs/python/parquet.html)) which makes it easier to write CloudQuery destination or source plugins for databases that already support arrow as well as much more efficient as we remove the need for additional serialization and transformation step. Moreover, just the performance of sending arrow format from source plugin to destination is already more performant and memory efficient, given its “zero-copy” nature and not needing serialization/deserialization.
+3. Rich Data Types: Arrow supports more than [35 types](https://arrow.apache.org/docs/python/api/datatypes.html) including composite types (i.e. lists, structs and maps of all the available types) and ability to extend the type system with custom types.
+
+# Summary
+
+Adopting Apache Arrow as the CloudQuery in-memory type system enables us to gain better performance, data interoperability and developer experience. Some plugins that are going to gain an immediate boost of rich type systems are our db->db replication plugins such as [PostgreSQL CDC](https://www.cloudquery.io/docs/plugins/sources/postgresql/overview) source plugin (and all [database destinations](https://www.cloudquery.io/docs/plugins/destinations/overview)) that are going to get support for all available types including nested ones.

Review Comment:
   ```suggestion
   Adopting Apache Arrow as the CloudQuery in-memory type system enables us to gain better performance, data interoperability and developer experience. Some plugins that are going to gain an immediate boost of rich type systems are our database-to-database replication plugins such as [PostgreSQL CDC](https://www.cloudquery.io/docs/plugins/sources/postgresql/overview) source plugin (and all [database destinations](https://www.cloudquery.io/docs/plugins/destinations/overview)) that are going to get support for all available types including nested ones.
   ```





[GitHub] [arrow-site] ianmcook commented on a diff in pull request #348: [Website]: Adopting Apache Arrow at CloudQuery

Posted by "ianmcook (via GitHub)" <gi...@apache.org>.
ianmcook commented on code in PR #348:
URL: https://github.com/apache/arrow-site/pull/348#discussion_r1180721624


##########
_posts/2023-04-24-adopting-apache-arrow-at-cloudquery.md:
##########
@@ -0,0 +1,65 @@
+---
+layout: post
+title: "Adopting Apache Arrow at CloudQuery"
+date: "2023-04-26 00:00:00"
+author: Yevgeny Pats
+categories: [application]
+---
+<!--
+{% comment %}
+Licensed to the Apache Software Foundation (ASF) under one or more
+contributor license agreements.  See the NOTICE file distributed with
+this work for additional information regarding copyright ownership.
+The ASF licenses this file to you under the Apache License, Version 2.0
+(the "License"); you may not use this file except in compliance with
+the License.  You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+{% endcomment %}
+-->
+
+This post is a collaboration with CloudQuery and cross-posted on the CloudQuery [blog](https://cloudquery.io/blog/adopting-apache-arrow-at-cloudquery).
+
+[CloudQuery](https://github.com/cloudquery/cloudquery) is an open source high performance ELT framework written in Go. We [previously](https://www.cloudquery.io/blog/building-cloudquery) discussed some of the architecture and design decisions that we took to build a performant ELT framework. A type system is a key component for creating a performant and scalable ELT framework where sources and destinations are decoupled. In this blog post we will go through why we decided to adopt Apache Arrow as our type system and replace our in-house implementation.
+
+# What is a Type System?
+
+Let’s quickly [recap](https://www.cloudquery.io/blog/building-cloudquery#type-system) what a type system is and why an ELT framework needs one. At a very high level, an ELT framework extracts data from some source and moves it to some destination with a specific schema.
+
+```text
+API ---> [Source Plugin]  ----->    [Destination Plugin]
+                          ----->    [Destination Plugin]
+                           gRPC
+```
+
+
+Sources and destinations are decoupled and communicate via gRPC. This is crucial to allowing the addition of new destinations and updating old destinations without requiring updates to source plugin code (which otherwise would introduce an unmaintainable architecture).
+
+This is where a type system comes in. Source plugins extract information from APIs in the most performant way possible, defining a schema and then transforming the result from the API (JSON or any other format) to a well-defined type system. The destination plugin can then easily create the schema for its database and transform the incoming data to the destination types. So to recap, the source plugin sends mainly two things to a destination: 1) schema 2) data that fits the defined schema.
+
+# Why Arrow?
+
+Before Arrow, we used our own type system that supported more than 14 types. This served us well, but we started to hit limitations in various use-cases. For example, in database to database replication, we needed to support many more types, including nested types. Also, performance-wise, lots of the time spent in an ELT process is around converting data from one format to another, so we wanted to take a step back and see if we can avoid this famous XKCD (by building yet another format):

Review Comment:
   This resolves my below comment
   ```suggestion
   Before Arrow, we used our own type system that supported more than 14 types. This served us well, but we started to hit limitations in various use-cases. For example, in database to database replication, we needed to support many more types, including nested types. Also, performance-wise, lots of the time spent in an ELT process is around converting data from one format to another, so we wanted to take a step back and see if we can avoid this [famous XKCD](https://xkcd.com/927/) (by building yet another format):
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-site] alamb commented on pull request #348: [Website]: Adopting Apache Arrow at CloudQuery

Posted by "alamb (via GitHub)" <gi...@apache.org>.
alamb commented on PR #348:
URL: https://github.com/apache/arrow-site/pull/348#issuecomment-1534517659

   FYI, when I was testing this page locally I found that the CloudQuery blog link is broken:
   
   https://cloudquery.io/blog/adopting-apache-arrow-at-cloudquery
   ![Screenshot 2023-05-04 at 6 31 06 AM](https://user-images.githubusercontent.com/490673/236179830-9d1f69f1-503d-446d-ac3c-9b1c8116e247.png)
   
   I am assuming this will go live after the Arrow site post does; if that is not the case, we can make a new PR to fix the post after it is published.
   
   I will file a follow-on PR to update the publish date. Thanks again, everyone!
   




[GitHub] [arrow-site] wjones127 commented on a diff in pull request #348: [Website]: Adopting Apache Arrow at CloudQuery

Posted by "wjones127 (via GitHub)" <gi...@apache.org>.
wjones127 commented on code in PR #348:
URL: https://github.com/apache/arrow-site/pull/348#discussion_r1181266067


##########
_posts/2023-04-24-adopting-apache-arrow-at-cloudquery.md:
##########
@@ -0,0 +1,65 @@
+---
+layout: post
+title: "Adopting Apache Arrow at CloudQuery"
+date: "2023-04-30 00:00:00"
+author: Yevgeny Pats
+categories: [application]
+---
+<!--
+{% comment %}
+Licensed to the Apache Software Foundation (ASF) under one or more
+contributor license agreements.  See the NOTICE file distributed with
+this work for additional information regarding copyright ownership.
+The ASF licenses this file to you under the Apache License, Version 2.0
+(the "License"); you may not use this file except in compliance with
+the License.  You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+{% endcomment %}
+-->
+
+This post is a collaboration with CloudQuery and cross-posted on the CloudQuery [blog](https://cloudquery.io/blog/adopting-apache-arrow-at-cloudquery).
+
+[CloudQuery](https://github.com/cloudquery/cloudquery) is an open source high performance ELT framework written in Go. We [previously](https://www.cloudquery.io/blog/building-cloudquery) discussed some of the [architecture](https://www.cloudquery.io/docs/developers/architecture) and design decisions that we took to build a performant ELT framework. A type system is a key component for creating a performant and scalable ELT framework where sources and destinations are decoupled. In this blog post we will go through why we decided to adopt Apache Arrow as our type system and replace our in-house implementation.
+
+# What is a Type System?
+
+Let’s quickly [recap](https://www.cloudquery.io/blog/building-cloudquery#type-system) what a type system is and why an ELT framework needs one. At a very high level, an ELT framework extracts data from some source and moves it to some destination with a specific schema.
+
+```text
+API ---> [Source Plugin]  ----->    [Destination Plugin]
+                          ----->    [Destination Plugin]
+                           gRPC
+```
+
+
+Sources and destinations are decoupled and communicate via gRPC. This is crucial to allowing the addition of new destinations and updating old destinations without requiring updates to source plugin code (which otherwise would introduce an unmaintainable architecture).
+
+This is where a type system comes in. Source plugins extract information from APIs in the most performant way possible, defining a schema and then transforming the result from the API (JSON or any other format) to a well-defined type system. The destination plugin can then easily create the schema for its database and transform the incoming data to the destination types. So to recap, the source plugin sends mainly two things to a destination: 1) Schema 2) Data that fits the defined schema (Arrow Arrays).

Review Comment:
   ```suggestion
   This is where a type system comes in. Source plugins extract information from APIs in the most performant way possible, defining a schema and then transforming the result from the API (JSON or any other format) to a well-defined type system. The destination plugin can then easily create the schema for its database and transform the incoming data to the destination types. So to recap, the source plugin sends mainly two things to a destination: 1) the schema 2) the records that fit the defined schema. In Arrow terminology, these are a schema and a record batch.
   ```





[GitHub] [arrow-site] yevgenypats commented on pull request #348: [Website]: Adopting Apache Arrow at CloudQuery

Posted by "yevgenypats (via GitHub)" <gi...@apache.org>.
yevgenypats commented on PR #348:
URL: https://github.com/apache/arrow-site/pull/348#issuecomment-1523952236

   Thanks for the review! I've also added a cross-post link.
   
   We don't have interesting performance benchmarks yet (we might write another blog post in the future focused on performance; we did do some basic performance testing, just to make sure it performs at least in the same ballpark as what we had before).
   
   For us, as mentioned in the blog, the most important part was less the performance aspect (as long as it doesn't make performance worse than what we had with our in-house implementation, of course) and more: 1) the language-independent design, as we are interested in expanding the [CloudQuery SDK](https://github.com/cloudquery/plugin-sdk/) to other languages in the future; 2) the rich type system with nested type support; and 3) the fact that Arrow is already widely adopted by different solutions, which makes things easier for us as an ELT framework.




[GitHub] [arrow-site] yevgenypats commented on a diff in pull request #348: [Website]: Adopting Apache Arrow at CloudQuery

Posted by "yevgenypats (via GitHub)" <gi...@apache.org>.
yevgenypats commented on code in PR #348:
URL: https://github.com/apache/arrow-site/pull/348#discussion_r1181013592


##########
_posts/2023-04-24-adopting-apache-arrow-at-cloudquery.md:
##########
@@ -0,0 +1,65 @@
+---
+layout: post
+title: "Adopting Apache Arrow at CloudQuery"
+date: "2023-04-26 00:00:00"
+author: Yevgeny Pats
+categories: [application]
+---
+<!--
+{% comment %}
+Licensed to the Apache Software Foundation (ASF) under one or more
+contributor license agreements.  See the NOTICE file distributed with
+this work for additional information regarding copyright ownership.
+The ASF licenses this file to you under the Apache License, Version 2.0
+(the "License"); you may not use this file except in compliance with
+the License.  You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+{% endcomment %}
+-->
+
+This post is a collaboration with CloudQuery and cross-posted on the CloudQuery [blog](https://cloudquery.io/blog/adopting-apache-arrow-at-cloudquery).
+
+[CloudQuery](https://github.com/cloudquery/cloudquery) is an open source high performance ELT framework written in Go. We [previously](https://www.cloudquery.io/blog/building-cloudquery) discussed some of the architecture and design decisions that we took to build a performant ELT framework. A type system is a key component for creating a performant and scalable ELT framework where sources and destinations are decoupled. In this blog post we will go through why we decided to adopt Apache Arrow as our type system and replace our in-house implementation.
+
+# What is a Type System?
+
+Let’s quickly [recap](https://www.cloudquery.io/blog/building-cloudquery#type-system) what a type system is and why an ELT framework needs one. At a very high level, an ELT framework extracts data from some source and moves it to some destination with a specific schema.
+
+```text
+API ---> [Source Plugin]  ----->    [Destination Plugin]
+                          ----->    [Destination Plugin]
+                           gRPC
+```

Review Comment:
   I pointed to our docs at https://www.cloudquery.io/docs/developers/architecture, but I can copy that image here as well.





[GitHub] [arrow-site] alamb commented on pull request #348: [Website]: Adopting Apache Arrow at CloudQuery

Posted by "alamb (via GitHub)" <gi...@apache.org>.
alamb commented on PR #348:
URL: https://github.com/apache/arrow-site/pull/348#issuecomment-1534553997

   The website is live: https://arrow.apache.org/blog/2023/05/04/adopting-apache-arrow-at-cloudquery/




[GitHub] [arrow-site] yevgenypats commented on pull request #348: [Website]: Adopting Apache Arrow at CloudQuery

Posted by "yevgenypats (via GitHub)" <gi...@apache.org>.
yevgenypats commented on PR #348:
URL: https://github.com/apache/arrow-site/pull/348#issuecomment-1535304124

   Also, I tweeted this: https://twitter.com/cloudqueryio/status/1654088502957555713. We would love it if you could either retweet it or publish it on the Arrow Twitter account; it will help get some more visibility :)

