Posted to commits@beam.apache.org by da...@apache.org on 2024/03/04 19:02:16 UTC
(beam) branch master updated: Duet AI data encoding prompts (no links) (#30420)
This is an automated email from the ASF dual-hosted git repository.
damccorm pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/beam.git
The following commit(s) were added to refs/heads/master by this push:
new fb58cfc0cad Duet AI data encoding prompts (no links) (#30420)
fb58cfc0cad is described below
commit fb58cfc0cad657036fd09e6d33008b4c064a04ca
Author: Daria Bezkorovaina <99...@users.noreply.github.com>
AuthorDate: Mon Mar 4 19:02:08 2024 +0000
Duet AI data encoding prompts (no links) (#30420)
* Create 32_data_encoding.md
* Update 32_data_encoding.md
* Create 33_coders_data_encoding.md
* Update 33_coders_data_encoding.md
* Create 34_change_coders_data_encoding.md
* Update 34_change_coders_data_encoding.md
* Update 34_change_coders_data_encoding.md
* Update README.md
* Update 01_io_kafka.md
Nits to avoid passive voice
* Update 02_io_pubsub.md
* Update 02_io_pubsub.md
Nits and minor typos
* Update 02_io_pubsub.md
* Update 03_io_bigquery.md
* Update 04_io_bigtable.md
* Update 04_io_bigtable.md
* Update 05_io_spanner.md
* Update 06_io_tfrecord.md
* Update 07_io_json.md
* Update 08_io_csv.md
* Update 09_io_avro.md
* Update 10_io_parquet.md
* Update 11_io_jdbc.md
* Update 06_io_tfrecord.md
* Update 07_io_json.md
* Update 08_io_csv.md
* Update 09_io_avro.md
* Update 09_io_avro.md
* Update 10_io_parquet.md
* Update 11_io_jdbc.md
* Rename 33_coders_data_encoding.md to 48_coders_data_encoding.md
* Rename 34_change_coders_data_encoding.md to 49_change_coders_data_encoding.md
* Rename 48_coders_data_encoding.md to 34_coders_data_encoding.md
* Rename 34_coders_data_encoding.md to 33_coders_data_encoding.md
* Rename 49_change_coders_data_encoding.md to 34_change_coders_data_encoding.md
* Update learning/prompts/documentation-lookup-nolinks/34_change_coders_data_encoding.md
Implement PR review comments
Co-authored-by: Danny McCormick <da...@google.com>
---------
Co-authored-by: Danny McCormick <da...@google.com>
---
learning/prompts/README.md | 6 +-
learning/prompts/code-generation/01_io_kafka.md | 4 +-
learning/prompts/code-generation/02_io_pubsub.md | 17 ++--
learning/prompts/code-generation/03_io_bigquery.md | 7 +-
learning/prompts/code-generation/04_io_bigtable.md | 8 +-
learning/prompts/code-generation/05_io_spanner.md | 5 +-
learning/prompts/code-generation/06_io_tfrecord.md | 4 +-
learning/prompts/code-generation/07_io_json.md | 7 +-
learning/prompts/code-generation/08_io_csv.md | 6 +-
learning/prompts/code-generation/09_io_avro.md | 5 +-
learning/prompts/code-generation/10_io_parquet.md | 6 +-
learning/prompts/code-generation/11_io_jdbc.md | 7 +-
.../32_data_encoding.md | 15 +++
.../33_coders_data_encoding.md | 35 +++++++
.../34_change_coders_data_encoding.md | 103 +++++++++++++++++++++
15 files changed, 198 insertions(+), 37 deletions(-)
diff --git a/learning/prompts/README.md b/learning/prompts/README.md
index ea5d7d8bd79..b8cce71b794 100644
--- a/learning/prompts/README.md
+++ b/learning/prompts/README.md
@@ -48,7 +48,7 @@ Features of a good response:
- Starts with a brief introduction that explains the code sample.
- Includes information about how to find the reference documentation.
- Includes a link to the list of code samples.
-- Provides well documented code. Consider including an example of what the return result looks like.
+- Provides well-documented code. Consider including an example of what the execution result looks like.
- Follows up with the user to ensure they don’t continue needlessly with false responses.
@@ -56,7 +56,7 @@ Features of a good response:
Features of a good response:
- Starts with a short overall description that tries to answer the question in the prompt.
- Grounds the algorithm in any well-known context, if appropriate. For example, this is an implementation of X, a well-known algorithm to do Y.
-- Discusses the variables in the snippet, and what their purpose is relative to the runtime.
+- Discusses the variables in the snippet and their purpose relative to the runtime.
- Discusses runtime and memory storage complexity.
- Notes any interesting features of the code, or opportunities for improvement (optimizations, refactoring, syntax best practices, etc.)
@@ -73,4 +73,4 @@ learning/prompts/
│ ├── 01_basic_learning_apache_beam.md
│ └── ...
└── README.md
-```
\ No newline at end of file
+```
diff --git a/learning/prompts/code-generation/01_io_kafka.md b/learning/prompts/code-generation/01_io_kafka.md
index e169e012ef4..9b225c85638 100644
--- a/learning/prompts/code-generation/01_io_kafka.md
+++ b/learning/prompts/code-generation/01_io_kafka.md
@@ -2,7 +2,7 @@ Prompt:
Write the Python code to read data from Apache Kafka using Apache Beam.
Response:
-You can read data from [Apache Kafka](https://kafka.apache.org/) using the `ReadFromKafka` transform. The following Python code reads a topic `topic` from a Kafka server `bootstrap_servers`. The values for `topic` and `bootstrap_servers` are provided as command-line arguments. The data is logged to the console.
+Your Apache Beam pipeline can read data from [Apache Kafka](https://kafka.apache.org/) using the `ReadFromKafka` transform. The following Python code reads a topic `topic` from a Kafka server specified in `bootstrap_servers`. The values for `topic` and `bootstrap_servers` are provided as command-line arguments. The Apache Beam pipeline reads the data from a Kafka topic and logs it to the console.
```python
import logging
@@ -43,4 +43,4 @@ with beam.Pipeline(options=options) as p:
```
For more information about how to use the KafkaIO connector with the Python SDK for Apache Beam, see the [KafkaIO connector documentation](https://beam.apache.org/releases/pydoc/current/apache_beam.io.kafka.html).
-For samples that show common pipeline configurations to parse pipeline options from command-line arguments, see [Pipeline option patterns](https://beam.apache.org/documentation/patterns/pipeline-options/).
\ No newline at end of file
+For samples that show common pipeline configurations to parse pipeline options from command-line arguments, see [Pipeline option patterns](https://beam.apache.org/documentation/patterns/pipeline-options/).
diff --git a/learning/prompts/code-generation/02_io_pubsub.md b/learning/prompts/code-generation/02_io_pubsub.md
index 90856e8956f..d5e2e2d833f 100644
--- a/learning/prompts/code-generation/02_io_pubsub.md
+++ b/learning/prompts/code-generation/02_io_pubsub.md
@@ -1,9 +1,10 @@
Prompt:
-Write the python code to read messages from a Pub/Sub subscription.
+Write the Python code to read messages from a Pub/Sub subscription.
+
Response:
-You can read messages from a Pub/Sub subscription or topic using the `ReadFromPubSub` transform. Pub/Sub is only supported in streaming pipelines.
+Your Apache Beam pipeline can read messages from a Pub/Sub subscription or topic using the `ReadFromPubSub` transform. Pub/Sub is only supported in streaming pipelines.
-The following Python code reads messages from a Pub/Sub subscription. The subscription is provided as a command line argument. The messages are logged to the console:
+The following Python code reads messages from a Pub/Sub subscription and logs the data to the console. The subscription is provided as a command line argument:
```python
import logging
@@ -14,9 +15,8 @@ from apache_beam.io import ReadFromPubSub
from apache_beam.options.pipeline_options import PipelineOptions
class PubSubReadOptions(PipelineOptions):
-"""
-Configure pipeline options for PubSub read transform.
-"""
+"""Configure pipeline options for Pub/Sub read transform."""
+
@classmethod
def _add_argparse_args(cls, parser):
parser.add_argument(
@@ -42,10 +42,11 @@ if __name__ == '__main__':
read_subscription()
```
-Reading messages directly from a topic is also supported. A temporary subscription will be created automatically.
+
+Apache Beam also supports reading messages directly from a topic. A temporary subscription will be created automatically.
The messages could be returned as a byte string or as `PubsubMessage` objects. This behavior is controlled by the `with_attributes` parameter.
For more information, see the [Pub/Sub I/O transform documentation](https://beam.apache.org/releases/pydoc/current/apache_beam.io.gcp.pubsub.html).
-For a common pattern for configuring pipeline options see the [Pipeline option pattern](https://beam.apache.org/documentation/patterns/pipeline-options/).
\ No newline at end of file
+For a common pattern for configuring pipeline options, see [Pipeline option patterns](https://beam.apache.org/documentation/patterns/pipeline-options/).
diff --git a/learning/prompts/code-generation/03_io_bigquery.md b/learning/prompts/code-generation/03_io_bigquery.md
index b13be4fd355..b31b434ccdf 100644
--- a/learning/prompts/code-generation/03_io_bigquery.md
+++ b/learning/prompts/code-generation/03_io_bigquery.md
@@ -2,7 +2,7 @@ Prompt:
Write the Python code to read data from BigQuery using Apache Beam.
Response:
-You can read data from BigQuery using the `ReadFromBigQuery` transform. The following Python code reads data from a BigQuery table. The table name is provided as the command-line argument `table`. The data is logged to the console.
+Your Apache Beam pipeline can read data from BigQuery using the `ReadFromBigQuery` transform. The following Python code reads data from a BigQuery table and logs it to the console. The table name is provided as the command-line argument `table`.
```python
import logging
@@ -39,7 +39,8 @@ if __name__ == '__main__':
logging.getLogger().setLevel(logging.INFO)
read_table()
```
-The `ReadFromBigQuery` transform also supports reading from a BigQuery query. The following Python code reads data from a BigQuery table using a SQL query. The query is provided as the command-line argument `query`. The data is logged to the console.
+
+The `ReadFromBigQuery` transform also supports reading from a BigQuery query. The following Python code reads data from a BigQuery table using a SQL query and logs it to the console. The query is provided as the command-line argument `query`.
```python
with beam.Pipeline(options=options) as p:
@@ -48,7 +49,7 @@ with beam.Pipeline(options=options) as p:
| Map(logging.info)
```
-For more information, see the [BigQueryIO connector documentation](https://beam.apache.org/releases/pydoc/current/apache_beam.io.gcp.bigquery.html).
+For more information, see the [BigQuery I/O connector documentation](https://beam.apache.org/releases/pydoc/current/apache_beam.io.gcp.bigquery.html).
For samples that show common pipeline configurations, see [Pipeline option patterns](https://beam.apache.org/documentation/patterns/pipeline-options/).
diff --git a/learning/prompts/code-generation/04_io_bigtable.md b/learning/prompts/code-generation/04_io_bigtable.md
index 22ea5ebcef9..e4b09153a69 100644
--- a/learning/prompts/code-generation/04_io_bigtable.md
+++ b/learning/prompts/code-generation/04_io_bigtable.md
@@ -2,7 +2,7 @@ Prompt:
Write the Python code to read data from Bigtable using Apache Beam.
Response:
-You can read data from [Cloud Bigtable](https://cloud.google.com/bigtable) using the `ReadFromBigtable` transform. The following Python code reads a table `table_id` from a Bigtable instance `instance_id`. The `table_id` and `instance_id` are provided as command-line arguments. The data is logged to the console.
+Your Apache Beam pipeline can read data from [Cloud Bigtable](https://cloud.google.com/bigtable) using the `ReadFromBigtable` transform. The following Python code reads a table `table_id` from a Bigtable instance `instance_id` and logs the data to the console. The `table_id` and `instance_id` are provided as command-line arguments.
```python
import logging
@@ -44,7 +44,7 @@ def run():
options = BigtableOptions()
with beam.Pipeline(options=options) as p:
- #this pipeline reads from a BigQuery table and logs the data to the console
+ #this pipeline reads from a Bigtable table and logs the data to the console
output = (p | "Read from Bigtable" >> ReadFromBigtable(
project_id=options.project_id,
instance_id=options.instance_id,
@@ -58,9 +58,9 @@ if __name__ == "__main__":
logging.getLogger().setLevel(logging.INFO)
run()
```
-The `ReadFromBigtable` transform returns a `PCollection` of `PartialRowData` objects, each representing a Bigtable row. For more information about this row object, see [PartialRowData (row_key)](https://cloud.google.com/python/docs/reference/bigtable/latest/row#class-googlecloudbigtablerowpartialrowdatarowkey)
+The `ReadFromBigtable` transform returns a `PCollection` of `PartialRowData` objects, each representing a Bigtable row. For more information about this row object, see [PartialRowData (row_key)](https://cloud.google.com/python/docs/reference/bigtable/latest/row#class-googlecloudbigtablerowpartialrowdatarowkey).
-For more information, see the [BigTableIO connector documentation](https://beam.apache.org/releases/pydoc/current/apache_beam.io.gcp.bigtableio.html).
+For more information, see the [Bigtable I/O connector documentation](https://beam.apache.org/releases/pydoc/current/apache_beam.io.gcp.bigtableio.html).
For samples that show common pipeline configurations, see [Pipeline option patterns](https://beam.apache.org/documentation/patterns/pipeline-options/).
diff --git a/learning/prompts/code-generation/05_io_spanner.md b/learning/prompts/code-generation/05_io_spanner.md
index 24453ff01be..991347a7ef3 100644
--- a/learning/prompts/code-generation/05_io_spanner.md
+++ b/learning/prompts/code-generation/05_io_spanner.md
@@ -2,7 +2,7 @@ Prompt:
Write the Python code to read data from Cloud Spanner using Apache Beam.
Response:
-You can read data from [Cloud Spanner](https://cloud.google.com/spanner) using the Apache Beam [SpannerIO connector](https://beam.apache.org/releases/pydoc/current/apache_beam.io.gcp.spanner.html). The SpannerIO connector lets you read data from and write data to Cloud Spanner. Cloud Spanner is supported in both batch pipelines and streaming pipelines.
+Your Apache Beam pipeline can read data from [Cloud Spanner](https://cloud.google.com/spanner) using the Apache Beam [SpannerIO connector](https://beam.apache.org/releases/pydoc/current/apache_beam.io.gcp.spanner.html). The SpannerIO connector lets you read data from and write data to Cloud Spanner. Cloud Spanner is supported in both batch pipelines and streaming pipelines.
```python
import logging
@@ -55,9 +55,10 @@ with beam.Pipeline(options=options) as p:
The `ReadFromSpanner` transform is a built-in Apache Beam transform that reads data from a Cloud Spanner table. The `ReadFromSpanner` transform returns a `PCollection` of `NamedTuple` objects. The `NamedTuple` object is a Python class that represents a row in a Cloud Spanner table.
Registering a coder for `NamedTuple` is required to use `NamedTuple` as a row type:
+
```python
coders.registry.register_coder(ExampleRow, coders.RowCoder)
```
For more information about how to register a coder for a custom type, see [Data encoding and type safety](https://beam.apache.org/documentation/programming-guide/#data-encoding-and-type-safety).
-The `SpannerOptions` class defines the command-line arguments `project_id`, `instance_id` and `database_id`, which are used to configure the `ReadFromSpanner` transform. These arguments are parsed from the command line using [pipeline options](https://beam.apache.org/documentation/patterns/pipeline-options/).
\ No newline at end of file
+The `SpannerOptions` class defines the command-line arguments `project_id`, `instance_id`, and `database_id`, which are used to configure the `ReadFromSpanner` transform. These arguments are parsed from the command line using [pipeline options](https://beam.apache.org/documentation/patterns/pipeline-options/).
diff --git a/learning/prompts/code-generation/06_io_tfrecord.md b/learning/prompts/code-generation/06_io_tfrecord.md
index 7f36540b97b..873d75d759c 100644
--- a/learning/prompts/code-generation/06_io_tfrecord.md
+++ b/learning/prompts/code-generation/06_io_tfrecord.md
@@ -2,7 +2,7 @@ Prompt:
Write the Python code to read data from a TensorFlow TFRecord file using Apache Beam.
Response:
-You can read data from a TensorFlow [TFRecord file](https://www.tensorflow.org/api_docs/python/tf/data/TFRecordDataset) using the Apache Beam [TFRecordIO connector](https://beam.apache.org/releases/pydoc/current/apache_beam.io.tfrecordio.html). The TFRecordIO connector lets you read and write data in `TFRecord` format.
+Your Apache Beam pipeline can read data from a TensorFlow [TFRecord file](https://www.tensorflow.org/api_docs/python/tf/data/TFRecordDataset) using the Apache Beam [TFRecordIO connector](https://beam.apache.org/releases/pydoc/current/apache_beam.io.tfrecordio.html). The TFRecordIO connector lets you read and write data in `TFRecord` format.
```python
import logging
@@ -55,6 +55,6 @@ if __name__ == "__main__":
The `TFRecordOptions` class defines the command-line argument `file_pattern`, which is used to configure the `ReadFromTFRecord` transform. This argument is parsed from the command line using [pipeline options](https://beam.apache.org/documentation/patterns/pipeline-options/).
-The `ReadFromTFRecord` transform is a built-in Apache Beam transform that reads data from a `TFRecord` file. By default, `TFRecordIO` transforms use `coders.BytesCoder()`. For more information, see [ReadFromTFRecord transform](https://beam.apache.org/releases/pydoc/current/apache_beam.io.tfrecordio.html#apache_beam.io.tfrecordio.ReadFromTFRecord).
+The `ReadFromTFRecord` transform is a built-in Apache Beam transform that reads data from a `TFRecord` file. By default, `TFRecordIO` transforms use `coders.BytesCoder()`. For more information, see [`ReadFromTFRecord` transform](https://beam.apache.org/releases/pydoc/current/apache_beam.io.tfrecordio.html#apache_beam.io.tfrecordio.ReadFromTFRecord).
diff --git a/learning/prompts/code-generation/07_io_json.md b/learning/prompts/code-generation/07_io_json.md
index 6c142b49c45..6abe29e94ab 100644
--- a/learning/prompts/code-generation/07_io_json.md
+++ b/learning/prompts/code-generation/07_io_json.md
@@ -2,7 +2,7 @@ Prompt:
Write the Python code to read data from JSON files using Apache Beam.
Response:
-You can read data from [JSON files](https://www.json.org/) using the Apache Beam [TextIO connector](https://beam.apache.org/releases/pydoc/current/apache_beam.io.textio.html).
+Your Apache Beam pipeline can read data from [JSON files](https://www.json.org/) using the Apache Beam [TextIO connector](https://beam.apache.org/releases/pydoc/current/apache_beam.io.textio.html).
```python
import logging
@@ -42,7 +42,8 @@ if __name__ == "__main__":
logging.getLogger().setLevel(logging.INFO)
run()
```
-The `JsonOptions` class is a custom class that extends the `PipelineOptions` object. It is used to parse pipeline options from command-line arguments. This code uses [pipeline options](https://beam.apache.org/documentation/patterns/pipeline-options/) to parse command-line arguments.
-The [ReadFromJson](https://beam.apache.org/releases/pydoc/current/apache_beam.io.textio.html#apache_beam.io.textio.ReadFromJson) transform is a built-in Apache Beam transform that reads data from a JSON file.
+The `JsonOptions` class is a custom class that extends the `PipelineOptions` object. This code uses [pipeline options](https://beam.apache.org/documentation/patterns/pipeline-options/) to parse command-line arguments.
+
+The [`ReadFromJson`](https://beam.apache.org/releases/pydoc/current/apache_beam.io.textio.html#apache_beam.io.textio.ReadFromJson) transform is a built-in Apache Beam transform that reads data from a JSON file.
diff --git a/learning/prompts/code-generation/08_io_csv.md b/learning/prompts/code-generation/08_io_csv.md
index 3888e15bb70..7db9f0bf2af 100644
--- a/learning/prompts/code-generation/08_io_csv.md
+++ b/learning/prompts/code-generation/08_io_csv.md
@@ -2,7 +2,7 @@ Prompt:
Write the Python code to read data from CSV files using Apache Beam.
Response:
-You can read data from [CSV files](https://en.wikipedia.org/wiki/Comma-separated_values) using the Apache Beam [TextIO connector](https://beam.apache.org/releases/pydoc/current/apache_beam.io.textio.html).
+Your Apache Beam pipeline can read data from [CSV files](https://en.wikipedia.org/wiki/Comma-separated_values) using the [TextIO connector](https://beam.apache.org/releases/pydoc/current/apache_beam.io.textio.html).
```python
import logging
@@ -42,6 +42,6 @@ if __name__ == "__main__":
run()
```
-The `CsvOptions` class is a custom class that extends the `PipelineOptions` object. It is used to parse pipeline options from command-line arguments. This code uses [pipeline options](https://beam.apache.org/documentation/patterns/pipeline-options/) to parse command-line arguments.
+The `CsvOptions` class is a custom class that extends the `PipelineOptions` object. This code uses [pipeline options](https://beam.apache.org/documentation/patterns/pipeline-options/) to parse command-line arguments.
-The [ReadFromCsv transform](https://beam.apache.org/releases/pydoc/current/apache_beam.io.textio.html#apache_beam.io.textio.ReadFromCsv) is a built-in Apache Beam transform that reads data from a CSV file.
\ No newline at end of file
+The [`ReadFromCsv` transform](https://beam.apache.org/releases/pydoc/current/apache_beam.io.textio.html#apache_beam.io.textio.ReadFromCsv) is a built-in Apache Beam transform that reads data from a CSV file.
diff --git a/learning/prompts/code-generation/09_io_avro.md b/learning/prompts/code-generation/09_io_avro.md
index 2ccf5427822..5721bf3c301 100644
--- a/learning/prompts/code-generation/09_io_avro.md
+++ b/learning/prompts/code-generation/09_io_avro.md
@@ -2,7 +2,7 @@ Prompt:
Write the Python code to read data from Apache Avro files using Apache Beam.
Response:
-You can read data from [Apache Avro](https://avro.apache.org/) files using the Apache Beam [AvroIO connector](https://beam.apache.org/releases/pydoc/current/apache_beam.io.avroio.html).
+Your Apache Beam pipeline can read data from [Apache Avro](https://avro.apache.org/) files using the [AvroIO connector](https://beam.apache.org/releases/pydoc/current/apache_beam.io.avroio.html).
```python
import logging
@@ -35,8 +35,9 @@ if __name__ == "__main__":
logging.getLogger().setLevel(logging.INFO)
run()
```
+
The `ReadAvroOptions` class is used to define the command-line argument `--path`, which specifies the path to the Avro file. This code uses [pipeline options](https://beam.apache.org/documentation/patterns/pipeline-options/) for the required `path` argument.
-A Beam pipeline is created using the `ReadAvroOptions` class and the [ReadFromAvro transform](https://beam.apache.org/releases/pydoc/current/apache_beam.io.avroio.html#apache_beam.io.avroio.ReadFromAvro) is used to read data from the Avro file.
+The code creates an Apache Beam pipeline using the `ReadAvroOptions` class to set the Avro file path and the [`ReadFromAvro` transform](https://beam.apache.org/releases/pydoc/current/apache_beam.io.avroio.html#apache_beam.io.avroio.ReadFromAvro) to read data from the file.
diff --git a/learning/prompts/code-generation/10_io_parquet.md b/learning/prompts/code-generation/10_io_parquet.md
index 5493c5e14d5..d84f9f3e74f 100644
--- a/learning/prompts/code-generation/10_io_parquet.md
+++ b/learning/prompts/code-generation/10_io_parquet.md
@@ -2,7 +2,8 @@ Prompt:
Write the Python code to read data from Apache Parquet files using Apache Beam.
Response:
-You can read data from [Apache Parquet](https://parquet.apache.org/) files using the Apache Beam [ParquetIO connector](https://beam.apache.org/releases/pydoc/current/apache_beam.io.parquetio.html).
+Your Apache Beam pipeline can read data from [Apache Parquet](https://parquet.apache.org/) files using the [ParquetIO connector](https://beam.apache.org/releases/pydoc/current/apache_beam.io.parquetio.html).
+
```python
import logging
@@ -34,6 +35,7 @@ if __name__ == "__main__":
logging.getLogger().setLevel(logging.INFO)
run()
```
+
The `ReadParquetOptions` class is used to define the command-line argument `--path`, which specifies the path to the Parquet file. This code uses [pipeline options](https://beam.apache.org/documentation/patterns/pipeline-options/) for the required `path` argument.
-A Beam pipeline is created using the `ReadParquetOptions` class and the [ReadFromParquet transform](https://beam.apache.org/releases/pydoc/current/apache_beam.io.parquetio.html#apache_beam.io.parquetio.ReadFromParquet) is used to read data from the Parquet file.
+The code creates an Apache Beam pipeline using the `ReadParquetOptions` class to set the Parquet file path and the [`ReadFromParquet` transform](https://beam.apache.org/releases/pydoc/current/apache_beam.io.parquetio.html#apache_beam.io.parquetio.ReadFromParquet) to read data from the file.
diff --git a/learning/prompts/code-generation/11_io_jdbc.md b/learning/prompts/code-generation/11_io_jdbc.md
index 9d8a98d6e64..b833e90eeef 100644
--- a/learning/prompts/code-generation/11_io_jdbc.md
+++ b/learning/prompts/code-generation/11_io_jdbc.md
@@ -2,7 +2,8 @@ Prompt:
Write the Python code to read data from Oracle JDBC using Apache Beam.
Response:
-You can read data from an Oracle Java Database Connectivity (JDBC) enabled database using the Apache Beam [JdbcIO connector](https://beam.apache.org/releases/pydoc/current/apache_beam.io.jdbc.html).
+Your Apache Beam pipeline can read data from an Oracle Java Database Connectivity (JDBC) enabled database using the [JdbcIO connector](https://beam.apache.org/releases/pydoc/current/apache_beam.io.jdbc.html).
+
```python
import logging
import apache_beam as beam
@@ -70,6 +71,6 @@ if __name__ == "__main__":
run()
```
-The `JdbcOptions` class is a custom class that extends the `PipelineOptions` object. It is used to parse pipeline options from command-line arguments. This code uses [pipeline options](https://beam.apache.org/documentation/patterns/pipeline-options/).
+The `JdbcOptions` class is a custom class that extends the `PipelineOptions` object. This code uses [pipeline options](https://beam.apache.org/documentation/patterns/pipeline-options/) to parse command line arguments.
-A Beam pipeline is created using the `JdbcOptions` class and the [ReadFromJdbc transform](https://beam.apache.org/releases/pydoc/current/apache_beam.io.jdbc.html#apache_beam.io.jdbc.ReadFromJdbc) is used to read data from the database via JDBC.
+The code creates an Apache Beam pipeline using the `JdbcOptions` class to set the JDBC connection information and the [`ReadFromJdbc` transform](https://beam.apache.org/releases/pydoc/current/apache_beam.io.jdbc.html#apache_beam.io.jdbc.ReadFromJdbc) to read data from the database via JDBC.
diff --git a/learning/prompts/documentation-lookup-nolinks/32_data_encoding.md b/learning/prompts/documentation-lookup-nolinks/32_data_encoding.md
new file mode 100644
index 00000000000..34617ac965e
--- /dev/null
+++ b/learning/prompts/documentation-lookup-nolinks/32_data_encoding.md
@@ -0,0 +1,15 @@
+Prompt:
+How does data encoding work in the Apache Beam model and why is it needed?
+
+Response:
+In Apache Beam, data encoding is a fundamental aspect of processing data in distributed environments. The model operates on the principle of encoding each element of a `PCollection` as a byte string.
+
+Data encoding is necessary to pass elements efficiently between distributed workers during the execution of an Apache Beam pipeline. Apache Beam is designed for distributed processing, where a pipeline may be executed on multiple VMs (workers) concurrently. During pipeline execution, workers collaborate to process data and often need to materialize the intermediate data in your `PCollection`. Communication between workers involves the exchange of encoded data over a network. The process [...]
+
+Apache Beam offers a data encoding mechanism with built-in support for commonly used types, as well as the flexibility to specify custom encodings.
+
+The Beam SDKs use the `Coder` objects to define how elements in a `PCollection` are encoded and decoded. These objects ensure type consistency and safety, guaranteeing that the encoding and decoding processes correctly interpret data based on its original data types.
+
+Every `PCollection` in an Apache Beam pipeline requires a corresponding `Coder`. In most cases, the Apache Beam SDKs can automatically infer a `Coder` for a `PCollection` based on its element type or the transform that produces it. However, in some cases, the pipeline author will need to specify a `Coder` explicitly or develop a `Coder` for their custom type.
+
+Apache Beam comes with many built-in `Coder` objects that work with a variety of standard data types. Users can also build custom coders or register a new default coder for a given type using the `CoderRegistry` object in their pipeline.
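The encoding principle described in the response above (each element becomes a byte string that can be sent between workers and decoded back) can be sketched in plain Python. This is only an illustration and does not use the Apache Beam API: `ToyVarIntCoder`, `ToyStrUtf8Coder`, and `round_trip` are hypothetical names, and the variable-length integer format here is an assumption chosen for the sketch, not a statement about Beam's wire format.

```python
class ToyVarIntCoder:
    """Hypothetical coder: variable-length encoding of non-negative ints."""

    def encode(self, value: int) -> bytes:
        if value < 0:
            raise ValueError("this sketch only handles non-negative integers")
        out = bytearray()
        while True:
            low_bits = value & 0x7F          # take the lowest 7 bits
            value >>= 7
            if value:
                out.append(low_bits | 0x80)  # high bit set: more bytes follow
            else:
                out.append(low_bits)         # high bit clear: last byte
                return bytes(out)

    def decode(self, encoded: bytes) -> int:
        result = 0
        for position, byte in enumerate(encoded):
            result |= (byte & 0x7F) << (7 * position)
        return result


class ToyStrUtf8Coder:
    """Hypothetical coder: strings as UTF-8 byte strings."""

    def encode(self, value: str) -> bytes:
        return value.encode("utf-8")

    def decode(self, encoded: bytes) -> str:
        return encoded.decode("utf-8")


def round_trip(coder, value):
    """Encode a value as one worker would, then decode it as another would."""
    wire_bytes = coder.encode(value)
    return coder.decode(wire_bytes)


if __name__ == "__main__":
    print(ToyVarIntCoder().encode(300))
    print(round_trip(ToyStrUtf8Coder(), "beam"))
```

Both toy coders are deterministic in the sense used later in this commit: the same value always yields the same bytes.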
diff --git a/learning/prompts/documentation-lookup-nolinks/33_coders_data_encoding.md b/learning/prompts/documentation-lookup-nolinks/33_coders_data_encoding.md
new file mode 100644
index 00000000000..72d6cbba987
--- /dev/null
+++ b/learning/prompts/documentation-lookup-nolinks/33_coders_data_encoding.md
@@ -0,0 +1,35 @@
+Prompt:
+What is a coder in Apache Beam?
+
+Response:
+In Apache Beam, a `Coder` is an object responsible for specifying how elements in a given `PCollection` are encoded and decoded into byte strings. The `Coder` objects provide type consistency and safety during distributed processing between workers, ensuring the correct interpretation of data based on its original types.
+
+Every `PCollection` in an Apache Beam pipeline requires a corresponding `Coder`. While the Apache Beam SDKs can automatically infer a `Coder` for a `PCollection` based on its element type or the producing transform, there are cases where explicit specification or custom `Coder` development is necessary. It is important to note that multiple `Coder` objects can exist for a single data type.
+
+The Apache Beam SDKs use various mechanisms to automatically infer the `Coder` for a `PCollection`. Each pipeline object contains a `CoderRegistry` object representing a mapping of language types to the default coder for those types.
+
+In the Apache Beam SDKs for Python and Java, the `Coder` type provides the necessary methods for encoding and decoding data. The SDKs offer various `Coder` subclasses that work with standard Python and Java types, available in the `apache_beam.coders` package for Python and the `org.apache.beam.sdk.coders` package for Java.
+
+By default, the Beam SDKs use the typehints (Python) or the type parameters (Java) from the transform's function object (such as `DoFn`) to infer the `Coder` for elements in a `PCollection`. For example, in Apache Beam Python SDK, a `ParDo` annotated with the typehints `@beam.typehints.with_input_types(int)` and `@beam.typehints.with_output_types(str)` indicates that it accepts `int` inputs and produces `str` outputs. The Python SDK automatically infers the default `Coder` for the output [...]
+
+You can use the `CoderRegistry` to look up the default coder for a given type or register a new default coder. The following tables show the default mappings of coders to standard types for any pipeline created using the Beam SDK for Java and Python:
+
+| **Java Type** | **Default Coder** |
+|---------------|-------------------|
+| Double | DoubleCoder |
+| Instant | InstantCoder |
+| Integer | VarIntCoder |
+| Iterable | IterableCoder |
+| KV | KvCoder |
+| List | ListCoder |
+| Map | MapCoder |
+| Long | VarLongCoder |
+| String | StringUtf8Coder |
+
+| **Python Type** | **Default Coder** |
+|-----------------|-------------------|
+| int | VarIntCoder |
+| float | FloatCoder |
+| str | StrUtf8Coder |
+| bytes | BytesCoder |
+| Tuple | TupleCoder |
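The registry idea described in the response above (a mapping from language types to the default coder for each type, which can be queried or overridden) can be sketched in plain Python. This is a toy model under stated assumptions, not the Beam implementation: `ToyCoderRegistry`, `Utf8Coder`, and `Latin1Coder` are hypothetical names, and only the method names `register_coder` and `get_coder` are borrowed from the text.

```python
class Utf8Coder:
    """Hypothetical coder: strings as UTF-8 bytes."""

    def encode(self, value: str) -> bytes:
        return value.encode("utf-8")

    def decode(self, encoded: bytes) -> str:
        return encoded.decode("utf-8")


class Latin1Coder:
    """Hypothetical alternative coder for the same type (str)."""

    def encode(self, value: str) -> bytes:
        return value.encode("latin-1")

    def decode(self, encoded: bytes) -> str:
        return encoded.decode("latin-1")


class ToyCoderRegistry:
    """Toy registry: maps a Python type to its default coder class."""

    def __init__(self):
        self._defaults = {}

    def register_coder(self, element_type, coder_class):
        # Registering again for the same type replaces the previous default.
        self._defaults[element_type] = coder_class

    def get_coder(self, element_type):
        if element_type not in self._defaults:
            raise LookupError(f"no default coder registered for {element_type!r}")
        return self._defaults[element_type]()


registry = ToyCoderRegistry()
registry.register_coder(str, Utf8Coder)

if __name__ == "__main__":
    print(type(registry.get_coder(str)).__name__)
    registry.register_coder(str, Latin1Coder)  # override the default for str
    print(type(registry.get_coder(str)).__name__)
```

The sketch also shows why more than one coder can exist for a single type: both toy coders handle `str`, and the registry only records which one is the default.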
diff --git a/learning/prompts/documentation-lookup-nolinks/34_change_coders_data_encoding.md b/learning/prompts/documentation-lookup-nolinks/34_change_coders_data_encoding.md
new file mode 100644
index 00000000000..3c3c4915630
--- /dev/null
+++ b/learning/prompts/documentation-lookup-nolinks/34_change_coders_data_encoding.md
@@ -0,0 +1,103 @@
+Prompt:
+How to get, set, and create new coders for my Apache Beam pipeline?
+
+Response:
+Every `PCollection` in an Apache Beam pipeline requires a corresponding `Coder`. In most cases, the Apache Beam SDKs can automatically infer a `Coder` for a `PCollection` based on its element type or the producing transform. However, in some instances, you may need to explicitly set a `Coder` or create a custom `Coder`.
+
+In the Apache Beam SDKs for Python and Java, the `Coder` type provides the necessary methods for encoding and decoding data. To get, set, or register a coder for a particular pipeline, you can access and modify the pipeline’s `CoderRegistry` object.
+
+The following examples demonstrate how to get, set, and create a new `Coder` in an Apache Beam pipeline using the Python and Java SDKs.
+
+**Python SDK:**
+
+In the Python SDK, you can use the following attribute and methods:
+* `coders.registry`: the `CoderRegistry` object, exposed as a module-level attribute of `apache_beam.coders`.
+* `CoderRegistry.get_coder`: retrieves the default `Coder` for a type.
+* `CoderRegistry.register_coder`: sets a new `Coder` for the target type.
+
+Here is an example illustrating how to set the default `Coder` in the Python SDK:
+
+```python
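+# Assumes `apache_beam` is imported and `BigEndianIntegerCoder` is a `Coder` subclass defined or imported elsewhere.
+apache_beam.coders.registry.register_coder(int, BigEndianIntegerCoder)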
+```
+
+This example registers `BigEndianIntegerCoder` as the default `Coder` for `int` values in the pipeline.
+
+For custom or complex nested data types, you can implement a custom coder for your pipeline. To create a new `Coder`, define a class that inherits from `Coder` and implements the following methods:
+* `encode`: takes input values and encodes them into byte strings.
+* `decode`: decodes the encoded byte string into its corresponding object.
+* `is_deterministic`: specifies whether this coder encodes values deterministically or not. A deterministic coder produces the same encoded representation of a given object every time, even if it is called on different workers at different moments. The method returns `True` or `False` based on your implementation.
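
To make the notion of determinism more concrete, here is a small sketch that is independent of Beam. The JSON-based encoding and the helper names `encode_unordered` and `encode_sorted` are assumptions chosen purely for illustration; they are not part of the Beam API. The sketch shows how two equal values can produce different bytes unless the encoding fixes an order.

```python
import json


def encode_unordered(record):
    # Hypothetical helper: key order follows dict insertion order,
    # so two equal dicts may be encoded to different byte strings.
    return json.dumps(record).encode("utf-8")


def encode_sorted(record):
    # Hypothetical helper: sorting the keys makes the encoded bytes
    # depend only on the content of the dict, not on how it was built.
    return json.dumps(record, sort_keys=True).encode("utf-8")


a = {"id": 1, "name": "x"}
b = {"name": "x", "id": 1}

same_value = a == b
unordered_matches = encode_unordered(a) == encode_unordered(b)
sorted_matches = encode_sorted(a) == encode_sorted(b)
```

A coder built on the first helper would have to return `False` from `is_deterministic`, while one built on the second could reasonably return `True`.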
+
+Here is an example of a custom `Coder` implementation in the Python SDK:
+
+```python
+from apache_beam.coders import Coder
+
+class CustomCoder(Coder):
+    def encode(self, value):
+        # Implementation for encoding 'value' into byte strings
+        pass
+
+    def decode(self, encoded):
+        # Implementation for decoding byte strings into the original object
+        pass
+
+    def is_deterministic(self):
+        # Specify whether this coder produces deterministic encodings
+        return True  # or False based on your implementation
+```
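
As one possible way to fill in this skeleton, the following sketch defines a coder that stores an integer as four big-endian bytes. The class name `BigEndianInt32Coder` and the fixed 32-bit width are assumptions made for illustration, not something the Beam SDK prescribes. The fallback to `object` exists only so that the sketch can be exercised without Beam installed.

```python
import struct

try:
    from apache_beam.coders import Coder
except ImportError:
    # Assumption for illustration: fall back to a plain base class so the
    # encode/decode logic can be tried out without Beam installed.
    Coder = object


class BigEndianInt32Coder(Coder):
    """Hypothetical coder that encodes an int as 4 big-endian bytes."""

    def encode(self, value):
        # ">i" means big-endian, signed, 32-bit integer.
        return struct.pack(">i", value)

    def decode(self, encoded):
        # struct.unpack always returns a tuple; take its only element.
        return struct.unpack(">i", encoded)[0]

    def is_deterministic(self):
        # The same integer always yields the same four bytes.
        return True


coder = BigEndianInt32Coder()
encoded = coder.encode(1024)
decoded = coder.decode(encoded)
```

A coder like this could then be registered with `register_coder`, as in the earlier example. Note that registering it for `int` would only be safe if every value fits in 32 bits, because `struct.pack` raises an error for larger values.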
+
+**Java SDK:**
+
+In the Java SDK, you can use the following methods:
+* `Pipeline.getCoderRegistry`: retrieves the pipeline’s `CoderRegistry` object.
+* `PCollection.getCoder`: retrieves the coder for an existing `PCollection`.
+* `CoderRegistry.getCoder`: retrieves the default `Coder` for a type.
+* `CoderRegistry.registerCoderForClass`: sets a new default `Coder` for the target type.
+
+Here is an example of how you can set the default `Coder` in the Java SDK:
+
+```java
+import org.apache.beam.sdk.Pipeline;
+import org.apache.beam.sdk.coders.BigEndianIntegerCoder;
+import org.apache.beam.sdk.coders.CoderRegistry;
+import org.apache.beam.sdk.options.PipelineOptions;
+import org.apache.beam.sdk.options.PipelineOptionsFactory;
+
+PipelineOptions options = PipelineOptionsFactory.create();
+Pipeline p = Pipeline.create(options);
+CoderRegistry cr = p.getCoderRegistry();
+cr.registerCoderForClass(Integer.class, BigEndianIntegerCoder.of());
+```
+
+In this example, you use the method `CoderRegistry.registerCoderForClass` to register `BigEndianIntegerCoder` for the target `Integer` type.
+
+For custom or complex nested data types, you can implement a custom coder for your pipeline. For this, the `Coder` class exposes the following key methods:
+* `encode`: takes input values and encodes them into byte strings.
+* `decode`: decodes the encoded byte string into its corresponding object.
+* `verifyDeterministic`: verifies that this coder produces deterministic encodings. A deterministic coder produces the same encoded representation of a given object every time, even if it is called on different workers at different moments. The method throws `NonDeterministicException` if the coder is not deterministic.
+
+Here’s an example of a custom `Coder` implementation in the Java SDK:
+
+```java
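+import org.apache.beam.sdk.coders.Coder;
+import org.apache.beam.sdk.coders.CoderException;
+import org.apache.beam.sdk.coders.StructuredCoder;
+import java.io.IOException;
+import java.io.InputStream;
+import java.io.OutputStream;
+import java.util.Collections;
+import java.util.List;
+
+public class CustomCoder extends StructuredCoder<YourType> {
+    @Override
+    public void encode(YourType value, OutputStream outStream) throws CoderException, IOException {
+        // Implementation for encoding 'value' into byte strings
+    }
+
+    @Override
+    public YourType decode(InputStream inStream) throws CoderException, IOException {
+        // Implementation for decoding byte strings into the original object
+        // Placeholder so that the skeleton compiles until decoding is implemented
+        throw new UnsupportedOperationException("decode is not implemented yet");
+    }
+
+    @Override
+    public List<? extends Coder<?>> getCoderArguments() {
+        // This coder is not parameterized by other coders
+        return Collections.emptyList();
+    }
+
+    @Override
+    public void verifyDeterministic() throws NonDeterministicException {
+        // Specify whether this coder produces deterministic encodings
+        // Throw NonDeterministicException if not deterministic
+    }
+}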
+```
+
+Replace `YourType` with the actual type for which you want to create a new `Coder`, and implement the necessary methods based on your encoding/decoding logic.