Posted to user@spark.apache.org by Benjamin Kim <bb...@gmail.com> on 2017/07/08 17:49:35 UTC

Glue-like Functionality

Has anyone seen AWS Glue? I was wondering if something similar is going to be built into Spark Structured Streaming. I like the Data Catalog idea for storing and tracking any data source/destination: it profiles the data to derive the schema and data types, it handles some automated schema evolution if the schema changes, and it leaves only the transformation logic to the ETL developer. I think some of this could enhance or simplify Structured Streaming.

For example, AWS S3 could be catalogued as a Data Source; in Structured Streaming, the Input DataFrame would be created like a SQL view based on the S3 Data Source; and the transform logic, if any, would just manipulate the data going from the Input DataFrame to the Result DataFrame, which is another view based on a catalogued Data Destination. This would relieve the ETL developer from caring about any Data Source or Destination. All server information, access credentials, data schemas, folder/directory structures, file formats, and any other properties could be stored away securely, accessible to only a select few.
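
To make it concrete, the streaming job an ETL developer writes today looks roughly like the sketch below (the bucket names, schema, and paths are made up). The point of the proposal is that everything except the two transform lines would come out of the catalog instead of being hard-coded:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._

object GlueLikeSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("glue-like-sketch").getOrCreate()
    import spark.implicits._

    // Everything below is what a Data Catalog would hand back for the
    // catalogued S3 source: location, format, schema.
    val inputSchema = new StructType()
      .add("user_id", StringType)
      .add("amount", DoubleType)
      .add("ts", TimestampType)

    // "Input DataFrame": a streaming view over the catalogued S3 Data Source.
    val input = spark.readStream
      .schema(inputSchema)
      .json("s3a://example-bucket/incoming/")   // hypothetical path

    // Transform logic: the only piece the ETL developer should have to write.
    val result = input
      .filter($"amount" > 0)
      .withColumn("day", to_date($"ts"))

    // "Result DataFrame" written out to the catalogued Data Destination.
    result.writeStream
      .format("parquet")
      .option("path", "s3a://example-bucket/cleaned/")            // hypothetical
      .option("checkpointLocation", "s3a://example-bucket/chk/job1/")
      .start()
      .awaitTermination()
  }
}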

I'm just curious to know if anyone has thought the same thing.

Cheers,
Ben
---------------------------------------------------------------------
To unsubscribe e-mail: user-unsubscribe@spark.apache.org


Re: Glue-like Functionality

Posted by Simon Kitching <si...@unbelievable-machine.com>.
This sounds similar to the Confluent Schema Registry and Kafka Connect.

The Schema Registry and Kafka Connect themselves are open source, but some of the datasource-specific adapters, and the GUIs to manage it all, are not (see the Confluent Enterprise Edition).

Note that the Schema Registry and Kafka Connect are generic tools, not Spark-specific.
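
To give a flavour: the registry is a small REST service that versions schemas per subject, so a job can look up the current schema instead of hard-coding it. A minimal sketch with the client library (the registry URL and subject name below are made up):

import io.confluent.kafka.schemaregistry.client.CachedSchemaRegistryClient

object SchemaLookup {
  def main(args: Array[String]): Unit = {
    // Hypothetical registry URL and subject; 100 is the size of the
    // client's local cache of schema ids.
    val client = new CachedSchemaRegistryClient("http://schema-registry:8081", 100)
    val meta = client.getLatestSchemaMetadata("events-value")
    println(s"id=${meta.getId}, version=${meta.getVersion}")
    println(meta.getSchema)   // the Avro schema as a JSON string
  }
}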

Regards, Simon


