Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2020/11/16 04:42:32 UTC

[GitHub] [spark] davidrabinowitz commented on pull request #30372: [SPARK-33172][SQL] Adding support for UserDefinedType for Spark SQL Code generator

davidrabinowitz commented on pull request #30372:
URL: https://github.com/apache/spark/pull/30372#issuecomment-727730750


   In order to verify it, first create a table in BigQuery in the following manner:
   ```
   bq load --source_format NEWLINE_DELIMITED_JSON <TABLE> vector_test.data.json vector_test.schema.json
   ```
   The files are:
   
   - vector_test.data.json:
   ```
   {"name":"row1","num":"1","vector":{"type":"1","indices":[],"values":[1,2,3]}}
   {"name":"row2","num":"2","vector":{"type":"1","indices":[],"values":[4,5,6]}}
   {"name":"row3","num":"3","vector":{"type":"1","indices":[],"values":[7,8,9]}}
   ```
   
   - vector_test.schema.json:
   ```
   [
     {
       "mode": "NULLABLE",
       "name": "name",
       "type": "STRING"
     },
     {
       "mode": "NULLABLE",
       "name": "num",
       "type": "INTEGER"
     },
     {
       "description": "{spark.type=vector}",
       "fields": [
         {
           "mode": "NULLABLE",
           "name": "type",
           "type": "INTEGER"
         },
         {
           "mode": "NULLABLE",
           "name": "size",
           "type": "INTEGER"
         },
         {
           "mode": "REPEATED",
           "name": "indices",
           "type": "INTEGER"
         },
         {
           "mode": "REPEATED",
           "name": "values",
           "type": "FLOAT"
         }
       ],
       "mode": "NULLABLE",
       "name": "vector",
       "type": "RECORD"
     }
   ]
   ```
   A GCP account is needed for this, but the amount of data and the operations involved are well within the free tier.
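   
   For reference, the nested `vector` RECORD mirrors the SQL serialization of Spark's `VectorUDT` (`type` 1 marks a dense vector, which is why `indices` is empty), and the `{spark.type=vector}` description is what tells the connector to surface the column as the vector UDT. As a rough sketch (the exact mapping is connector-internal, so take the details as assumptions), the schema reported on the Spark side should look like:
   ```
   import org.apache.spark.ml.linalg.SQLDataTypes
   import org.apache.spark.sql.types._
   
   // Expected Spark-side schema after the read: BigQuery INTEGER columns
   // surface as LongType, and the annotated RECORD becomes the vector UDT.
   val expectedSchema = StructType(Seq(
     StructField("name", StringType, nullable = true),
     StructField("num", LongType, nullable = true),
     StructField("vector", SQLDataTypes.VectorType, nullable = true)
   ))
   ```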
   
   Run `spark-shell --packages com.google.cloud.spark:spark-bigquery-with-dependencies_2.11:0.17.3` and enter the following commands:
   ```
   // Read through the DataSource V2 path, which exercises the code generator
   val df = spark.read.format("com.google.cloud.spark.bigquery.v2.BigQueryDataSourceV2").load("<TABLE>")
   df.schema   // schema is a parameterless method, so no parentheses
   df.show()
   ```
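   
   If the read succeeds, a quick sanity check (a sketch, assuming the mapping above) is to pull the column back as ML vectors; `row1` should print as `[1.0,2.0,3.0]`:
   ```
   import org.apache.spark.ml.linalg.Vector
   
   // Each row of the UDT column should materialize as an ML vector.
   df.select("vector").collect().foreach(row => println(row.getAs[Vector](0)))
   ```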
   
   Notice that when the format is changed to `bigquery`, another read path is used which does not rely on the code generator and therefore does not suffer from this issue.
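   
   For comparison, a minimal sketch of that alternate path (same `<TABLE>` placeholder as above):
   ```
   // The `bigquery` format takes a read path that does not go through
   // the code generator, so it works regardless of this fix.
   val dfV1 = spark.read.format("bigquery").load("<TABLE>")
   dfV1.show()
   ```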
   
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org