Posted to issues@spark.apache.org by "Michael Armbrust (JIRA)" <ji...@apache.org> on 2014/08/27 10:10:58 UTC

[jira] [Commented] (SPARK-2721) Fix MapType compatibility issues with reading Parquet datasets

    [ https://issues.apache.org/jira/browse/SPARK-2721?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14112011#comment-14112011 ] 

Michael Armbrust commented on SPARK-2721:
-----------------------------------------

I think this was fixed by [SPARK-3036].  Please reopen if you are still having problems.

> Fix MapType compatibility issues with reading Parquet datasets
> --------------------------------------------------------------
>
>                 Key: SPARK-2721
>                 URL: https://issues.apache.org/jira/browse/SPARK-2721
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 1.0.1
>            Reporter: Robbie Russo
>             Fix For: 1.1.0
>
>
> Parquet-thrift (and most likely other Parquet implementations) supports null values in a map. As a result, any Thrift-generated Parquet file that contains a map is unreadable by Spark SQL, because of the following code in parquet-thrift, which generates the schema for maps and marks the value field as OPTIONAL:
> {code:title=parquet.thrift.ThriftSchemaConverter.java|borderStyle=solid}
>   @Override
>   public void visit(ThriftType.MapType mapType) {
>     final ThriftField mapKeyField = mapType.getKey();
>     final ThriftField mapValueField = mapType.getValue();
>     //save env for map
>     String mapName = currentName;
>     Type.Repetition mapRepetition = currentRepetition;
>     //=========handle key
>     currentFieldPath.push(mapKeyField);
>     currentName = "key";
>     currentRepetition = REQUIRED;
>     mapKeyField.getType().accept(this);
>     Type keyType = currentType;//currentType is the already converted type
>     currentFieldPath.pop();
>     //=========handle value
>     currentFieldPath.push(mapValueField);
>     currentName = "value";
>     currentRepetition = OPTIONAL;
>     mapValueField.getType().accept(this);
>     Type valueType = currentType;
>     currentFieldPath.pop();
>     if (keyType == null && valueType == null) {
>       currentType = null;
>       return;
>     }
>     if (keyType == null && valueType != null)
>       throw new ThriftProjectionException("key of map is not specified in projection: " + currentFieldPath);
>     //restore Env
>     currentName = mapName;
>     currentRepetition = mapRepetition;
>     currentType = ConversionPatterns.mapType(currentRepetition, currentName,
>             keyType,
>             valueType);
>   }
> {code}
> This causes an error on the Spark side when the toDataType function reaches the following step, which asserts that both the key and the value have repetition REQUIRED:
> {code:title=org.apache.spark.sql.parquet.ParquetTypes.scala|borderStyle=solid}
>         case ParquetOriginalType.MAP => {
>           assert(
>             !groupType.getFields.apply(0).isPrimitive,
>             "Parquet Map type malformatted: expected nested group for map!")
>           val keyValueGroup = groupType.getFields.apply(0).asGroupType()
>           assert(
>             keyValueGroup.getFieldCount == 2,
>             "Parquet Map type malformatted: nested group should have 2 (key, value) fields!")
>           val keyType = toDataType(keyValueGroup.getFields.apply(0))
>           assert(keyValueGroup.getFields.apply(0).getRepetition == Repetition.REQUIRED)
>           val valueType = toDataType(keyValueGroup.getFields.apply(1))
>           assert(keyValueGroup.getFields.apply(1).getRepetition == Repetition.REQUIRED)
>           new MapType(keyType, valueType)
>         }
> {code}
> Currently I have modified parquet-thrift to use repetition REQUIRED for map values, just so that Spark SQL can read the Parquet files, since we don't actually use null values in our maps. However, we would prefer to use parquet-thrift and Spark SQL out of the box and have them work together with our existing Thrift data types, without having to modify dependencies.
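
The mismatch described in the quoted report can be reproduced in isolation. The following is a minimal, self-contained sketch and not code from Spark or parquet-thrift: the class MapRepetitionSketch, its Field and MapInfo types, and the strict/lenient methods are all hypothetical stand-ins. It contrasts a reader that insists on a REQUIRED value field (as the quoted assertions do) with one possible relaxation that accepts an OPTIONAL value and records it as nullable. Whether the fix in SPARK-3036 takes exactly this shape is not shown in this thread.

```java
final class MapRepetitionSketch {

    /** Stand-in for Parquet's repetition; only the two values discussed above. */
    enum Repetition { REQUIRED, OPTIONAL }

    /** Hypothetical, simplified schema field: a name plus its repetition. */
    static final class Field {
        final String name;
        final Repetition repetition;

        Field(String name, Repetition repetition) {
            this.name = name;
            this.repetition = repetition;
        }
    }

    /** Hypothetical result of reading a map schema: may the values be null? */
    static final class MapInfo {
        final boolean valueContainsNull;

        MapInfo(boolean valueContainsNull) {
            this.valueContainsNull = valueContainsNull;
        }
    }

    /** Strict reader: mirrors the two assertions in the quoted toDataType branch. */
    static MapInfo strict(Field key, Field value) {
        if (key.repetition != Repetition.REQUIRED) {
            throw new IllegalStateException("map key must be REQUIRED");
        }
        if (value.repetition != Repetition.REQUIRED) {
            throw new IllegalStateException("map value must be REQUIRED");
        }
        return new MapInfo(false);
    }

    /** Lenient reader (an assumption, one possible relaxation): keys stay REQUIRED,
     *  an OPTIONAL value is accepted and recorded as nullable. */
    static MapInfo lenient(Field key, Field value) {
        if (key.repetition != Repetition.REQUIRED) {
            throw new IllegalStateException("map key must be REQUIRED");
        }
        return new MapInfo(value.repetition == Repetition.OPTIONAL);
    }

    public static void main(String[] args) {
        // The shape parquet-thrift emits for a map: REQUIRED key, OPTIONAL value.
        Field key = new Field("key", Repetition.REQUIRED);
        Field value = new Field("value", Repetition.OPTIONAL);

        System.out.println("lenient valueContainsNull = "
                + lenient(key, value).valueContainsNull);
        try {
            strict(key, value);
        } catch (IllegalStateException e) {
            System.out.println("strict reader rejects schema: " + e.getMessage());
        }
    }
}
```

With the REQUIRED key and OPTIONAL value that parquet-thrift writes, the strict variant throws while the lenient variant reports that values may be null. The reporter's workaround corresponds to changing the writer so that the strict variant passes.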


