Posted to reviews@spark.apache.org by bersprockets <gi...@git.apache.org> on 2018/05/02 04:51:08 UTC

[GitHub] spark pull request #21073: [SPARK-23936][SQL] Implement map_concat

Github user bersprockets commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21073#discussion_r185392954
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala ---
    @@ -116,6 +117,161 @@ case class MapValues(child: Expression)
       override def prettyName: String = "map_values"
     }
     
    +/**
    + * Returns the union of all the given maps.
    + */
    +@ExpressionDescription(
    +  usage = "_FUNC_(map, ...) - Returns the union of all the given maps.",
    +  examples = """
    +    Examples:
    +      > SELECT _FUNC_(map(1, 'a', 2, 'b'), map(2, 'c', 3, 'd'));
    +       [[1 -> "a"], [2 -> "c"], [3 -> "d"]]
    +  """,
    +  since = "2.4.0")
    +case class MapConcat(children: Seq[Expression]) extends Expression {
    +
    +  override def checkInputDataTypes(): TypeCheckResult = {
    +    // check key types and value types separately to allow valueContainsNull to vary
    +    if (children.exists(!_.dataType.isInstanceOf[MapType])) {
    +      TypeCheckResult.TypeCheckFailure(
    +        s"The given input of function $prettyName should all be of type map, " +
    +          "but they are " + children.map(_.dataType.simpleString).mkString("[", ", ", "]"))
    +    } else if (children.map(_.dataType.asInstanceOf[MapType].keyType).distinct.length > 1) {
    +      TypeCheckResult.TypeCheckFailure(
    +        s"The given input maps of function $prettyName should all be the same type, " +
    +          "but they are " + children.map(_.dataType.simpleString).mkString("[", ", ", "]"))
    +    } else if (children.map(_.dataType.asInstanceOf[MapType].valueType).distinct.length > 1) {
    +      TypeCheckResult.TypeCheckFailure(
    +        s"The given input maps of function $prettyName should all be the same type, " +
    +          "but they are " + children.map(_.dataType.simpleString).mkString("[", ", ", "]"))
    +    } else {
    +      TypeCheckResult.TypeCheckSuccess
    +    }
    +  }
    +
    +  override def dataType: MapType = {
    +    MapType(
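    +      // With no inputs, fall back to map<string, string>.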
    +      keyType = children.headOption
    +        .map(_.dataType.asInstanceOf[MapType].keyType).getOrElse(StringType),
    +      valueType = children.headOption
    +        .map(_.dataType.asInstanceOf[MapType].valueType).getOrElse(StringType),
    +      valueContainsNull = children
    +        .exists(_.dataType.asInstanceOf[MapType].valueContainsNull)
    +    )
    +  }
    +
    +  override def nullable: Boolean = children.exists(_.nullable)
    +
    +  override def eval(input: InternalRow): Any = {
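    +    // java.util.LinkedHashMap keeps first-insertion key order; putting a
    +    // duplicate key overwrites the earlier value without moving the key.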
    +    val union = new util.LinkedHashMap[Any, Any]()
    +    children.map(_.eval(input)).foreach { raw =>
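    +      // A single null input map makes the whole result null.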
    +      if (raw == null) {
    +        return null
    +      }
    +      val map = raw.asInstanceOf[MapData]
    +      map.foreach(dataType.keyType, dataType.valueType, (k, v) =>
    +        union.put(k, v)
    +      )
    --- End diff --
    
    I found an issue. While preparing to add more tests, I noticed that map-typed keys aren't deduplicated properly in interpreted mode (codegen mode seems to work fine, so far).
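
    For context, `dfmapmap` holds two map-of-map columns. Its exact definition isn't shown in this thread, but one way to build something like it in the spark-shell would be:
    <pre>
    val dfmapmap = Seq(
      (Map(Map(1 -> 2, 3 -> 4) -> 101, Map(5 -> 6, 7 -> 8) -> 102),
       Map(Map(11 -> 12) -> 103, Map(1 -> 2, 3 -> 4) -> 1001))
    ).toDF("mapmap1", "mapmap2")
    </pre>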
    
    So, something like this doesn't work in interpreted mode:
    <pre>
    scala> dfmapmap.show(truncate=false)
    +--------------------------------------------------+---------------------------------------------+
    |mapmap1                                           |mapmap2                                      |
    +--------------------------------------------------+---------------------------------------------+
    |[[1 -> 2, 3 -> 4] -> 101, [5 -> 6, 7 -> 8] -> 102]|[[11 -> 12] -> 103, [1 -> 2, 3 -> 4] -> 1001]|
    +--------------------------------------------------+---------------------------------------------+
    scala> dfmapmap.select(map_concat('mapmap1, 'mapmap2).as('mapmap3)).show(truncate=false)
    +-----------------------------------------------------------------------------------------------+
    |mapmap3                                                                                        |
    +-----------------------------------------------------------------------------------------------+
    |[[1 -> 2, 3 -> 4] -> 101, [5 -> 6, 7 -> 8] -> 102, [11 -> 12] -> 103, [1 -> 2, 3 -> 4] -> 1001]|
    +-----------------------------------------------------------------------------------------------+
    </pre>
    As you can see, the key `[1 -> 2, 3 -> 4]` shows up twice in the new map.
    
    This is because:
    <pre>
      import org.apache.spark.sql.catalyst.util.{ArrayBasedMapData, GenericArrayData}

      val a1 = new ArrayBasedMapData(new GenericArrayData(Array(1, 3)), new GenericArrayData(Array(2, 4)))
      val a2 = new ArrayBasedMapData(new GenericArrayData(Array(1, 3)), new GenericArrayData(Array(2, 4)))
      a1 == a2 // will be false
      a1.hashCode() == a2.hashCode() // will be false
    </pre>
    Different instances of ArrayBasedMapData with the exact same data are not considered the same key: ArrayBasedMapData doesn't override equals or hashCode, so the LinkedHashMap falls back to reference identity.
    
    So far, this seems to work fine for instances of UnsafeMapData (that is, the UnsafeMapData versions of the above maps are considered the same key).
    
    I might need to add equals and hashCode methods to ArrayBasedMapData. Either that, or wrap each ArrayBasedMapData in some other object that does implement equals and hashCode properly, but only for its time as a key in the union map.
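
    To illustrate the wrapper idea, here's a rough, untested sketch (the class and member names are mine, not from the PR). In eval, each ArrayBasedMapData key would be wrapped before the union.put call and unwrapped when building the result:
    <pre>
    import org.apache.spark.sql.catalyst.util.MapData
    import org.apache.spark.sql.types.DataType

    // Hypothetical wrapper: gives a MapData key value-based equality while it
    // sits in the union LinkedHashMap. It compares entries in order, which is
    // enough to catch the duplicate keys shown above, though it would still
    // miss nested ArrayBasedMapData keys, which compare by reference.
    class MapDataKey(mapData: MapData, keyType: DataType, valueType: DataType) {
      private val entries: IndexedSeq[(Any, Any)] = {
        val buf = scala.collection.mutable.ArrayBuffer.empty[(Any, Any)]
        mapData.foreach(keyType, valueType, (k, v) => buf += ((k, v)))
        buf.toIndexedSeq
      }

      override def equals(other: Any): Boolean = other match {
        case that: MapDataKey => this.entries == that.entries
        case _ => false
      }

      override def hashCode(): Int = entries.hashCode()
    }
    </pre>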


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org