Posted to user@spark.apache.org by Costin Leau <co...@gmail.com> on 2014/09/30 11:05:29 UTC

SparkSQL DataType mappings

Hi,

I'm working on supporting SchemaRDD in Elasticsearch Hadoop [1], but I'm having some issues with the SQL API, in
particular with what the DataTypes translate to.

1. A SchemaRDD is composed of Rows and a StructType - I'm using the latter to decompose a Row into primitives. I'm not
clear, however, on how to deal with _rich_ types, namely array, map and struct.
MapType gives me type information about the key and its value, but what is the actual Map object: j.u.Map or scala.Map?
For example, assuming row(0) has a MapType associated with it, to what do I cast row(0)?
The same goes for StructType; if row(1) has a StructType associated with it, do I cast the value to a Row?

2. Similar to the above, I've noticed the Row interface has typed accessor methods, so ideally one should use
row(index).getFloat|Integer|Boolean etc., but I didn't see any methods for Binary or Decimal. The _rich_ types are also
missing; I presume this is for pluggability reasons, but what's the generic way to access/unwrap the generic
Any/Object in this case to the desired DataType?

3. On a separate note, for RDDs containing just values (think CSV/TSV files), is there an option to have a header
associated with them without having to wrap each row in a case class? As each entry has exactly the same structure, the
wrapping is just overhead that doesn't provide any extra information (if you know the structure of one row, you know it
for all of them).

Thanks,

[1] github.com/elasticsearch/elasticsearch-hadoop
-- 
Costin

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscribe@spark.apache.org
For additional commands, e-mail: user-help@spark.apache.org


Re: SparkSQL DataType mappings

Posted by Costin Leau <co...@gmail.com>.
Hi Yin,

Thanks for the reply. I found the section as well a couple of days ago and managed to integrate es-hadoop with Spark
SQL [1].

Cheers,

[1] http://www.elasticsearch.org/guide/en/elasticsearch/hadoop/master/spark.html

On 10/2/14 6:32 PM, Yin Huai wrote:
> Hi Costin,
>
> I am answering your questions below.
>
> 1. You can find the Spark SQL data type reference here
> <http://spark.apache.org/docs/1.1.0/sql-programming-guide.html#spark-sql-datatype-reference>. It lists the underlying
> type for each Spark SQL data type in the Scala, Java, and Python APIs. For example, in the Scala API the underlying
> type of MapType is scala.collection.Map, while in the Java API it is java.util.Map. For StructType, yes, it should be
> cast to Row.
>
> 2. Interfaces like getFloat and getInteger are for primitive data types. For other types, you can access values by
> ordinal, for example row(1). Right now, you have to cast values accessed by ordinal. Once
> https://github.com/apache/spark/pull/1759 is in, accessing values in a row will be much easier.
>
> 3. We are working on supporting CSV files (https://github.com/apache/spark/pull/1351). Right now, you can use our
> programmatic APIs
> <http://spark.apache.org/docs/1.1.0/sql-programming-guide.html#programmatically-specifying-the-schema> to create
> SchemaRDDs. Basically, you first define the schema (represented by a StructType) of the SchemaRDD. Then, convert your
> RDD (for example, RDD[String]) directly to an RDD[Row]. Finally, use applySchema, provided by SQLContext/HiveContext, to
> apply the defined schema to the RDD[Row]. The return value of applySchema is the SchemaRDD you want.
>
> Thanks,
>
> Yin
>
> On Tue, Sep 30, 2014 at 5:05 AM, Costin Leau <costin.leau@gmail.com> wrote:
>
>     Hi,
>
>     I'm working on supporting SchemaRDD in Elasticsearch Hadoop [1], but I'm having some issues with the SQL API, in
>     particular with what the DataTypes translate to.
>
>     1. A SchemaRDD is composed of Rows and a StructType - I'm using the latter to decompose a Row into primitives. I'm
>     not clear, however, on how to deal with _rich_ types, namely array, map and struct.
>     MapType gives me type information about the key and its value, but what is the actual Map object: j.u.Map or scala.Map?
>     For example, assuming row(0) has a MapType associated with it, to what do I cast row(0)?
>     The same goes for StructType; if row(1) has a StructType associated with it, do I cast the value to a Row?
>
>     2. Similar to the above, I've noticed the Row interface has typed accessor methods, so ideally one should use
>     row(index).getFloat|Integer|Boolean etc., but I didn't see any methods for Binary or Decimal. The _rich_
>     types are also missing; I presume this is for pluggability reasons, but what's the generic way to access/unwrap the
>     generic Any/Object in this case to the desired DataType?
>
>     3. On a separate note, for RDDs containing just values (think CSV/TSV files), is there an option to have a header
>     associated with them without having to wrap each row in a case class? As each entry has exactly the same structure,
>     the wrapping is just overhead that doesn't provide any extra information (if you know the structure of one row, you
>     know it for all of them).
>
>     Thanks,
>
>     [1] github.com/elasticsearch/elasticsearch-hadoop
>     --
>     Costin
>
>     ---------------------------------------------------------------------
>     To unsubscribe, e-mail: user-unsubscribe@spark.apache.org
>     For additional commands, e-mail: user-help@spark.apache.org
>
>

-- 
Costin

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscribe@spark.apache.org
For additional commands, e-mail: user-help@spark.apache.org


Re: SparkSQL DataType mappings

Posted by Yin Huai <hu...@gmail.com>.
Hi Costin,

I am answering your questions below.

1. You can find the Spark SQL data type reference here
<http://spark.apache.org/docs/1.1.0/sql-programming-guide.html#spark-sql-datatype-reference>.
It lists the underlying type for each Spark SQL data type in the Scala,
Java, and Python APIs. For example, in the Scala API the underlying Scala type
of MapType is scala.collection.Map, while in the Java API it is
java.util.Map. For StructType, yes, it should be cast to Row.
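
For illustration, a minimal sketch in the Scala API (not taken from the docs; the field
positions and value types are made up, assuming field 0 is a
MapType(StringType, IntegerType) and field 1 is a nested StructType):

    import org.apache.spark.sql._

    def decode(row: Row): Unit = {
      // MapType values come back as scala.collection.Map in the Scala API
      val m = row(0).asInstanceOf[scala.collection.Map[String, Int]]
      // StructType values come back as nested Rows
      val nested = row(1).asInstanceOf[Row]
      println(m.mkString(", ") + " | " + nested(0))
    }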

2. Interfaces like getFloat and getInteger are for primitive data types. For
other types, you can access values by ordinal, for example row(1). Right
now, you have to cast values accessed by ordinal. Once
https://github.com/apache/spark/pull/1759 is in, accessing values in a row
will be much easier.
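
Roughly, with hypothetical ordinals and field types (per the data type
reference, Binary maps to Array[Byte] and Decimal to scala.math.BigDecimal
in the Scala API):

    import org.apache.spark.sql._

    def readFields(row: Row): Unit = {
      val score   = row.getFloat(0)                            // FloatType via a typed getter
      val active  = row.getBoolean(1)                          // BooleanType via a typed getter
      val payload = row(2).asInstanceOf[Array[Byte]]           // BinaryType: access by ordinal and cast
      val price   = row(3).asInstanceOf[scala.math.BigDecimal] // DecimalType: access by ordinal and cast
      println(s"$score $active ${payload.length} $price")
    }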

3. We are working on supporting CSV files (
https://github.com/apache/spark/pull/1351). Right now, you can use our
programmatic APIs
<http://spark.apache.org/docs/1.1.0/sql-programming-guide.html#programmatically-specifying-the-schema>
to create SchemaRDDs. Basically, you first define the schema (represented by
a StructType) of the SchemaRDD. Then, convert your RDD (for example,
RDD[String]) directly to an RDD[Row]. Finally, use applySchema, provided by
SQLContext/HiveContext, to apply the defined schema to the RDD[Row]. The
return value of applySchema is the SchemaRDD you want.
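
A rough sketch of that flow, along the lines of the example in the guide
(assuming an existing SparkContext sc; the file name and fields are made up):

    import org.apache.spark.sql._

    val sqlContext = new SQLContext(sc)

    // Define the schema once instead of wrapping every line in a case class
    val schema = StructType(Seq(
      StructField("name", StringType, nullable = true),
      StructField("age",  IntegerType, nullable = true)))

    // Turn each CSV line into a Row, then attach the schema
    val rowRDD = sc.textFile("people.csv")
      .map(_.split(","))
      .map(p => Row(p(0), p(1).trim.toInt))
    val peopleSchemaRDD = sqlContext.applySchema(rowRDD, schema)
    peopleSchemaRDD.registerTempTable("people")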

Thanks,

Yin

On Tue, Sep 30, 2014 at 5:05 AM, Costin Leau <co...@gmail.com> wrote:

> Hi,
>
> I'm working on supporting SchemaRDD in Elasticsearch Hadoop [1], but I'm
> having some issues with the SQL API, in particular with what the DataTypes
> translate to.
>
> 1. A SchemaRDD is composed of Rows and a StructType - I'm using the latter
> to decompose a Row into primitives. I'm not clear, however, on how to deal
> with _rich_ types, namely array, map and struct.
> MapType gives me type information about the key and its value, but what is
> the actual Map object: j.u.Map or scala.Map?
> For example, assuming row(0) has a MapType associated with it, to what do I
> cast row(0)?
> The same goes for StructType; if row(1) has a StructType associated with it,
> do I cast the value to a Row?
>
> 2. Similar to the above, I've noticed the Row interface has typed accessor
> methods, so ideally one should use row(index).getFloat|Integer|Boolean etc., but
> I didn't see any methods for Binary or Decimal. The _rich_ types are also
> missing; I presume this is for pluggability reasons, but what's the
> generic way to access/unwrap the generic Any/Object in this case to the
> desired DataType?
>
> 3. On a separate note, for RDDs containing just values (think CSV/TSV
> files), is there an option to have a header associated with them without
> having to wrap each row in a case class? As each entry has exactly the
> same structure, the wrapping is just overhead that doesn't provide any
> extra information (if you know the structure of one row, you know it for all
> of them).
>
> Thanks,
>
> [1] github.com/elasticsearch/elasticsearch-hadoop
> --
> Costin
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: user-unsubscribe@spark.apache.org
> For additional commands, e-mail: user-help@spark.apache.org
>
>