Posted to user@spark.apache.org by cjdc <cr...@cern.ch> on 2014/11/28 09:41:57 UTC

Spark SQL 1.0.0 - RDD from snappy compress avro file

Hi everyone,

I am using Spark 1.0.0 and I am facing some issues handling binary
snappy-compressed Avro files which I get from HDFS. I know there are
improved mechanisms for handling these files in more recent versions of
Spark, but upgrading is not an option since I am operating on a Cloudera
cluster with no admin privileges.

I would simply like to read some of these Avro files, create the RDD and then
run simple SQL queries against their content.
Following the Spark SQL 1.0.0 Programming Guide, we have:

val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext._

val myData = sc.textFile("/example/mydir/MyFile1.avro")
// ### QUESTION ###
// ### How to dynamically define the schema from the Avro header?? ###
//
// val Schema =

myData.registerAsTable("MyDB")

val query = sql("SELECT * FROM MyDB")
query.collect().foreach(println)

So, how would you modify this to make it work (considering the Spark
version)?
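
For reference, the writer schema itself sits in the header of the Avro
container and can be read with the plain Avro API. The snippet below is only
a sketch, assuming Avro 1.7.x and a locally accessible copy of the file;
snappy is a block codec inside the container, so the reader decompresses it
transparently:

import java.io.File
import org.apache.avro.file.DataFileReader
import org.apache.avro.generic.{GenericDatumReader, GenericRecord}

// Open the container file and read the schema stored in its header
val reader = DataFileReader.openReader(
  new File("MyFile1.avro"), new GenericDatumReader[GenericRecord]())
val schema = reader.getSchema
println(schema.toString(true))   // pretty-printed JSON schema
reader.close()

The missing piece is how to turn that runtime schema into something Spark SQL
1.0.0 can actually query.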

Thanks





Re: Spark SQL 1.0.0 - RDD from snappy compress avro file

Posted by cjdc <cr...@cern.ch>.
Ideas?





Re: Spark SQL 1.0.0 - RDD from snappy compress avro file

Posted by cjdc <cr...@cern.ch>.
By the way, the same error from above also happens on 1.1.0 (just tested).





Re: Spark SQL 1.0.0 - RDD from snappy compress avro file

Posted by cjdc <cr...@cern.ch>.
Hi Vikas and Simone,

thanks for the replies.
Yeah I understand this would be easier with 1.2 but this is completely out
of my control. I really have to work with 1.0.0.

About Simone's approach, during the imports I get:

scala> import org.apache.avro.mapreduce.{ AvroJob, AvroKeyInputFormat, AvroKeyOutputFormat }
<console>:17: error: object mapreduce is not a member of package org.apache.avro
       import org.apache.avro.mapreduce.{ AvroJob, AvroKeyInputFormat, AvroKeyOutputFormat }
                              ^

scala> import org.apache.avro.mapred.AvroKey
<console>:17: error: object mapred is not a member of package org.apache.avro
       import org.apache.avro.mapred.AvroKey
                              ^

scala> import com.twitter.chill.avro.AvroSerializer
<console>:18: error: object avro is not a member of package com.twitter.chill
       import com.twitter.chill.avro.AvroSerializer
                                ^
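
Could the cause simply be missing jars on the classpath? As far as I can
tell, org.apache.avro.mapred and org.apache.avro.mapreduce come from the
avro-mapred artifact (not from the core avro jar), and com.twitter.chill.avro
comes from chill-avro. So something like the sbt sketch below (version
numbers are only illustrative; they should match the cluster's Avro/Hadoop),
or the same jars passed to spark-shell via --jars (or ADD_JARS /
SPARK_CLASSPATH on 1.0.0):

// build.sbt (illustrative versions)
libraryDependencies ++= Seq(
  "org.apache.avro"  % "avro"        % "1.7.6",
  "org.apache.avro"  % "avro-mapred" % "1.7.6" classifier "hadoop2",
  "com.twitter"     %% "chill-avro"  % "0.4.0"
)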








Re: Spark SQL 1.0.0 - RDD from snappy compress avro file

Posted by Simone Franzini <ca...@gmail.com>.
Did you have a look at my reply in this thread?

http://apache-spark-user-list.1001560.n3.nabble.com/How-can-I-read-this-avro-file-using-spark-amp-scala-td19400.html

I am using 1.1.0 though, so not sure if that code would work entirely with
1.0.0, but you can try.
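
Roughly, the approach there boils down to the sketch below (untested on
1.0.0; the record and field names are just placeholders). Since Spark SQL
1.0.0 has no API for building a table from a runtime schema, the records are
mapped onto a case class and the createSchemaRDD implicit infers the schema
by reflection:

import org.apache.avro.generic.GenericRecord
import org.apache.avro.mapred.AvroKey
import org.apache.avro.mapreduce.AvroKeyInputFormat
import org.apache.hadoop.io.NullWritable

// Placeholder case class: its fields must mirror the Avro schema
case class MyRecord(id: Long, name: String)

val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext._

// Read the Avro container with the Hadoop "new API" input format;
// the snappy codec is handled inside the Avro reader
val avroRdd = sc.newAPIHadoopFile(
  "/example/mydir/MyFile1.avro",
  classOf[AvroKeyInputFormat[GenericRecord]],
  classOf[AvroKey[GenericRecord]],
  classOf[NullWritable])

// Map each GenericRecord onto the case class, then register it as a table
val rows = avroRdd.map { case (key, _) =>
  val r = key.datum()
  MyRecord(r.get("id").asInstanceOf[Long], r.get("name").toString)
}

rows.registerAsTable("MyDB")
sql("SELECT * FROM MyDB").collect().foreach(println)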


Simone Franzini, PhD

http://www.linkedin.com/in/simonefranzini

On Sat, Nov 29, 2014 at 5:43 AM, Vikas Agarwal <vi...@infoobjects.com>
wrote:

> Just in case it helps: https://github.com/databricks/spark-avro
>
> On Fri, Nov 28, 2014 at 8:48 PM, cjdc <cr...@cern.ch> wrote:
>
>> To make it simpler, for now forget the snappy compression. Just assume
>> they
>> are binary Avro files...
>
> --
> Regards,
> Vikas Agarwal
> 91 – 9928301411
>
> InfoObjects, Inc.
> Execution Matters
> http://www.infoobjects.com
> 2041 Mission College Boulevard, #280
> Santa Clara, CA 95054
> +1 (408) 988-2000 Work
> +1 (408) 716-2726 Fax
>
>

Re: Spark SQL 1.0.0 - RDD from snappy compress avro file

Posted by Vikas Agarwal <vi...@infoobjects.com>.
Just in case it helps: https://github.com/databricks/spark-avro
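
Note that spark-avro builds on the external data sources API introduced in
Spark 1.2, so it needs 1.2 or later; there the whole pipeline collapses to
roughly this (a sketch along the lines of the project README, not tested
here):

import com.databricks.spark.avro._

val sqlContext = new org.apache.spark.sql.SQLContext(sc)
val myData = sqlContext.avroFile("/example/mydir/MyFile1.avro")
myData.registerTempTable("MyDB")
sqlContext.sql("SELECT * FROM MyDB").collect().foreach(println)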

On Fri, Nov 28, 2014 at 8:48 PM, cjdc <cr...@cern.ch> wrote:

> To make it simpler, for now forget the snappy compression. Just assume they
> are binary Avro files...


-- 
Regards,
Vikas Agarwal
91 – 9928301411

InfoObjects, Inc.
Execution Matters
http://www.infoobjects.com
2041 Mission College Boulevard, #280
Santa Clara, CA 95054
+1 (408) 988-2000 Work
+1 (408) 716-2726 Fax

Re: Spark SQL 1.0.0 - RDD from snappy compress avro file

Posted by cjdc <cr...@cern.ch>.
To make it simpler, for now forget the snappy compression. Just assume they
are binary Avro files...




