Posted to user@sqoop.apache.org by Brian Henriksen <Br...@humedica.com> on 2015/12/09 21:07:04 UTC

Exporting parquet, issues with schema

I am trying to use Sqoop to export some Parquet data from HDFS to Oracle.  The first problem I ran into is that a Parquet export requires a .metadata directory that is created by a Sqoop Parquet IMPORT.  (Can anyone explain this to me? It seems odd that one can only send data to a database that you just grabbed from a database.)  I got around this by converting a small subset of my Parquet data to text, Sqoop-exporting the text to Oracle, and then Sqoop-importing the data back to HDFS as Parquet, which produced the .metadata directory.  Here is the error I'm getting:



java.lang.NullPointerException
at java.io.StringReader.<init>(StringReader.java:50)
at org.apache.avro.Schema$Parser.parse(Schema.java:917)
at parquet.avro.AvroReadSupport.prepareForRead(AvroReadSupport.java:54)
at parquet.hadoop.InternalParquetRecordReader.initialize(InternalParquetRecordReader.java:142)
at parquet.hadoop.ParquetRecordReader.initializeInternalReader(ParquetRecordReader.java:118)
at parquet.hadoop.ParquetRecordReader.initialize(ParquetRecordReader.java:107)
at org.kitesdk.data.spi.AbstractKeyRecordReaderWrapper.initialize(AbstractKeyRecordReaderWrapper.java:50)
at org.apache.hadoop.mapred.MapTask$NewTrackingRecordReader.initialize(MapTask.java:478)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:671)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:330)
at org.apache.hadoop.mapred.Child$4.run(Child.java:268)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroup

It looks like Sqoop gets to the point of starting up the mappers, but they are not aware of my Parquet/Avro schema.  Where does Sqoop look for these schemas?  As far as I know, Parquet files include the schema within the data files themselves; in addition, there is the .metadata directory, which contains an .avsc JSON file with the same schema.  Any ideas?
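
For reference, both copies of the schema can be inspected from the command line.  A rough sketch (the paths and jar names are placeholders, and the exact layout under .metadata can vary by Kite version):

# Print the schema embedded in one of the Parquet data files
hadoop jar parquet-tools-1.6.0.jar schema /user/me/mytable/part-m-00000.parquet

# Print the Avro schema that Kite keeps next to the data
hdfs dfs -cat /user/me/mytable/.metadata/schema.avsc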

Re: Exporting parquet, issues with schema

Posted by Brian Henriksen <Br...@humedica.com>.
I'm actually not able to do a successful import either.

I get this error on trying to import:


2015-12-10 13:38:40,102 WARN org.apache.sqoop.manager.oracle.OracleConnectionFactory: No Oracle 'session initialization' statements were found to execute. Check that your oraoop-site-template.xml and/or oraoop-site.xml files are correctly installed in the ${SQOOP_HOME}/conf directory.
2015-12-10 13:38:40,873 INFO org.apache.hadoop.mapred.TaskLogsTruncater: Initializing logs' truncater with mapRetainSize=-1 and reduceRetainSize=-1
2015-12-10 13:38:40,877 FATAL org.apache.hadoop.mapred.Child: Error running child : java.lang.NoSuchMethodError: parquet.avro.AvroParquetWriter.<init>(Lorg/apache/hadoop/fs/Path;Lorg/apache/avro/Schema;Lparquet/hadoop/metadata/CompressionCodecName;IIZLorg/apache/hadoop/conf/Configuration;)V
at org.kitesdk.data.spi.filesystem.ParquetAppender.open(ParquetAppender.java:66)
at org.kitesdk.data.spi.filesystem.FileSystemWriter.initialize(FileSystemWriter.java:135)
at org.kitesdk.data.spi.filesystem.FileSystemView.newWriter(FileSystemView.java:101)
at org.kitesdk.data.mapreduce.DatasetKeyOutputFormat$DatasetRecordWriter.<init>(DatasetKeyOutputFormat.java:308)
at org.kitesdk.data.mapreduce.DatasetKeyOutputFormat.getRecordWriter(DatasetKeyOutputFormat.java:445)
at org.apache.hadoop.mapred.MapTask$NewDirectOutputCollector.<init>(MapTask.java:548)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:653)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:330)
at org.apache.hadoop.mapred.Child$4.run(Child.java:268)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1438)
at org.apache.hadoop.mapred.Child.main(Child.java:262)


I am using parquet-avro-1.6.0.jar.  In this version there is no constructor for AvroParquetWriter that matches the signature in the error above, which tells me that Sqoop is expecting an older version of parquet.  Which version is that, and how can I provide it to my Sqoop job without interfering with jobs that use the up-to-date version of parquet-avro?
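
For reference, this is roughly how the parquet-avro jars on the classpath, and the constructors a given jar actually provides, can be checked (jar names and paths below are illustrative):

# See which parquet-avro jars Sqoop and Hadoop put on the classpath
ls $SQOOP_HOME/lib | grep -i parquet
hadoop classpath | tr ':' '\n' | grep -i parquet

# List the constructors a given jar actually provides
javap -classpath parquet-avro-1.6.0.jar parquet.avro.AvroParquetWriter | grep AvroParquetWriter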


On 12/10/15, 4:10 AM, "Jarek Jarcec Cecho" <jarcec@gmail.com on behalf of
jarcec@apache.org> wrote:

>Hi Brian,
>Sqoop uses a library called Kite to work with parquet files. The library
>requires the .metadata directory and hence the dependency.
>
>A workaround if you need to export an arbitrary parquet directory is to
>call the command "kite create" on the directory first - this will create
>the required .metadata directory and you should be good to proceed with
>export.
>
>We're choosing a different route in Sqoop 2, so this unfortunate need
>won't be there.
>
>Jarcec
>
>> On Dec 9, 2015, at 9:07 PM, Brian Henriksen
>><Br...@humedica.com> wrote:
>> 
>> I am trying to use sqoop to export some parquet data to oracle from
>>HDFS.  The first problem I ran into is that parquet export requires a
>>.metadata directory that is created by a sqoop parquet IMPORT (Can
>>anyone explain this to me, it seems odd to me that one can only send
>>data to a database, that you just grabbed from a database).  I got
>>around this by converting a small subset of my parquet data to text,
>>sqoop export the text to oracle, and then sqoop import the data back to
>>HDFS as parquet, and with it the .metadata directory.  Here is the error
>>I'm getting:
>> 
>> 
>> java.lang.NullPointerException
>> at java.io.StringReader.<init>(StringReader.java:50)
>> at org.apache.avro.Schema$Parser.parse(Schema.java:917)
>> at parquet.avro.AvroReadSupport.prepareForRead(AvroReadSupport.java:54)
>> at parquet.hadoop.InternalParquetRecordReader.initialize(InternalParquetRecordReader.java:142)
>> at parquet.hadoop.ParquetRecordReader.initializeInternalReader(ParquetRecordReader.java:118)
>> at parquet.hadoop.ParquetRecordReader.initialize(ParquetRecordReader.java:107)
>> at org.kitesdk.data.spi.AbstractKeyRecordReaderWrapper.initialize(AbstractKeyRecordReaderWrapper.java:50)
>> at org.apache.hadoop.mapred.MapTask$NewTrackingRecordReader.initialize(MapTask.java:478)
>> at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:671)
>> at org.apache.hadoop.mapred.MapTask.run(MapTask.java:330)
>> at org.apache.hadoop.mapred.Child$4.run(Child.java:268)
>> at java.security.AccessController.doPrivileged(Native Method)
>> at javax.security.auth.Subject.doAs(Subject.java:415)
>> at org.apache.hadoop.security.UserGroup
>> 
>> This looks like sqoop is getting to the point of starting up the
>>mappers, but they are not aware of my parquet / avro schema.  Where does
>>sqoop look for these schemas?  As far as I know, parquet files include
>>the schema within the data files themselves, in addition to this there
>>is the .metadata directory that contains a .avsc JSON file with the same
>>schema.  Any ideas?
>


Re: Exporting parquet, issues with schema

Posted by Jarek Jarcec Cecho <ja...@apache.org>.
Hi Brian,
Sqoop uses a library called Kite to work with parquet files. The library requires the .metadata directory and hence the dependency.

A workaround if you need to export an arbitrary parquet directory is to call the command "kite create" on the directory first - this will create the required .metadata directory and you should be good to proceed with export.
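
A rough sketch of what that could look like with the Kite command-line tool (the dataset URI, schema file, and options below are illustrative - check the Kite CLI help for your version):

# Create the Kite metadata for an existing Parquet directory on HDFS
kite-dataset create dataset:hdfs:/user/me/mytable \
    --schema mytable.avsc \
    --format parquet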

We're choosing a different route in Sqoop 2, so this unfortunate need won't be there.

Jarcec

> On Dec 9, 2015, at 9:07 PM, Brian Henriksen <Br...@humedica.com> wrote:
> 
> I am trying to use sqoop to export some parquet data to oracle from HDFS.  The first problem I ran into is that parquet export requires a .metadata directory that is created by a sqoop parquet IMPORT (Can anyone explain this to me, it seems odd to me that one can only send data to a database, that you just grabbed from a database).  I got around this by converting a small subset of my parquet data to text, sqoop export the text to oracle, and then sqoop import the data back to HDFS as parquet, and with it the .metadata directory.  Here is the error I'm getting:
> 
> 
> java.lang.NullPointerException
> at java.io.StringReader.<init>(StringReader.java:50)
> at org.apache.avro.Schema$Parser.parse(Schema.java:917)
> at parquet.avro.AvroReadSupport.prepareForRead(AvroReadSupport.java:54)
> at parquet.hadoop.InternalParquetRecordReader.initialize(InternalParquetRecordReader.java:142)
> at parquet.hadoop.ParquetRecordReader.initializeInternalReader(ParquetRecordReader.java:118)
> at parquet.hadoop.ParquetRecordReader.initialize(ParquetRecordReader.java:107)
> at org.kitesdk.data.spi.AbstractKeyRecordReaderWrapper.initialize(AbstractKeyRecordReaderWrapper.java:50)
> at org.apache.hadoop.mapred.MapTask$NewTrackingRecordReader.initialize(MapTask.java:478)
> at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:671)
> at org.apache.hadoop.mapred.MapTask.run(MapTask.java:330)
> at org.apache.hadoop.mapred.Child$4.run(Child.java:268)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:415)
> at org.apache.hadoop.security.UserGroup
> 
> This looks like sqoop is getting to the point of starting up the mappers, but they are not aware of my parquet / avro schema.  Where does sqoop look for these schemas?  As far as I know, parquet files include the schema within the data files themselves, in addition to this there is the .metadata directory that contains a .avsc JSON file with the same schema.  Any ideas?