Posted to dev@sqoop.apache.org by Gwen Shapira <gs...@cloudera.com> on 2014/08/06 22:12:14 UTC

Re: Review Request 22516: Support importing mainframe sequential datasets


> On July 10, 2014, 8:22 a.m., Venkat Ranganathan wrote:
> > src/java/org/apache/sqoop/manager/MainframeManager.java, line 75
> > <https://reviews.apache.org/r/22516/diff/1/?file=608148#file608148line75>
> >
> >     Is import into HBase and Accumulo supported by this tool?  From the command help, it looks like the only supported target is HDFS text files.
> 
> Mariappan Asokan wrote:
>     Each record in a mainframe dataset is treated as a single field (or column).  So HBase, Accumulo, and Hive are theoretically supported, but with limited usability, which is why I did not add them to the documentation.  If you feel strongly that they should be documented, I can work on that in the next version of the patch.
> 
> Venkat Ranganathan wrote:
>     I feel it would be good to say we import only as text files and leave further processing, such as loading into Hive/HBase, up to the user, since the composition of the records and the needed processing differ and the schema can't be inferred.
> 
> Mariappan Asokan wrote:
>     I agree with you.  To avoid confusion, I plan to remove support for parsing the input format, output format, Hive, HBase, HCatalog, and codegen options.  This will synchronize the documentation with the code.  What do you think?
>
> 
> Venkat Ranganathan wrote:
>     Sorry for the delay.  I was wondering whether the mainframe connector could just define connector-specific extra args instead of creating another tool.  Please see NetezzaManager or DirectNetezzaManager as examples.  Maybe you have to invent a new synthetic URI format, say jdbc:mfftp:<host address>:<port>/dataset, and choose your connection manager when the --connect option is given with that URI format.  That should simplify a whole lot, in my opinion.  What do you think?
> 
> Mariappan Asokan wrote:
>     Thanks for your suggestions.  Sorry I did not get back sooner.  In Sqoop 1.x, there is a strong assumption that the input source is always a database table.  Because of this, the Sqoop import tool has many options that are relevant only to a source database table.  A mainframe source is totally different from a database table, so I think it is better to create a separate tool for mainframe import rather than just a new connection manager.  The mainframe import tool will not support many options that the database import tool supports, and it will have its own options that the database import tool does not; at present, these are the host name and the partitioned dataset name.  In the future, the mainframe import tool may be enhanced with metadata-specific or connection-specific arguments unique to the mainframe.  Creating a synthetic URI for a connection seems somewhat artificial to me.
>     
>     Contrary to what I stated before, and considering possible future enhancements, I think it is better to retain support for parsing the input format, output format, Hive, HBase, HCatalog, and codegen options.  The documentation will be enhanced in the future to reflect this support.
>
> 
> Venkat Ranganathan wrote:
>     Thanks for your thoughts on the suggestion.  As you correctly pointed out, Sqoop 1.x has a JDBC model (that is why you had to implement a ConnectionManager and provide pseudo-values for column types, etc., always returning VARCHAR).  I understand there will be options that mainframe import will not support (much as there are MySQL-, Netezza-, or SQL Server-specific options).  I understand you want to have specific metadata for mainframe import; that may be tricky.  Connection-specific arguments can be implemented the same way JDBC connection-specific arguments are done.
>     
>     The reason for my suggestion was primarily to piggyback on the implementation for imports into Hive/HBase in the future, when you have the ability to provide specific metadata on the data.
>     You can definitely parse the various options, but you have to explicitly check and exit if unsupported options are used.
>     
>     My only worry with this tool is that it may be a one-off for mainframe imports alone: we will start with HDFS import only until the rest of the parts are done, and by the time we finally see the whole thing, it will basically duplicate some of the code and may be difficult to maintain.
>

I just checked the possibility of adding non-JDBC imports as part of the import tool, using a fake connection URL as you suggested.
This is not feasible: ConnManager (which you would need to extend) has to implement getConnection, which returns a java.sql.Connection.  There is no such connection object to return for an FTP transfer.  The same goes for readTable, which must return a ResultSet.
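
To make the constraint concrete, here is roughly what the relevant part of the ConnManager contract looks like (signatures paraphrased; see org.apache.sqoop.manager.ConnManager in trunk for the authoritative definitions):

    import java.sql.Connection;
    import java.sql.ResultSet;
    import java.sql.SQLException;

    public abstract class ConnManager {
      // Every manager must hand back a live JDBC connection.  An
      // FTP-based mainframe transfer has no java.sql.Connection to
      // return.
      public abstract Connection getConnection() throws SQLException;

      // Likewise, readTable must produce a JDBC ResultSet, which has
      // no FTP analogue.
      public abstract ResultSet readTable(String tableName, String[] columns)
          throws SQLException;

      // ... many other JDBC-oriented methods omitted ...
    }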

I think a separate tool is the only way to go.
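
And wiring in a separate tool is a small amount of code.  A hypothetical sketch of the registration, mirroring how SqoopTool's static block registers the built-in tools (treat the exact helper name and constructor signature as illustrative, not authoritative):

    // In org.apache.sqoop.tool.SqoopTool's static initializer,
    // alongside the existing tool registrations:
    registerTool("import-mainframe", MainframeImportTool.class,
        "Import mainframe sequential datasets to HDFS");

    // A minimal tool class that reuses the existing import machinery:
    public class MainframeImportTool extends ImportTool {
      public MainframeImportTool() {
        // "import-mainframe" becomes the sub-command name;
        // false = this is not an all-tables import.
        super("import-mainframe", false);
      }
    }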


- Gwen


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/22516/#review47555
-----------------------------------------------------------


On June 14, 2014, 10:46 p.m., Mariappan Asokan wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/22516/
> -----------------------------------------------------------
> 
> (Updated June 14, 2014, 10:46 p.m.)
> 
> 
> Review request for Sqoop.
> 
> 
> Repository: sqoop-trunk
> 
> 
> Description
> -------
> 
> This patch adds support for moving mainframe sequential datasets to Hadoop.
> 
> 
> Diffs
> -----
> 
>   src/java/org/apache/sqoop/manager/MainframeManager.java PRE-CREATION 
>   src/java/org/apache/sqoop/mapreduce/MainframeDatasetFTPRecordReader.java PRE-CREATION 
>   src/java/org/apache/sqoop/mapreduce/MainframeDatasetImportMapper.java PRE-CREATION 
>   src/java/org/apache/sqoop/mapreduce/MainframeDatasetInputFormat.java PRE-CREATION 
>   src/java/org/apache/sqoop/mapreduce/MainframeDatasetInputSplit.java PRE-CREATION 
>   src/java/org/apache/sqoop/mapreduce/MainframeDatasetRecordReader.java PRE-CREATION 
>   src/java/org/apache/sqoop/mapreduce/MainframeImportJob.java PRE-CREATION 
>   src/java/org/apache/sqoop/tool/MainframeImportTool.java PRE-CREATION 
>   src/java/org/apache/sqoop/tool/SqoopTool.java dbe429a 
>   src/java/org/apache/sqoop/util/MainframeFTPClientUtils.java PRE-CREATION 
>   src/test/org/apache/sqoop/manager/TestMainframeManager.java PRE-CREATION 
>   src/test/org/apache/sqoop/mapreduce/TestMainframeDatasetFTPRecordReader.java PRE-CREATION 
>   src/test/org/apache/sqoop/mapreduce/TestMainframeDatasetInputFormat.java PRE-CREATION 
>   src/test/org/apache/sqoop/mapreduce/TestMainframeDatasetInputSplit.java PRE-CREATION 
>   src/test/org/apache/sqoop/mapreduce/TestMainframeImportJob.java PRE-CREATION 
>   src/test/org/apache/sqoop/tool/TestMainframeImportTool.java PRE-CREATION 
>   src/test/org/apache/sqoop/util/TestMainframeFTPClientUtils.java PRE-CREATION 
> 
> Diff: https://reviews.apache.org/r/22516/diff/
> 
> 
> Testing
> -------
> 
> 
> Thanks,
> 
> Mariappan Asokan
> 
>