Posted to user@hive.apache.org by Rahul Channe <dr...@googlemail.com> on 2016/08/23 19:48:52 UTC

Loading Sybase to Hive using Sqoop

Hi All,

We are trying to load data from a Sybase IQ table to Hive using Sqoop. The
Hive table is partitioned and is expected to hold 29M records per day.

The Sqoop job takes 7 hours to load 15 days of data, even with the direct
load option set to 6. Hive is using the MR framework.

Is there a way to speed up the process?

Note - the aim is to load 1 year of data.

Re: Loading Sybase to Hive using Sqoop

Posted by Jörn Franke <jo...@gmail.com>.
Is your Sybase server ready to deliver a large amount of data (network, memory, CPU, parallel access, other resources)? This is usually the problem when loading data from a relational database, and less so Sqoop / MR or Spark.
Then, you should have a recent Hive version and store in ORC or Parquet, compressed (Snappy), not in a text-based format.
Another alternative would be to use one of the export tools supplied with Sybase, export as a compressed file, put the file on HDFS and load it into Hive. This only makes sense if the export tool from Sybase outperforms a JDBC connection (which can happen, depending on the relational database).
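
As a rough illustration of the storage point, a minimal HiveQL sketch of such
a table (the table and column names here are invented for the example):

    -- Daily-partitioned table stored as ORC with Snappy compression
    CREATE TABLE iq_import (
      id      BIGINT,
      payload STRING
    )
    PARTITIONED BY (load_date STRING)
    STORED AS ORC
    TBLPROPERTIES ("orc.compress"="SNAPPY");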






Re: Loading Sybase to Hive using Sqoop

Posted by Mich Talebzadeh <mi...@gmail.com>.
Are you using a vendor distro or in-house build?

Dr Mich Talebzadeh



LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw



http://talebzadehmich.wordpress.com


Disclaimer: Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




Re: Loading Sybase to Hive using Sqoop

Posted by Mich Talebzadeh <mi...@gmail.com>.
hm. Watching paint dry :)

Dr Mich Talebzadeh



LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw



http://talebzadehmich.wordpress.com


Disclaimer: Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.



On 24 August 2016 at 22:07, Rahul Channe <dr...@googlemail.com> wrote:

> We are running Hive on MR.

Re: Loading Sybase to Hive using Sqoop

Posted by Rahul Channe <dr...@googlemail.com>.
Thank you all for the valuable inputs.


Re: Loading Sybase to Hive using Sqoop

Posted by Mich Talebzadeh <mi...@gmail.com>.
If this is a one-off then Spark will do OK.

Sybase IQ provides bcp, which creates a flat file (tab or comma separated), and
you can use that to extract the IQ table, put the file on HDFS and create an
external table over it.

This is, of course, a one-off.
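
A rough sketch of that route in HiveQL (the HDFS path, delimiter and column
list are invented for the example; the export and copy steps are shown as
comments):

    -- After something like:
    --   bcp mydb..dummy out dummy.tsv -c ...        (export from IQ)
    --   hdfs dfs -put dummy.tsv /staging/iq/dummy/  (copy to HDFS)
    CREATE EXTERNAL TABLE dummy_staging (
      ID            BIGINT,
      RANDOM_STRING STRING,
      PADDING       STRING
    )
    ROW FORMAT DELIMITED
    FIELDS TERMINATED BY '\t'
    STORED AS TEXTFILE
    LOCATION '/staging/iq/dummy';

Once the external table reads correctly, a single INSERT ... SELECT can move
the data into the final partitioned table.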

You can also use SRS (SAP Replication Server) to get the data out the first
time and then sync the Hive table with the Sybase IQ table in real time. You
will need SRS SP 204 or above to make this work.

Talk to your DBA about whether they can get the SRS SP from Sybase for this
purpose. I have done it many times and I think it is stable enough.

HTH







Dr Mich Talebzadeh



LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw



http://talebzadehmich.wordpress.com


Disclaimer: Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




Re: Loading Sybase to Hive using Sqoop

Posted by Gopal Vijayaraghavan <go...@apache.org>.

> val d = HiveContext.read.format("jdbc").options(
...
>> The Sqoop job takes 7 hours to load 15 days of data, even with the direct
>> load option set to 6. Hive is using the MR framework.

In general, JDBC implementations tend to react rather badly to large
extracts like this - the throttling usually happens on the operational
database end rather than being a problem on the MR side.


Sqoop is good enough for a one-shot import, but doing it frequently is
best done by the database's own dump protocols, which are generally not
throttled similarly.

Pinterest recently put out a document on how they do this:

https://engineering.pinterest.com/blog/tracker-ingesting-mysql-data-scale-part-1

+
https://engineering.pinterest.com/blog/tracker-ingesting-mysql-data-scale-part-2

More interestingly, continuous ingestion reads directly off the replication
protocol's write-ahead logs:

https://github.com/Flipkart/MySQL-replication-listener/tree/master/examples/mysql2hdfs

+
https://github.com/flipkart-incubator/storm-mysql


But all of these tend to be optimized for a particular database engine, while
the JDBC pipe tends to work slowly for all engines.

Cheers,
Gopal



Re: Loading Sybase to Hive using Sqoop

Posted by Rahul Channe <dr...@googlemail.com>.
We are running Hive on MR.


Re: Loading Sybase to Hive using Sqoop

Posted by Mich Talebzadeh <mi...@gmail.com>.
Sybase IQ uses jconn4.jar for the JDBC connection. This is the way I use Spark
to get IQ data into a Hive table. You can specify partitions in Sqoop as well.

I started using Sqoop to populate Hive tables but decided to use Spark.

Also, are you running Hive on the MapReduce engine?

  // Assumes a HiveContext created beforehand, e.g.
  // val HiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
  private var dbURL = "jdbc:sybase:Tds:rhes564:21000/SYB_IQ"
  private var dbDatabase: String = null  // not used in this fragment
  private var dbUserName = "loader"
  private var dbPassword = "xxxxxxxx"

  private var partitionColumnName = "ID"  // assumed: a numeric column to split the reads on
  private var lowerBoundValue = "1"
  private var upperBoundValue = "100000000"
  private var numPartitionsValue = "100"  // number of parallel JDBC reads,
                                          // not the Hive table partition count

  // Get data from the IQ table
  val d = HiveContext.read.format("jdbc").options(
    Map("url" -> dbURL,
        "dbtable" -> "dummy",
        "partitionColumn" -> partitionColumnName,
        "lowerBound" -> lowerBoundValue,
        "upperBound" -> upperBoundValue,
        "numPartitions" -> numPartitionsValue,
        "user" -> dbUserName,
        "password" -> dbPassword)).load
  // Register it as a temp table
  d.registerTempTable("tmp")

Insert into the Hive table:

  val sqltext = """
  INSERT INTO TABLE dummy
  SELECT
          ID
        , CLUSTERED
        , SCATTERED
        , RANDOMISED
        , RANDOM_STRING
        , SMALL_VC
        , PADDING
  FROM tmp
  """
  HiveContext.sql(sqltext)
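
The INSERT above assumes the target table dummy already exists in Hive. A
minimal sketch of such a table (the column types are guesses, since the DDL is
not shown in this thread):

    CREATE TABLE IF NOT EXISTS dummy (
      ID            BIGINT,
      CLUSTERED     DOUBLE,
      SCATTERED     DOUBLE,
      RANDOMISED    DOUBLE,
      RANDOM_STRING STRING,
      SMALL_VC      STRING,
      PADDING       STRING
    )
    STORED AS ORC;

Note also that jconn4.jar must be on the Spark driver and executor classpath
(e.g. via the --jars option) for the JDBC read to work.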


HTH


Dr Mich Talebzadeh



LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw



http://talebzadehmich.wordpress.com


Disclaimer: Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.


