Posted to user@hive.apache.org by Sandeep Reddy P <sa...@gmail.com> on 2012/09/07 16:18:47 UTC

How to load csv data into HIVE

Hi,
Here is the sample data
"174969274","14-mar-2006","
3522876","","14-mar-2006","500000308","65","1"|
"174969275","19-jul-2006","3523154","","19-jul-2006","500000308","65","1"|
"174969276","31-dec-2005","3530333","","31-dec-2005","500000308","65","1"|
"174969277","14-apr-2005","3531470","","14-apr-2005","500000308","65","1"|

How to load this kind of data into HIVE?
I'm using a shell script to get rid of the double quotes and '|', but it's
taking a very long time to process each CSV, and each file is 12GB. What is
the best way to do this?



-- 
Thanks,
sandeep

Re: How to load csv data into HIVE

Posted by praveenesh kumar <pr...@gmail.com>.
Yup, Bejoy is correct :-) Just use Hadoop streaming for what it does best:
cleaning, transformations, and validations, in just a few simple steps.

Regards,
Praveenesh


Re: How to load csv data into HIVE

Posted by Bejoy KS <be...@yahoo.com>.
Hi Chuck

I believe Praveenesh was adding his thoughts to the discussion on preprocessing the data using MapReduce itself. If you go with Hadoop streaming you can use the Python script as the mapper, and that will do the preprocessing in parallel on the large volume of data. Then this preprocessed data can be loaded into the Hive table.



Regards
Bejoy KS

Sent from handheld, please excuse typos.
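
[To make the streaming suggestion concrete, here is a minimal sketch of what such a mapper could look like; the script name and HDFS paths below are invented for illustration, not from the thread. It reads raw records from stdin, re-assembles records that span physical lines, and emits plain tab-separated lines:]

```python
#!/usr/bin/env python
"""clean_mapper.py -- map-only Hadoop streaming sketch that turns the
quoted, '|'-terminated records into plain tab-separated lines."""
import csv
import io
import sys


def records(stream):
    """Yield one logical record at a time.

    A record is terminated by '|' and may span several physical lines
    (as in the sample data), so buffer lines until the terminator.
    """
    buf = ""
    for line in stream:
        buf += line.rstrip("\n")
        if buf.endswith("|"):
            yield buf[:-1]  # drop the '|' record terminator
            buf = ""


def clean(in_stream, out_stream):
    for rec in records(in_stream):
        # Parse the quoted CSV fields properly rather than blindly
        # deleting every double quote in the byte stream.
        fields = next(csv.reader(io.StringIO(rec)))
        out_stream.write("\t".join(fields) + "\n")


if __name__ == "__main__":
    clean(sys.stdin, sys.stdout)
```

[A map-only invocation would look roughly like `hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-streaming-*.jar -D mapred.reduce.tasks=0 -input /raw/csv -output /clean/tsv -mapper clean_mapper.py -file clean_mapper.py` (paths hypothetical). One caveat: streaming hands each mapper whole lines, but a record that spans an input-split boundary would still be cut in two, so this is only safe when splits fall on file boundaries or the multi-line records are repaired afterwards.]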


RE: How to load csv data into HIVE

Posted by "Connell, Chuck" <Ch...@nuance.com>.
I would like to hear more about this "hadoop streaming to Hive" idea. I have used streaming jobs as mappers, with a Python script as map.py. Are you saying that such a streaming mapper can load its output into Hive? Can you send some example code? Hive wants to load "files," not individual lines/records. How would you do this?

Thanks very much,
Chuck



Re: How to load csv data into HIVE

Posted by praveenesh kumar <pr...@gmail.com>.
You can use Hadoop streaming; that would be much faster. Just run your
cleaning shell script logic in the map phase and it will be done in just a
few minutes. That will also keep the data in HDFS.

Regards,
Praveenesh


Re: How to load csv data into HIVE

Posted by Sandeep Reddy P <sa...@gmail.com>.
Hi,
Thank you all for your help. I'll try both ways and I'll get back to you.

-- 
Thanks,
sandeep

Re: How to load csv data into HIVE

Posted by Mohammad Tariq <do...@gmail.com>.
I said this assuming that a Hadoop cluster is available, since Sandeep is
planning to use Hive. If that is the case, then MapReduce would be faster
for such large files.

Regards,
    Mohammad Tariq




RE: How to load csv data into HIVE

Posted by "Connell, Chuck" <Ch...@nuance.com>.
I cannot promise which is faster. A lot depends on how clever your scripts are.




Re: How to load csv data into HIVE

Posted by Mohammad Tariq <do...@gmail.com>.
Hello Sandeep,

  I would suggest you write a MapReduce job instead of the usual sequential
program to transform your files. It would be much faster. Then use Hive to
load the data.

Regards,
    Mohammad Tariq




Re: How to load csv data into HIVE

Posted by Sandeep Reddy P <sa...@gmail.com>.
Hi,
I wrote a shell script to clean the CSV data, but when I run that script on
a 12GB CSV it takes a long time. If I run a Python script, will that be faster?

-- 
Thanks,
sandeep

RE: How to load csv data into HIVE

Posted by "Connell, Chuck" <Ch...@nuance.com>.
How about a Python script that changes it into plain tab-separated text? So it would look like this...

174969274<tab>14-mar-2006<tab>3522876<tab> <tab>14-mar-2006<tab>500000308<tab>65<tab>1<newline>
etc...

Tab-separated with newlines is easy to read and works perfectly on import.

Chuck Connell
Nuance R&D Data Team
Burlington, MA
781-565-4611
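
[For the import side of this suggestion: once the files are plain tab-separated text in HDFS, declaring and loading the Hive table is direct. A sketch, with invented table, column, and path names, since the sample rows do not name their fields:]

```sql
CREATE TABLE sample_data (
  id          BIGINT,
  event_date  STRING,   -- '14-mar-2006' style dates kept as strings
  ref_id      BIGINT,
  filler      STRING,   -- the always-empty fourth field
  event_date2 STRING,
  account_id  BIGINT,
  code        INT,
  flag        INT
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE;

-- Point the load at the cleaned output directory in HDFS.
LOAD DATA INPATH '/clean/tsv' INTO TABLE sample_data;
```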


Re: How to load csv data into HIVE

Posted by Abhishek <ab...@gmail.com>.
So are you trying to get rid of the double quotes and the pipe symbol?

Regards
Abhi

Sent from my iPhone
