Posted to user@hive.apache.org by Suhail Doshi <su...@mixpanel.com> on 2009/04/04 07:26:50 UTC

Python + Hive + Load Data

I seem to be having problems with LOAD DATA with a file on my local system,
trying to get it into Hive:

li57-125 ~/test: python hive_test.py
Connecting to HiveServer....
Opening transport...
LOAD DATA LOCAL INPATH '/home/hadoop/test/page_view.log.2' INTO TABLE
page_views
Traceback (most recent call last):
  File "hive_test.py", line 36, in <module>
    c.client.execute(query)
  File "/home/hadoop/hive/build/dist/lib/py/hive_service/ThriftHive.py",
line 42, in execute
    self.recv_execute()
  File "/home/hadoop/hive/build/dist/lib/py/hive_service/ThriftHive.py",
line 63, in recv_execute
    raise result.ex
hive_service.ttypes.HiveServerException: {}

The same query works fine through the hive client but doesn't work through
the python file. Executing a query through the python client works fine as
long as it's not a LOAD DATA. Unfortunately, the exception carries no
message describing why it's occurring.
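For reference, the statement being sent can be built and inspected before it goes over Thrift. The helper below is an illustrative sketch; the function name and the commented-out client wiring are assumptions, not the original hive_test.py:

```python
# Illustrative helper for composing the LOAD DATA statement the script
# sends to HiveServer. Only the statement syntax comes from the thread.

def build_load_data(path, table, local=True, overwrite=False):
    """Build a Hive LOAD DATA statement for a file path and target table."""
    parts = ["LOAD DATA"]
    if local:
        parts.append("LOCAL")
    parts.append("INPATH '%s'" % path)
    if overwrite:
        parts.append("OVERWRITE")
    parts.append("INTO TABLE %s" % table)
    return " ".join(parts)

query = build_load_data('/home/hadoop/test/page_view.log.2', 'page_views')
print(query)
# LOAD DATA LOCAL INPATH '/home/hadoop/test/page_view.log.2' INTO TABLE page_views

# Submission would then go through the Thrift client, roughly:
#   c = make_hive_client('localhost', 10000)  # hypothetical wrapper
#   c.client.execute(query)
```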

-- 
http://mixpanel.com
Blog: http://blog.mixpanel.com

Re: Python + Hive + Load Data

Posted by Suhail Doshi <di...@gmail.com>.
A little digging into problem #1: if you don't specify the partitions, it
won't work.

Digging into question #1: it seems you can't get around this directly, but
the workaround is to use partitions and INSERT OVERWRITE into a partition
that is dynamically created based on a timestamp, which effectively appends
data to the table.

Question #2 is probably impossible at the moment :-(.
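The timestamp-based partition workaround can be sketched in Python. The table and column names follow the thread, but the `dt` partition column and the helper functions themselves are illustrative assumptions:

```python
# Sketch of appending via partitions: derive a daily partition value from
# a unix timestamp, then INSERT OVERWRITE only that partition, leaving the
# table's other partitions untouched (an append in effect).
from datetime import datetime, timezone

def daily_partition(ts):
    """yyyy-mm-dd partition value for a unix timestamp, in UTC."""
    return datetime.fromtimestamp(ts, tz=timezone.utc).strftime('%Y-%m-%d')

def append_via_partition(ts):
    """Build an INSERT OVERWRITE statement targeting one daily partition."""
    dt = daily_partition(ts)
    return (
        "FROM page_views_stage pvs "
        "INSERT OVERWRITE TABLE page_views_prod PARTITION(dt='%s') "
        "SELECT pvs.project_code, pvs.page, pvs.referrer, pvs.ip, pvs.created "
        "WHERE to_date(from_unixtime(pvs.created)) = '%s'" % (dt, dt)
    )
```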

Suhail


Re: Python + Hive + Load Data

Posted by Suhail Doshi <di...@gmail.com>.
Just a few issues:

Problem #1:
Ended Job = job_200904142000_0003
Loading data to table page_views_prod
10688 Rows loaded to page_views_prod
OK
Time taken: 8.736 seconds
hive> select * from page_views_prod;
OK
Time taken: 0.126 seconds

Seems a bit odd that there are no rows even though it said 10k were loaded.

Question #1:
For: FROM page_views_stage pvs INSERT OVERWRITE TABLE page_views_prod
SELECT pvs.project_code, pvs.page, pvs.referrer, pvs.ip, pvs.created
WHERE pvs.created > 1235948360

Is it possible to not use OVERWRITE? I want to load data from the staging
table into the production table, but I want it to append the data, not
erase the existing data.

Question #2:
For: FROM page_views_stage pvs INSERT TABLE page_views_prod
PARTITION(p_code=pvs.project_code, dt=to_date(from_unixtime(pvs.created)))

Is there any way to do the partitions dynamically if you reference a table?

Thank you in advance,
Suhail



Re: Python + Hive + Load Data

Posted by Edward Capriolo <ed...@gmail.com>.
>>So evidentially LOAD DATA actually just copies a file to hdfs. What is the solution if you have thousands of files and attempt a hive query because my understanding is that this will be dead slow later.

Loading thousands of files is very slow. I have an application that reads
9000 small text files from a web server. I was planning to write them with
a BufferedWriter over an FSDataOutputStream, but my load time was over 9
hours. I can't say whether a DFS copy is much faster. I took a look at what
Nutch does to handle this situation: Nutch allocates single-threaded
MapRunners on several nodes and emits NutchDatum. Kind of a crazy way to
load in 9000 files :(

Merging the smaller files locally might help as well.
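The merge suggestion can be sketched as follows; the paths and helper name are illustrative, not from the thread:

```python
# Concatenate many small log files into one large file locally, so a
# single LOAD DATA (and later map tasks) deals with one big file instead
# of thousands of tiny ones.
import glob

def merge_files(pattern, out_path):
    """Concatenate every file matching pattern (sorted) into out_path."""
    with open(out_path, 'wb') as out:
        for name in sorted(glob.glob(pattern)):
            with open(name, 'rb') as src:
                out.write(src.read())
    return out_path

# merge_files('/home/hadoop/logs/page_view.log.*', '/tmp/page_view.merged')
# followed by a single:
#   LOAD DATA LOCAL INPATH '/tmp/page_view.merged' INTO TABLE page_views
```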


Re: Python + Hive + Load Data

Posted by Suhail Doshi <di...@gmail.com>.
Seems like you have to do something like this?

CREATE EXTERNAL TABLE page_view_stg(viewTime INT, userid BIGINT,
                    page_url STRING, referrer_url STRING,
                    ip STRING COMMENT 'IP Address of the User',
                    country STRING COMMENT 'country of origination')
    COMMENT 'This is the staging page view table'
    ROW FORMAT DELIMITED FIELDS TERMINATED BY '54' LINES TERMINATED BY '12'
    STORED AS TEXTFILE
    LOCATION '/user/data/staging/page_view';

    hadoop dfs -put /tmp/pv_2008-06-08.txt /user/data/staging/page_view

    FROM page_view_stg pvs
    INSERT OVERWRITE TABLE page_view PARTITION(dt='2008-06-08', country='US')
    SELECT pvs.viewTime, pvs.userid, pvs.page_url, pvs.referrer_url,
null, null, pvs.ip
    WHERE pvs.country = 'US';



On Tue, Apr 7, 2009 at 12:37 PM, Suhail Doshi <di...@gmail.com>wrote:

> So evidentially LOAD DATA actually just copies a file to hdfs. What is the
> solution if you have thousands of files and attempt a hive query because my
> understanding is that this will be dead slow later.
>
> Suhail
>
>
> On Sun, Apr 5, 2009 at 10:52 AM, Suhail Doshi <di...@gmail.com>wrote:
>
>> Ragu,
>>
>> I managed to get it working; it seems there were just some inconsistencies
>> between the metastore_db I was using in the client and the one the python
>> script was using.
>>
>> I should just always use python from now on to make changes to
>> metastore_db, instead of copying it around and using the hive client.
>>
>> Suhail
>>
>>
>> On Sun, Apr 5, 2009 at 10:44 AM, Suhail Doshi <di...@gmail.com>wrote:
>>
>>> Oh nevermind, of course python is using the metastore_db that the hive
>>> service is using.
>>>
>>> Suhail
>>>
>>>
>>> On Sun, Apr 5, 2009 at 10:42 AM, Suhail Doshi <di...@gmail.com>wrote:
>>>
>>>> This is kind of odd, it's like it's not using the same metastore_db:
>>>>
>>>> li57-125 ~/test: ls
>>>> derby.log  hive_test.py  hive_test.pyc    metastore_db  page_view.log.2
>>>>
>>>> li57-125 ~/test: hive
>>>> Hive history
>>>> file=/tmp/hadoop/hive_job_log_hadoop_200904051740_1405686854.txt
>>>> hive> select count(1) from page_views;
>>>> Total MapReduce jobs = 1
>>>> Number of reduce tasks determined at compile time: 1
>>>> In order to change the average load for a reducer (in bytes):
>>>>   set hive.exec.reducers.bytes.per.reducer=<number>
>>>> In order to limit the maximum number of reducers:
>>>>   set hive.exec.reducers.max=<number>
>>>> In order to set a constant number of reducers:
>>>>   set mapred.reduce.tasks=<number>
>>>> Job need not be submitted: no output: Success
>>>> OK
>>>> Time taken: 4.909 seconds
>>>>
>>>> li57-125 ~/test: python hive_test.py
>>>> Connecting to HiveServer....
>>>> Opening transport...
>>>> select count(1) from page_views
>>>> Number of rows:  ['20297']
>>>>
>>>>
>>>>
>>>> On Sat, Apr 4, 2009 at 11:02 PM, Suhail Doshi <digitalwarfare@gmail.com
>>>> > wrote:
>>>>
>>>>> No logs are generated when I run the python file in /tmp/hadoop/
>>>>>
>>>>> Suhail
>>>>>
>>>>>
>>>>> On Sat, Apr 4, 2009 at 10:38 PM, Raghu Murthy <rm...@facebook.com>wrote:
>>>>>
>>>>>> Is there no entry in the server logs about the error?
>>>>>>
>>>>>>
>>>>>> On 4/4/09 10:24 PM, "Suhail Doshi" <di...@gmail.com> wrote:
>>>>>>
>>>>>> > I am running the hive server and hadoop on the same server as the
>>>>>> file. I am
>>>>>> > also running the python script and hive server under the same user
>>>>>> and the
>>>>>> > file is located in a directory this user owns.
>>>>>> >
>>>>>> > I am not sure why it's not loading it still.
>>>>>> >
>>>>>> > Suhail
>>>>>> >
>>>>>> > On Sat, Apr 4, 2009 at 10:14 PM, Raghu Murthy <rm...@facebook.com>
>>>>>> wrote:
>>>>>> >> Is the file accessible to the HiveServer? We currently don't ship
>>>>>> the file
>>>>>> >> from the client machine to the server machine.
>>>>>> >>
>>>>>> >>
>>>>>> >> On 4/3/09 10:26 PM, "Suhail Doshi" <su...@mixpanel.com> wrote:
>>>>>> >>
>>>>>> >>>> I seem to be having problems with LOAD DATA with a file on my
>>>>>> local system
>>>>>> >>>> trying get it into hive:
>>>>>> >>>>
>>>>>> >>>> li57-125 ~/test: python hive_test.py
>>>>>> >>>> Connecting to HiveServer....
>>>>>> >>>> Opening transport...
>>>>>> >>>> LOAD DATA LOCAL INPATH '/home/hadoop/test/page_view.log.2' INTO
>>>>>> TABLE
>>>>>> >>>> page_views
>>>>>> >>>> Traceback (most recent call last):
>>>>>> >>>>   File "hive_test.py", line 36, in <module>
>>>>>> >>>>     c.client.execute(query)
>>>>>> >>>>   File
>>>>>> "/home/hadoop/hive/build/dist/lib/py/hive_service/ThriftHive.py",
>>>>>> >>> line
>>>>>> >>>> 42, in execute
>>>>>> >>>>     self.recv_execute()
>>>>>> >>>>   File
>>>>>> "/home/hadoop/hive/build/dist/lib/py/hive_service/ThriftHive.py",
>>>>>> >>> line
>>>>>> >>>> 63, in recv_execute
>>>>>> >>>>     raise result.ex
>>>>>> >>>> hive_service.ttypes.HiveServerException: {}
>>>>>> >>>>
>>>>>> >>>> The same query works fine through the hive client but doesn't
>>>>>> seem to work
>>>>>> >>>> through the python file. Executing a query through the python
>>>>>> client works
>>>>>> >>>> fine if it's not a LOAD DATA. Unfortunately, I wish there was a
>>>>>> better
>>>>>> >>> message
>>>>>> >>>> to describe why the exception is occurring.
>>>>>> >>
>>>>>> >
>>>>>> >
>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> http://mixpanel.com
>>>>> Blog: http://blog.mixpanel.com
>>>>>
>>>>
>>>>
>>>>
>>>> --
>>>> http://mixpanel.com
>>>> Blog: http://blog.mixpanel.com
>>>>
>>>
>>>
>>>
>>> --
>>> http://mixpanel.com
>>> Blog: http://blog.mixpanel.com
>>>
>>
>>
>>
>> --
>> http://mixpanel.com
>> Blog: http://blog.mixpanel.com
>>
>
>
>
> --
> http://mixpanel.com
> Blog: http://blog.mixpanel.com
>




Re: Python + Hive + Load Data

Posted by Suhail Doshi <di...@gmail.com>.
So evidently LOAD DATA actually just copies a file to HDFS. What is the
solution if you have thousands of files and then attempt a Hive query? My
understanding is that this will be dead slow later.

Suhail


-- 
http://mixpanel.com
Blog: http://blog.mixpanel.com

Re: Python + Hive + Load Data

Posted by Suhail Doshi <di...@gmail.com>.
Ragu,

I managed to get it working; it seems there were just inconsistencies between the metastore_db the Hive client was using and the one the Python script was using.

I should just always use Python from now on to make changes to metastore_db, instead of copying it around and using the Hive client.

Suhail
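(Background on the mismatch described here: with the embedded Derby metastore, Hive creates `metastore_db` in whatever directory the process is started from, so the CLI and the HiveServer can silently end up with different metastores. One hedged fix, assuming a Derby-backed setup, is to pin every Hive process to the same metastore path in `hive-site.xml`; the `/home/hadoop/metastore_db` path below is just an example.)

```xml
<!-- hive-site.xml: point every Hive process at one Derby metastore.
     The databaseName path is an example; use any shared, writable
     location so the CLI and HiveServer see the same tables. -->
<property>
  <name>javax.jdo.option.ConnectionURL</name>
  <value>jdbc:derby:;databaseName=/home/hadoop/metastore_db;create=true</value>
</property>
```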


-- 
http://mixpanel.com
Blog: http://blog.mixpanel.com

Re: Python + Hive + Load Data

Posted by Suhail Doshi <di...@gmail.com>.
Oh, never mind, of course Python is using the metastore_db that the Hive
service is using.

Suhail


-- 
http://mixpanel.com
Blog: http://blog.mixpanel.com

Re: Python + Hive + Load Data

Posted by Suhail Doshi <di...@gmail.com>.
This is kind of odd, it's like it's not using the same metastore_db:

li57-125 ~/test: ls
derby.log  hive_test.py  hive_test.pyc    metastore_db  page_view.log.2

li57-125 ~/test: hive
Hive history
file=/tmp/hadoop/hive_job_log_hadoop_200904051740_1405686854.txt
hive> select count(1) from page_views;
Total MapReduce jobs = 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapred.reduce.tasks=<number>
Job need not be submitted: no output: Success
OK
Time taken: 4.909 seconds

li57-125 ~/test: python hive_test.py
Connecting to HiveServer....
Opening transport...
select count(1) from page_views
Number of rows:  ['20297']



-- 
http://mixpanel.com
Blog: http://blog.mixpanel.com

Re: Python + Hive + Load Data

Posted by Suhail Doshi <di...@gmail.com>.
No logs are generated when I run the python file in /tmp/hadoop/

Suhail


-- 
http://mixpanel.com
Blog: http://blog.mixpanel.com

Re: Python + Hive + Load Data

Posted by Raghu Murthy <rm...@facebook.com>.
Is there no entry in the server logs about the error?


On 4/4/09 10:24 PM, "Suhail Doshi" <di...@gmail.com> wrote:

> I am running the hive server and hadoop on the same server as the file. I am
> also running the python script and hive server under the same user and the
> file is located in a directory this user owns.
> 
> I am not sure why it's not loading it still.
> 
> Suhail
> 
> On Sat, Apr 4, 2009 at 10:14 PM, Raghu Murthy <rm...@facebook.com> wrote:
>> Is the file accessible to the HiveServer? We currently don't ship the file
>> from the client machine to the server machine.
>> 
>> 
>> On 4/3/09 10:26 PM, "Suhail Doshi" <su...@mixpanel.com> wrote:
>> 
>>>> I seem to be having problems with LOAD DATA with a file on my local system
>>>> trying get it into hive:
>>>> 
>>>> li57-125 ~/test: python hive_test.py
>>>> Connecting to HiveServer....
>>>> Opening transport...
>>>> LOAD DATA LOCAL INPATH '/home/hadoop/test/page_view.log.2' INTO TABLE
>>>> page_views
>>>> Traceback (most recent call last):
>>>>   File "hive_test.py", line 36, in <module>
>>>>     c.client.execute(query)
>>>>   File "/home/hadoop/hive/build/dist/lib/py/hive_service/ThriftHive.py",
>>> line
>>>> 42, in execute
>>>>     self.recv_execute()
>>>>   File "/home/hadoop/hive/build/dist/lib/py/hive_service/ThriftHive.py",
>>> line
>>>> 63, in recv_execute
>>>>     raise result.ex
>>>> hive_service.ttypes.HiveServerException: {}
>>>> 
>>>> The same query works fine through the hive client but doesn't seem to work
>>>> through the python file. Executing a query through the python client works
>>>> fine if it's not a LOAD DATA. Unfortunately, I wish there was a better
>>> message
>>>> to describe why the exception is occurring.
>> 
> 
> 


Re: Python + Hive + Load Data

Posted by Suhail Doshi <di...@gmail.com>.
I am running the Hive server and Hadoop on the same machine as the file. I am
also running the Python script and the Hive server under the same user, and the
file is located in a directory this user owns.

I am not sure why it's still not loading.
Suhail


-- 
http://mixpanel.com
Blog: http://blog.mixpanel.com

Re: Python + Hive + Load Data

Posted by Raghu Murthy <rm...@facebook.com>.
Is the file accessible to the HiveServer? We currently don't ship the file
from the client machine to the server machine.
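(Since the server executes the statement against its own filesystem, one way to get a clearer failure than the empty HiveServerException is to verify the path before sending the query. The helper below is a hypothetical sketch, not part of the Hive client API; the existence check is only meaningful when the client runs on the same machine as the HiveServer, as in this thread.)

```python
import os

def build_load_data(path, table, local=True):
    """Build a LOAD DATA statement, first checking that the file is
    visible from this process. HiveServer resolves a LOCAL INPATH on
    its own machine, so this pre-check only helps when client and
    server share a filesystem."""
    if local and not os.path.exists(path):
        raise IOError("file not found (HiveServer would fail too): %s" % path)
    scope = "LOCAL " if local else ""
    return "LOAD DATA %sINPATH '%s' INTO TABLE %s" % (scope, path, table)
```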


On 4/3/09 10:26 PM, "Suhail Doshi" <su...@mixpanel.com> wrote:

> I seem to be having problems with LOAD DATA with a file on my local system
> trying get it into hive:
> 
> li57-125 ~/test: python hive_test.py
> Connecting to HiveServer....
> Opening transport...
> LOAD DATA LOCAL INPATH '/home/hadoop/test/page_view.log.2' INTO TABLE
> page_views
> Traceback (most recent call last):
>   File "hive_test.py", line 36, in <module>
>     c.client.execute(query)
>   File "/home/hadoop/hive/build/dist/lib/py/hive_service/ThriftHive.py", line
> 42, in execute
>     self.recv_execute()
>   File "/home/hadoop/hive/build/dist/lib/py/hive_service/ThriftHive.py", line
> 63, in recv_execute
>     raise result.ex
> hive_service.ttypes.HiveServerException: {}
> 
> The same query works fine through the hive client but doesn't seem to work
> through the python file. Executing a query through the python client works
> fine if it's not a LOAD DATA. Unfortunately, I wish there was a better message
> to describe why the exception is occurring.