You are viewing a plain text version of this content. The canonical link for it is here.
Posted to user@hive.apache.org by Sunita Arvind <su...@gmail.com> on 2013/06/27 00:47:36 UTC

unable to partition the table

Hi,

I am unable to create a partitioned table.
The error I get is:
FAILED: ParseException line 37:16 mismatched input
'"jobs.values.postingDate.year"' expecting Identifier near '(' in column
specification

I tried referring to the columns in various ways,
S.jobs.values.postingDate.year, with quotes, without quotes, get the same
error. Also tried creating a partition by year alone. Still get the same
error.

Here is the create table statement:

create external table linkedin_JobSearch (
jobs STRUCT<
values : ARRAY<STRUCT<
company : STRUCT<
id : STRING,
name : STRING>,
postingDate : STRUCT<
day : STRING>,
descriptionSnippet : STRING,
expirationDate : STRUCT<
......
.......
locationDescription : STRING>>>
)
PARTITIONED BY ("jobs.values.postingDate.year" STRING,
"jobs.values.postingDate.month" STRING)
ROW FORMAT SERDE 'com.cloudera.hive.serde.JSONSerDe'
WITH SERDEPROPERTIES (
"company"="$.jobs.values.company.name",
"position"="$.jobs.values.position.title",
"customerJobCode"="$.jobs.values.customerJobCode",
"locationDescription"="$.jobs.values.locationDescription",
"jobPoster"="$.jobs.values.jobposter.headline"
)
LOCATION '/user/sunita/Linkedin/JobSearch';

I need to be able to partition this information. Please help.

regards
Sunita

Re: unable to partition the table

Posted by Stephen Sprague <sp...@gmail.com>.
sorry for the spam. but one correction.  json_pp appears to be part of the
perl core so if you have a modern perl you may already have it installed.


On Mon, Jul 1, 2013 at 1:31 PM, Stephen Sprague <sp...@gmail.com> wrote:

> ok. so i just learned of a perl script called "json_pp"  (for json
> pretty_prrint) that is a binary included in the distro for the perl JSON
> module.  You gotta figure there are analogous tools in other languages as
> well but given this is a binary it doesn't matter much what language its
> written in - it just works.
>
> so doing this:  $ hive -e 'select <your_json_column> from <your_table>' |
> json_pp
>
> you'd actually be able to see what your json looks like in a human
> readable fashion.
>
> get the binary (and the module) here: https://metacpan.org/module/JSON
>
>
> On Wed, Jun 26, 2013 at 4:38 PM, Sunita Arvind <su...@gmail.com>wrote:
>
>> Ok. Thanks Stephen. I will try that out.
>> Will update the group if I am able to get this to work. For now, I will
>> continue with non-partitioned table.
>>
>> regards
>> Sunita
>>
>>
>> On Wed, Jun 26, 2013 at 7:11 PM, Stephen Sprague <sp...@gmail.com>wrote:
>>
>>> it would appear to be that you may partition only by non-nested
>>> columns.  I would recommend transforming your original dataset into one
>>> where the first column is YYYYMM and the rest is your json object.  During
>>> this transformation you may also wish to make further optimizations as well
>>> since you'll be scanning every record.
>>>
>>> as always my 2 cents only.
>>>
>>>
>>> On Wed, Jun 26, 2013 at 3:47 PM, Sunita Arvind <su...@gmail.com>wrote:
>>>
>>>> Hi,
>>>>
>>>> I am unable to create a partitioned table.
>>>> The error I get is:
>>>> FAILED: ParseException line 37:16 mismatched input
>>>> '"jobs.values.postingDate.year"' expecting Identifier near '(' in column
>>>> specification
>>>>
>>>> I tried referring to the columns in various ways,
>>>> S.jobs.values.postingDate.year, with quotes, without quotes, get the same
>>>> error. Also tried creating a partition by year alone. Still get the same
>>>> error.
>>>>
>>>> Here is the create table statement:
>>>>
>>>> create external table linkedin_JobSearch (
>>>> jobs STRUCT<
>>>> values : ARRAY<STRUCT<
>>>> company : STRUCT<
>>>> id : STRING,
>>>> name : STRING>,
>>>> postingDate : STRUCT<
>>>> day : STRING>,
>>>> descriptionSnippet : STRING,
>>>> expirationDate : STRUCT<
>>>> ......
>>>> .......
>>>> locationDescription : STRING>>>
>>>> )
>>>> PARTITIONED BY ("jobs.values.postingDate.year" STRING,
>>>> "jobs.values.postingDate.month" STRING)
>>>> ROW FORMAT SERDE 'com.cloudera.hive.serde.JSONSerDe'
>>>> WITH SERDEPROPERTIES (
>>>> "company"="$.jobs.values.company.name",
>>>>  "position"="$.jobs.values.position.title",
>>>> "customerJobCode"="$.jobs.values.customerJobCode",
>>>> "locationDescription"="$.jobs.values.locationDescription",
>>>> "jobPoster"="$.jobs.values.jobposter.headline"
>>>> )
>>>> LOCATION '/user/sunita/Linkedin/JobSearch';
>>>>
>>>> I need to be able to partition this information. Please help.
>>>>
>>>> regards
>>>> Sunita
>>>>
>>>
>>>
>>
>

Re: unable to partition the table

Posted by Stephen Sprague <sp...@gmail.com>.
ok. so i just learned of a perl script called "json_pp"  (for json
pretty_prrint) that is a binary included in the distro for the perl JSON
module.  You gotta figure there are analogous tools in other languages as
well but given this is a binary it doesn't matter much what language its
written in - it just works.

so doing this:  $ hive -e 'select <your_json_column> from <your_table>' |
json_pp

you'd actually be able to see what your json looks like in a human readable
fashion.

get the binary (and the module) here: https://metacpan.org/module/JSON


On Wed, Jun 26, 2013 at 4:38 PM, Sunita Arvind <su...@gmail.com>wrote:

> Ok. Thanks Stephen. I will try that out.
> Will update the group if I am able to get this to work. For now, I will
> continue with non-partitioned table.
>
> regards
> Sunita
>
>
> On Wed, Jun 26, 2013 at 7:11 PM, Stephen Sprague <sp...@gmail.com>wrote:
>
>> it would appear to be that you may partition only by non-nested columns.
>> I would recommend transforming your original dataset into one where the
>> first column is YYYYMM and the rest is your json object.  During this
>> transformation you may also wish to make further optimizations as well
>> since you'll be scanning every record.
>>
>> as always my 2 cents only.
>>
>>
>> On Wed, Jun 26, 2013 at 3:47 PM, Sunita Arvind <su...@gmail.com>wrote:
>>
>>> Hi,
>>>
>>> I am unable to create a partitioned table.
>>> The error I get is:
>>> FAILED: ParseException line 37:16 mismatched input
>>> '"jobs.values.postingDate.year"' expecting Identifier near '(' in column
>>> specification
>>>
>>> I tried referring to the columns in various ways,
>>> S.jobs.values.postingDate.year, with quotes, without quotes, get the same
>>> error. Also tried creating a partition by year alone. Still get the same
>>> error.
>>>
>>> Here is the create table statement:
>>>
>>> create external table linkedin_JobSearch (
>>> jobs STRUCT<
>>> values : ARRAY<STRUCT<
>>> company : STRUCT<
>>> id : STRING,
>>> name : STRING>,
>>> postingDate : STRUCT<
>>> day : STRING>,
>>> descriptionSnippet : STRING,
>>> expirationDate : STRUCT<
>>> ......
>>> .......
>>> locationDescription : STRING>>>
>>> )
>>> PARTITIONED BY ("jobs.values.postingDate.year" STRING,
>>> "jobs.values.postingDate.month" STRING)
>>> ROW FORMAT SERDE 'com.cloudera.hive.serde.JSONSerDe'
>>> WITH SERDEPROPERTIES (
>>> "company"="$.jobs.values.company.name",
>>>  "position"="$.jobs.values.position.title",
>>> "customerJobCode"="$.jobs.values.customerJobCode",
>>> "locationDescription"="$.jobs.values.locationDescription",
>>> "jobPoster"="$.jobs.values.jobposter.headline"
>>> )
>>> LOCATION '/user/sunita/Linkedin/JobSearch';
>>>
>>> I need to be able to partition this information. Please help.
>>>
>>> regards
>>> Sunita
>>>
>>
>>
>

Re: unable to partition the table

Posted by Sunita Arvind <su...@gmail.com>.
Ok. Thanks Stephen. I will try that out.
Will update the group if I am able to get this to work. For now, I will
continue with non-partitioned table.

regards
Sunita


On Wed, Jun 26, 2013 at 7:11 PM, Stephen Sprague <sp...@gmail.com> wrote:

> it would appear to be that you may partition only by non-nested columns.
> I would recommend transforming your original dataset into one where the
> first column is YYYYMM and the rest is your json object.  During this
> transformation you may also wish to make further optimizations as well
> since you'll be scanning every record.
>
> as always my 2 cents only.
>
>
> On Wed, Jun 26, 2013 at 3:47 PM, Sunita Arvind <su...@gmail.com>wrote:
>
>> Hi,
>>
>> I am unable to create a partitioned table.
>> The error I get is:
>> FAILED: ParseException line 37:16 mismatched input
>> '"jobs.values.postingDate.year"' expecting Identifier near '(' in column
>> specification
>>
>> I tried referring to the columns in various ways,
>> S.jobs.values.postingDate.year, with quotes, without quotes, get the same
>> error. Also tried creating a partition by year alone. Still get the same
>> error.
>>
>> Here is the create table statement:
>>
>> create external table linkedin_JobSearch (
>> jobs STRUCT<
>> values : ARRAY<STRUCT<
>> company : STRUCT<
>> id : STRING,
>> name : STRING>,
>> postingDate : STRUCT<
>> day : STRING>,
>> descriptionSnippet : STRING,
>> expirationDate : STRUCT<
>> ......
>> .......
>> locationDescription : STRING>>>
>> )
>> PARTITIONED BY ("jobs.values.postingDate.year" STRING,
>> "jobs.values.postingDate.month" STRING)
>> ROW FORMAT SERDE 'com.cloudera.hive.serde.JSONSerDe'
>> WITH SERDEPROPERTIES (
>> "company"="$.jobs.values.company.name",
>>  "position"="$.jobs.values.position.title",
>> "customerJobCode"="$.jobs.values.customerJobCode",
>> "locationDescription"="$.jobs.values.locationDescription",
>> "jobPoster"="$.jobs.values.jobposter.headline"
>> )
>> LOCATION '/user/sunita/Linkedin/JobSearch';
>>
>> I need to be able to partition this information. Please help.
>>
>> regards
>> Sunita
>>
>
>

Re: unable to partition the table

Posted by Stephen Sprague <sp...@gmail.com>.
it would appear to be that you may partition only by non-nested columns.  I
would recommend transforming your original dataset into one where the first
column is YYYYMM and the rest is your json object.  During this
transformation you may also wish to make further optimizations as well
since you'll be scanning every record.

as always my 2 cents only.


On Wed, Jun 26, 2013 at 3:47 PM, Sunita Arvind <su...@gmail.com>wrote:

> Hi,
>
> I am unable to create a partitioned table.
> The error I get is:
> FAILED: ParseException line 37:16 mismatched input
> '"jobs.values.postingDate.year"' expecting Identifier near '(' in column
> specification
>
> I tried referring to the columns in various ways,
> S.jobs.values.postingDate.year, with quotes, without quotes, get the same
> error. Also tried creating a partition by year alone. Still get the same
> error.
>
> Here is the create table statement:
>
> create external table linkedin_JobSearch (
> jobs STRUCT<
> values : ARRAY<STRUCT<
> company : STRUCT<
> id : STRING,
> name : STRING>,
> postingDate : STRUCT<
> day : STRING>,
> descriptionSnippet : STRING,
> expirationDate : STRUCT<
> ......
> .......
> locationDescription : STRING>>>
> )
> PARTITIONED BY ("jobs.values.postingDate.year" STRING,
> "jobs.values.postingDate.month" STRING)
> ROW FORMAT SERDE 'com.cloudera.hive.serde.JSONSerDe'
> WITH SERDEPROPERTIES (
> "company"="$.jobs.values.company.name",
>  "position"="$.jobs.values.position.title",
> "customerJobCode"="$.jobs.values.customerJobCode",
> "locationDescription"="$.jobs.values.locationDescription",
> "jobPoster"="$.jobs.values.jobposter.headline"
> )
> LOCATION '/user/sunita/Linkedin/JobSearch';
>
> I need to be able to partition this information. Please help.
>
> regards
> Sunita
>