Posted to infrastructure-issues@apache.org by "Sebb (JIRA)" <ji...@apache.org> on 2015/10/22 23:49:27 UTC

[jira] [Resolved] (INFRA-10647) Need suggestions on processing JSON junk (e.g., invalid double quotes) data using HIVE

     [ https://issues.apache.org/jira/browse/INFRA-10647?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sebb resolved INFRA-10647.
--------------------------
    Resolution: Invalid

This is the wrong JIRA project; furthermore, JIRA is not the correct place for usage questions.

Please subscribe to the Hive user list and ask there:

http://hive.apache.org/mailing_lists.html

> Need suggestions on processing JSON junk (e.g., invalid double quotes) data using HIVE
> --------------------------------------------------------------------------------------
>
>                 Key: INFRA-10647
>                 URL: https://issues.apache.org/jira/browse/INFRA-10647
>             Project: Infrastructure
>          Issue Type: Bug
>            Reporter: Joel
>
> After streaming Twitter data to HDFS, I am trying to analyze it with Hive queries. The data is in JSON format but is not clean: it has double quotes (") in the wrong places, which cause the Hive queries to fail. I am getting the following error:
> Failed with exception java.io.IOException:org.apache.hadoop.hive.serde2.SerDeException: org.codehaus.jackson.JsonParseException: Unexpected end-of-input: was expecting closing '"' for name
> The script used for creating the external table:
> ADD JAR /usr/local/hive/apache-hive-1.2.1-bin/lib/hive-serdes-1.0-SNAPSHOT.jar;
> set hive.support.sql11.reserved.keywords = false;
> CREATE EXTERNAL TABLE tweets (
>   id BIGINT,
>   created_at STRING,
>   source STRING,
>   favorited BOOLEAN,
>   retweet_count INT,
>   retweeted_status STRUCT<
>     text:STRING,
>     user:STRUCT<screen_name:STRING,name:STRING>>,
>   entities STRUCT<
>     urls:ARRAY<STRUCT<expanded_url:STRING>>,
>     user_mentions:ARRAY<STRUCT<screen_name:STRING,name:STRING>>,
>     hashtags:ARRAY<STRUCT<text:STRING>>>,
>   text STRING,
>   user STRUCT<
>     screen_name:STRING,
>     name:STRING,
>     friends_count:INT,
>     followers_count:INT,
>     statuses_count:INT,
>     verified:BOOLEAN,
>     utc_offset:INT,
>     time_zone:STRING>,
>   in_reply_to_screen_name STRING
> )
> ROW FORMAT SERDE 'com.cloudera.hive.serde.JSONSerDe'
> LOCATION '/usr/local/hadoop/bin/tweets';
> Since I do not know which rows contain the extra double quotes, I cannot add an escape character by hand. How can I escape the junk characters and process the data successfully?
> I would appreciate any help.
> Thanks, Joel
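
One possible way to handle the malformed rows described above, offered as a sketch and not taken from this thread: pre-filter the raw files before Hive reads them, keeping only the lines that parse as JSON objects. The sketch assumes one tweet per line, which the issue does not confirm. The function names `is_valid_json_object` and `filter_valid` are hypothetical and are not part of Hive or of the SerDe named in the issue.

```python
import json
import sys


def is_valid_json_object(line):
    """Return True if the line parses as a single JSON object."""
    try:
        return isinstance(json.loads(line), dict)
    except ValueError:  # json.JSONDecodeError is a subclass of ValueError
        return False


def filter_valid(lines):
    """Split lines into (valid, rejected) lists, skipping blank lines.

    Assumption: each input line holds exactly one JSON record.
    """
    valid, rejected = [], []
    for line in lines:
        stripped = line.strip()
        if not stripped:
            continue
        if is_valid_json_object(stripped):
            valid.append(stripped)
        else:
            rejected.append(stripped)
    return valid, rejected


if __name__ == "__main__":
    # Read raw records from stdin, write the clean ones to stdout,
    # and report how many malformed lines were dropped on stderr.
    good, bad = filter_valid(sys.stdin)
    for record in good:
        print(record)
    print("rejected %d malformed line(s)" % len(bad), file=sys.stderr)
```

This drops malformed records instead of trying to repair them, because a stray double quote generally cannot be fixed unambiguously. The external table's LOCATION would then point at the filtered output, not the raw stream.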



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)