Posted to user@hive.apache.org by Sunita Arvind <su...@gmail.com> on 2013/08/01 17:54:26 UTC

Re: Select statements return null

Thanks for your help Matouk,
I am using a JSON serde -
http://files.cloudera.com/samples/hive-serdes-1.0-SNAPSHOT.jar
(mentioned on this page - https://github.com/cloudera/cdh-twitter-example)

Attached is the table definition. I have tried with the fetch_xx fields in
the input file and table and also tried without them. The logs usually are
useful when there is an error in parsing. If there are no errors, the logs
don't show me anything. Or am I missing something here?
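
One way to rule out bad input before it reaches Hive is to validate the files locally: a line-oriented JSON serde expects exactly one JSON object per physical line. A minimal sketch of such a check (the file path is a placeholder):

```python
import json

def find_bad_lines(path):
    """Return (line_number, reason) pairs for lines that fail to parse
    as a single JSON object -- the format a line-oriented serde expects."""
    bad = []
    with open(path) as f:
        for lineno, line in enumerate(f, start=1):
            line = line.strip()
            if not line:
                continue  # skip blank lines
            try:
                obj = json.loads(line)
                if not isinstance(obj, dict):
                    bad.append((lineno, "not a JSON object"))
            except ValueError as e:
                bad.append((lineno, str(e)))
    return bad
```

Running this over the generated files before the HDFS upload would show whether any record is split across lines or otherwise malformed, even when Hive itself reports nothing.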

I am also attaching 2 output samples. One of them (xxx20130729_0) has the
profileid and fetch_year, fetch_month, fetch_day and fetch_time added. I
need to get this working. For debugging purposes, I tried without these
fields (xx09July2013_20) and loading just the response json without
manipulations. While using this, I used the create table without the
corresponding fields in the definition as well. The results are not
consistent with this.

Is there a way (a debug flag) to make the jackson parser emit the current
token?

Sunita





On Wed, Jul 31, 2013 at 4:06 AM, Matouk IFTISSEN <matouk.iftissen@ysance.com
> wrote:

> Hello Sunita,
>
> If you are using JSON, try adding the jar 'hive-json-serde.jar' before you
> load your data into the final table. Also try declaring your date
> attributes as String first, to check whether they are the cause.
>
> Are you using an external table with regular expressions (regexp) to parse
> your data? If so, can you send us the table definition and the structure
> of a row from your data?
> The last thing I can suggest is to run a MapReduce operation over the
> table (select count(1) from your_table) and then check the jobtracker log
> to debug the issue.
>
> Hope this helps ;)
>
>
>
>
> 2013/7/30 Sunita Arvind <su...@gmail.com>
>
>> Hi,
>>
>> I have written a script which generates JSON files, loads them into a
>> dictionary, adds a few attributes, and uploads the modified files to HDFS.
>> After the files are generated, if I perform a select * from ...; on the
>> table which points to this location, I get "null, null, ..." as the
>> result. I also tried without the added attributes and it did not make a
>> difference. I strongly suspect the data.
>> Currently I am using strip() to eliminate trailing and leading whitespace
>> and newlines. I am wondering whether embedded "\n" characters, that is,
>> JSON string values containing "\n", cause such issues.
>> There are no parsing errors, so I am not able to debug this issue. Are
>> there any flags I can set to figure out what is happening within the
>> parser code?
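
A quick check on the embedded-newline theory: json.dumps escapes a newline inside a value as \n (two characters), so a properly serialized record always stays on one physical line. Only a raw, unescaped newline written into the file would split a record across two lines, each of which then fails to parse. A small sketch of both cases:

```python
import json

record = {"text": "line one\nline two", "id": 1}
serialized = json.dumps(record)

# json.dumps escapes the newline, so the record stays on one physical line
# and round-trips cleanly.
assert "\n" not in serialized
assert json.loads(serialized)["text"] == "line one\nline two"

# By contrast, a raw newline written into the file splits the record; a
# line-oriented reader sees only the first fragment, which is not valid JSON.
broken = '{"text": "line one\nline two", "id": 1}'
first_physical_line = broken.split("\n")[0]
try:
    json.loads(first_physical_line)
    fragment_parses = True
except ValueError:
    fragment_parses = False
assert not fragment_parses
```

So if the script builds records with json.dumps, embedded "\n" in values should be harmless; the problem would only arise if raw response text is written to the file without re-serialization.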
>>
>> I set this:
>> hive -hiveconf hive.root.logger=DEBUG,console
>>
>> But the output is not really useful:
>>
>> blocks=[LocatedBlock{BP-330966259-192.168.1.61-1351349834344:blk_-6076570611719758877_116734;
>> getBlockSize()=20635; corrupt=false; offset=0; locs=[192.168.1.61:50010,
>> 192.168.1.66:50010, 192.168.1.63:50010]}]
>>
>> lastLocatedBlock=LocatedBlock{BP-330966259-192.168.1.61-1351349834344:blk_-6076570611719758877_116734;
>> getBlockSize()=20635; corrupt=false; offset=0; locs=[192.168.1.61:50010,
>> 192.168.1.66:50010, 192.168.1.63:50010]}
>>   isLastBlockComplete=true}
>> 13/07/30 11:49:41 DEBUG hdfs.DFSClient: Connecting to datanode
>> 192.168.1.61:50010
>> null
>> null
>> null
>> null
>> null
>> null
>> null
>> null
>> null
>> null
>> null
>> null
>> null
>> null
>> null
>> null
>> 13/07/30 11:49:41 INFO exec.
>>
>> Also, the attributes I am adding are the current year, month, day, and
>> time, so they are not null for any record. I even moved out the existing
>> files which did not have these fields set, so that there are no records
>> with these fields as null. However, I don't think this is the issue, as
>> the advantage of JSON and the Hive JSON serde is that the object struct
>> is allowed to be dynamic. Right?
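
For what it's worth, the dynamic-schema behaviour can be mimicked in a few lines of Python (the column and field names here are hypothetical): a key missing from a record surfaces as NULL for that column only, and extra keys are ignored. An absent field therefore should not null out every column of every row; that symptom points more toward a serde or format mismatch than toward missing attributes.

```python
import json

# Hypothetical column list from a table definition.
columns = ["id", "fetch_year", "text"]

# A record that lacks fetch_year but has all other fields.
row = json.loads('{"id": 7, "text": "hello"}')

# Missing keys become None (NULL in Hive); present keys come through intact.
values = [row.get(c) for c in columns]
assert values == [7, None, "hello"]
```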
>>
>> Any suggestion regarding debugging would be very helpful.
>>
>> thanks
>> Sunita
>>
>
>

Re: Select statements return null

Posted by Matouk IFTISSEN <ma...@ysance.com>.
Hello Sunita,
I looked into your case today, and if I have time I will test it. I took
a look at the json-serde.jar source code, specifically the class JSONSerDe,
and found (I'm not a Java developer ;) ) something that I suppose could
produce the wrong results, given your data:

// Get a list of the table's column names.
    String colNamesStr = tbl.getProperty(serdeConstants.LIST_COLUMNS);
    colNames = Arrays.asList(colNamesStr.split(","));
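
Note that the property being split here holds only the comma-separated list of column names taken from the table definition, not row data, so commas inside field values never reach this particular split. A small Python illustration of what the split actually sees (the column list is hypothetical):

```python
# The serde property contains only column names, comma-separated, e.g.:
list_columns = "id,created_at,text,fetch_year"

# Splitting it recovers the declared column names, nothing from the rows.
col_names = list_columns.split(",")
assert col_names == ["id", "created_at", "text", "fetch_year"]
```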

The split on the comma "," could be a problem if your field values also
contain commas (in addition to your column separators).

What I suggest is to start debugging step by step like this:
First, keep just one or two columns in your entity structure (the struct
in your table definition) and adjust your data accordingly so the table
can parse it (10 records with one or two attributes).

Then test select *; if that does not work, run select count(1) and look
at the jobtracker log to see what happened.
If it works, that is a good sign; then add the rest of your attributes
progressively ('au fur et à mesure', as we say in French ;) ) and test
them.
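
A quick way to generate such a minimal ten-record test file (the field names are made up for illustration) is:

```python
import json

def write_minimal_sample(path, n=10):
    """Write n one-line JSON records with just two attributes, for
    testing a stripped-down table definition against known-good data."""
    with open(path, "w") as f:
        for i in range(n):
            f.write(json.dumps({"id": i, "text": "record %d" % i}) + "\n")
```

Pointing the two-column external table at a directory containing only this file isolates the table definition from the real data: if even this fails, the problem is in the serde or table setup, not the generated files.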


Hope this can help you.
Matouk



2013/8/1 Sunita Arvind <su...@gmail.com>

> Thanks for your help Matouk,
> I am using a JSON serde -
> http://files.cloudera.com/samples/hive-serdes-1.0-SNAPSHOT.jar
> (mentioned on this page - https://github.com/cloudera/cdh-twitter-example)
>
> Attached is the table definition. I have tried with the fetch_xx fields in
> the input file and table and also tried without them. The logs usually are
> useful when there is an error in parsing. If there are no errors, the logs
> don't show me anything. Or am I missing something here?
>
> I am also attaching 2 output samples. One of them (xxx20130729_0) has the
> profileid and fetch_year, fetch_month, fetch_day and fetch_time added. I
> need to get this working. For debugging purposes, I tried without these
> fields (xx09July2013_20) and loading just the response json without
> manipulations. While using this, I used the create table without the
> corresponding fields in the definition as well. The results are not
> consistent with this.
>
> Is there a way (a debug flag) to make the jackson parser emit the current
> token?
>
> Sunita
>
>
>
>
>
> On Wed, Jul 31, 2013 at 4:06 AM, Matouk IFTISSEN <
> matouk.iftissen@ysance.com> wrote:
>
>> Hello Sunita,
>>
>> If you are using JSON, try adding the jar 'hive-json-serde.jar' before you
>> load your data into the final table. Also try declaring your date
>> attributes as String first, to check whether they are the cause.
>>
>> Are you using an external table with regular expressions (regexp) to
>> parse your data? If so, can you send us the table definition and the
>> structure of a row from your data?
>> The last thing I can suggest is to run a MapReduce operation over the
>> table (select count(1) from your_table) and then check the jobtracker
>> log to debug the issue.
>>
>> Hope this helps ;)
>>
>>
>>
>>
>> 2013/7/30 Sunita Arvind <su...@gmail.com>
>>
>>> Hi,
>>>
>>> I have written a script which generates JSON files, loads them into a
>>> dictionary, adds a few attributes, and uploads the modified files to
>>> HDFS. After the files are generated, if I perform a select * from ...;
>>> on the table which points to this location, I get "null, null, ..." as
>>> the result. I also tried without the added attributes and it did not
>>> make a difference. I strongly suspect the data.
>>> Currently I am using strip() to eliminate trailing and leading
>>> whitespace and newlines. I am wondering whether embedded "\n"
>>> characters, that is, JSON string values containing "\n", cause such
>>> issues.
>>> There are no parsing errors, so I am not able to debug this issue. Are
>>> there any flags I can set to figure out what is happening within the
>>> parser code?
>>>
>>> I set this:
>>> hive -hiveconf hive.root.logger=DEBUG,console
>>>
>>> But the output is not really useful:
>>>
>>> blocks=[LocatedBlock{BP-330966259-192.168.1.61-1351349834344:blk_-6076570611719758877_116734;
>>> getBlockSize()=20635; corrupt=false; offset=0; locs=[192.168.1.61:50010,
>>> 192.168.1.66:50010, 192.168.1.63:50010]}]
>>>
>>> lastLocatedBlock=LocatedBlock{BP-330966259-192.168.1.61-1351349834344:blk_-6076570611719758877_116734;
>>> getBlockSize()=20635; corrupt=false; offset=0; locs=[192.168.1.61:50010,
>>> 192.168.1.66:50010, 192.168.1.63:50010]}
>>>   isLastBlockComplete=true}
>>> 13/07/30 11:49:41 DEBUG hdfs.DFSClient: Connecting to datanode
>>> 192.168.1.61:50010
>>> null
>>> null
>>> null
>>> null
>>> null
>>> null
>>> null
>>> null
>>> null
>>> null
>>> null
>>> null
>>> null
>>> null
>>> null
>>> null
>>> 13/07/30 11:49:41 INFO exec.
>>>
>>> Also, the attributes I am adding are the current year, month, day, and
>>> time, so they are not null for any record. I even moved out the existing
>>> files which did not have these fields set, so that there are no records
>>> with these fields as null. However, I don't think this is the issue, as
>>> the advantage of JSON and the Hive JSON serde is that the object struct
>>> is allowed to be dynamic. Right?
>>>
>>> Any suggestion regarding debugging would be very helpful.
>>>
>>> thanks
>>> Sunita
>>>
>>
>>
>