Posted to user@hive.apache.org by "Riesland, Zack" <Za...@sensus.com> on 2016/01/28 21:05:29 UTC

"Create external table" nulling data from source table

First time posting to this list. Please forgive me if I break etiquette. I'm looking for some help with getting data from Hive to HBase.

I'm using HDP 2.2.8.

I have a compressed (zlib), ORC-based Hive table with 12 columns and billions of rows.

In order to get the data into hbase, I have to create a copy of the table as an "external" table, backed by CSV files (unless someone knows a better way).
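
For reference, the copy step looks roughly like this (cut down to three placeholder columns with made-up names; my real table has 12 columns):

    -- external, CSV-backed staging copy of the ORC table
    CREATE EXTERNAL TABLE my_table_csv (
      meter_id   STRING,
      read_value DECIMAL(18,6),
      read_time  TIMESTAMP
    )
    ROW FORMAT DELIMITED
      FIELDS TERMINATED BY ','
    STORED AS TEXTFILE
    LOCATION '/tmp/my_table_csv';

    -- copy everything from the ORC-backed table into the CSV copy
    INSERT OVERWRITE TABLE my_table_csv
    SELECT meter_id, read_value, read_time FROM my_table_orc;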

Then, I use the CsvBulkLoad MapReduce job to create HFiles from the CSV files backing the external table.

I've been doing this for almost a year, and MOST of the data ends up correct, but if I export a large amount of data, I end up with nulls where I shouldn't.

If I run the exact same query on the source table (compressed ORC) and the destination table (external text), I get null values in the results of the latter, but not the former.

However, if I only copy a small subset of the data to the text-based table, all the data is correct.

I also noticed that if I use an uncompressed source table, and then copy to an external text-based table, it happens much more often.

So, my (not-very-educated) guess is that this has to do with ORC files.

I know that there are alternatives to ORC, but Hortonworks strongly encourages us to use ORC for everything. I'm not even sure whether Parquet works with HDP.

Anyway, is this a known bug?

Any ideas on how I can get around it without chopping up my data into multiple tables?

Re: "Create external table" nulling data from source table

Posted by Gopal Vijayaraghavan <go...@apache.org>.
> And again: the same row is correct if I export a small set of data, and
> incorrect if I export a large set - so I think that file/data size has
> something to do with this.

My Phoenix vs LLAP benchmark hit size-related issues in ETL.


In my case, the tipping point was more than one HDFS block per CSV file.

Generating CSV files compressed with SNAPPY was how I prevented the
old-style MapReduce splitters from arbitrarily chopping up those files on
block boundaries while loading.
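
Roughly, from memory, it was a set of session settings along these lines
(reusing the placeholder table names from earlier in the thread; the
exact property names can vary by Hadoop version):

    -- write the text copy as Snappy-compressed files, so the old-style
    -- CSV input splitters see one unsplittable file each instead of
    -- cutting on HDFS block boundaries
    SET hive.exec.compress.output=true;
    SET mapreduce.output.fileoutputformat.compress=true;
    SET mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.SnappyCodec;

    INSERT OVERWRITE TABLE my_table_csv
    SELECT meter_id, read_value, read_time FROM my_table_orc;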

> I just tested and if I take the ORC table, copy it to a sequence file,
> and then copy to a CSV "file", everything looks good.
...
> So, my (not-very-educated) guess is that this has to do with ORC files.

Yes, though somewhat indirectly. Check the output file sizes between those
two.

ORC -> SequenceFile -> Text

will produce smaller text files (more of them) than

ORC -> Text.
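
To compare, something like this from the hive CLI shows the file sizes
under each table's directory (the warehouse path below is only a
placeholder for your actual table locations):

    -- run from the hive CLI; substitute the real table locations
    dfs -du -h /apps/hive/warehouse/my_table_csv;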

Cheers,
Gopal



RE: "Create external table" nulling data from source table

Posted by "Riesland, Zack" <Za...@sensus.com>.
Thanks Nick,

I don't think there should be any newline chars - the columns that are turning null are mostly Decimal, and none of my strings are longer than one word (no spaces).

And again: the same row is correct if I export a small set of data, and incorrect if I export a large set - so I think that file/data size has something to do with this.

That said, I think you have the right idea in general.

I just tested and if I take the ORC table, copy it to a sequence file, and then copy to a CSV "file", everything looks good.
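
In case it helps anyone else, the two-hop copy I tested was roughly this (same placeholder names as in my first mail, not my real schema):

    -- intermediate SequenceFile copy of the ORC table
    CREATE TABLE my_table_seq
    STORED AS SEQUENCEFILE
    AS SELECT meter_id, read_value, read_time FROM my_table_orc;

    -- then rebuild the CSV-backed external table from the SequenceFile copy
    INSERT OVERWRITE TABLE my_table_csv
    SELECT meter_id, read_value, read_time FROM my_table_seq;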

Is that going to be my best workaround for this?

-----Original Message-----
From: Nicholas Hakobian [mailto:nicholas.hakobian@rallyhealth.com] 
Sent: Thursday, January 28, 2016 3:15 PM
To: user@hive.apache.org
Subject: Re: "Create external table" nulling data from source table

Do you have any fields with embedded newline characters? If so, certain Hive output formats will parse the newline character as the end of a row, and when importing, chances are the missing fields (now part of the next row) will be padded with nulls. This happens in Hive as well if you are using a TextFile intermediate format (SequenceFile does not have this problem).

-Nick

Nicholas Szandor Hakobian
Data Scientist
Rally Health
nicholas.hakobian@rallyhealth.com




Re: "Create external table" nulling data from source table

Posted by Nicholas Hakobian <ni...@rallyhealth.com>.
Do you have any fields with embedded newline characters? If so,
certain Hive output formats will parse the newline character as the
end of a row, and when importing, chances are the missing fields (now
part of the next row) will be padded with nulls. This happens in Hive
as well if you are using a TextFile intermediate format (SequenceFile
does not have this problem).
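
A quick way to rule that out (untested, off the top of my head, and the
table/column names are placeholders for yours):

    -- any embedded newlines or carriage returns in a string column?
    SELECT count(*)
    FROM my_table_orc
    WHERE some_string_col RLIKE '\\n|\\r';

    -- if newlines are splitting rows, the text copy will end up with
    -- more rows than the ORC source
    SELECT count(*) FROM my_table_orc;
    SELECT count(*) FROM my_table_csv;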

-Nick

Nicholas Szandor Hakobian
Data Scientist
Rally Health
nicholas.hakobian@rallyhealth.com


