Posted to user@hbase.apache.org by Mahesha999 <ab...@gmail.com> on 2016/07/07 12:31:45 UTC

Escaping separator in data while bulk loading using importtsv tool and ingesting numeric values

I am using the importtsv tool to ingest data, and I have a couple of
doubts. I am on HBase 1.1.5.

First, does it ingest non-string/numeric values? I was referring to this article
<http://blog.cloudera.com/blog/2013/09/how-to-use-hbase-bulk-loading-and-why/>
detailing importtsv in the Cloudera distribution. It says: "it interprets
everything as strings". I am wondering what exactly that means.

I am using a simple wordcount example where the first column is a word and
the second column is the word count.

When my file looks like this:

"access","1"
"about","1"

ingesting it and then running scan in the hbase shell gives the following output:

 about                                 column=f:count,
timestamp=1467716881104, value="1"
 access                                column=f:count,
timestamp=1467716881104, value="1"

When the file looks like this (double quotes around the count removed):

"access",1
"about",1

ingesting it and then running scan in the hbase shell gives the following
output (no double quotes around the count):

 about                                 column=f:count,
timestamp=1467716881104, value=1
 access                                column=f:count,
timestamp=1467716881104, value=1
 
So as you can see, there are no double quotes in the count's value. *Q1. Does
that mean it is stored as an integer and not as a string?* The Cloudera
article suggests that a custom MR job needs to be written for ingesting
non-string values, but I cannot see what that means if the above is already
ingesting integer values.

My other doubt is whether I can escape the column separator when it appears
inside a column value. In importtsv, we can specify the separator as
follows:

-Dimporttsv.separator=,

But what if I have employee data where the first column is the employee name
and the second column is the address? My file will have rows like this:

"mahesh","A6,Hyatt Appartment"

The second comma makes importtsv think there are three columns, and it
throws BadTsvLineException("Excessive columns").

So I tried escaping the comma with a backslash ('\'), and, just out of
curiosity, escaping a backslash with another backslash (that is, "\\"). My
file had the following lines:

"able","1\"
"z","1\"
"za","1\\1"

Running scan in the hbase shell gave the following output:

 able                                  column=f:count,
timestamp=1467716881104, value="1\x5C"
 z                                     column=f:count,
timestamp=1467716881104, value="1\x5C"
 za                                    column=f:count,
timestamp=1467716881104, value="1\x5C\x5C1"

*Q2. So it seems that instead of treating the backslash as an escape for the
character that follows it, importtsv stores the backslash itself (shown by
the shell as "\x5C"). Is that right? Is there no way to escape the column
separator while bulk loading data using importtsv?*





--
View this message in context: http://apache-hbase.679495.n3.nabble.com/Escaping-separator-in-data-while-bulk-loading-using-importtsv-tool-and-ingesting-numeric-values-tp4081081.html
Sent from the HBase User mailing list archive at Nabble.com.

Re: Escaping separator in data while bulk loading using importtsv tool and ingesting numeric values

Posted by Dima Spivak <ds...@cloudera.com>.
Hi Mahesha,

1.) HBase stores all values as byte arrays, so there's no typing to speak
of. ImportTsv is simply ingesting what it sees, quotes included (or not).
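For instance, here is a small plain-Java sketch of that byte-level view (it uses only the JDK; HBase's own org.apache.hadoop.hbase.util.Bytes.toBytes(int) would produce the same 4-byte encoding shown at the end):

```java
import java.nio.ByteBuffer;
import java.util.Arrays;

public class BytesDemo {
    public static void main(String[] args) {
        // ImportTsv stores each field's text verbatim as bytes,
        // so quotes in the input become part of the stored value.
        byte[] quoted = "\"1\"".getBytes();  // field written as "1"
        byte[] bare   = "1".getBytes();      // field written as 1

        System.out.println(Arrays.toString(quoted)); // [34, 49, 34]
        System.out.println(Arrays.toString(bare));   // [49]

        // A true 4-byte big-endian int 1 (what Bytes.toBytes(1)
        // would give you) is a different byte array again:
        byte[] asInt = ByteBuffer.allocate(4).putInt(1).array();
        System.out.println(Arrays.toString(asInt));  // [0, 0, 0, 1]
    }
}
```

So the "1" without quotes in your second run is still the ASCII character '1' (byte 49), not a numeric type; the shell just prints printable bytes as-is.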

2.) ImportTsv doesn't support escaping, if I'm reading the code correctly. (
https://github.com/apache/hbase/blob/master/hbase-server/src/main/java/org/apache/hadoop/hbase/mapreduce/ImportTsv.java
)
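One common workaround, since ImportTsv has no escaping, is to pre-process the file into real TSV (tab-separated, quotes stripped) before loading, because a tab is unlikely to occur inside your values. A rough sketch (my own illustration, not part of ImportTsv; the regex assumes well-formed quoted fields with no embedded quotes):

```java
import java.util.ArrayList;
import java.util.List;

public class CsvToTsv {
    // Split on commas that fall outside double quotes
    // (a comma is "outside" if an even number of quotes follows it),
    // then strip the surrounding quotes from each field.
    static String toTsv(String line) {
        String[] fields =
            line.split(",(?=(?:[^\"]*\"[^\"]*\")*[^\"]*$)", -1);
        List<String> out = new ArrayList<>();
        for (String f : fields) {
            if (f.length() >= 2 && f.startsWith("\"") && f.endsWith("\"")) {
                f = f.substring(1, f.length() - 1);
            }
            out.add(f);
        }
        return String.join("\t", out);
    }

    public static void main(String[] args) {
        // The embedded comma survives because it sits inside quotes.
        System.out.println(toTsv("\"mahesh\",\"A6,Hyatt Appartment\""));
    }
}
```

After that you can run ImportTsv with its default tab separator and the embedded commas are just ordinary bytes in the value.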

All the best,
  Dima
