Posted to user@spark.apache.org by Mich Talebzadeh <mi...@gmail.com> on 2016/10/01 22:22:09 UTC

Loading data into Hbase table throws NoClassDefFoundError: org/apache/htrace/Trace error

I am trying a bulk load using HFiles in Spark, as in the example below:

import org.apache.spark._
import org.apache.spark.rdd.NewHadoopRDD
import org.apache.hadoop.hbase.{HBaseConfiguration, HTableDescriptor}
import org.apache.hadoop.hbase.client.HBaseAdmin
import org.apache.hadoop.hbase.mapreduce.TableInputFormat
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HColumnDescriptor
import org.apache.hadoop.hbase.util.Bytes
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.mapred.TableOutputFormat
import org.apache.hadoop.mapred.JobConf
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.mapreduce.Job
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat
import org.apache.hadoop.hbase.KeyValue
import org.apache.hadoop.hbase.mapreduce.HFileOutputFormat
import org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles

So far no issues.

Then I do

val conf = HBaseConfiguration.create()
conf: org.apache.hadoop.conf.Configuration = Configuration:
core-default.xml, core-site.xml, mapred-default.xml, mapred-site.xml,
yarn-default.xml, yarn-site.xml, hbase-default.xml, hbase-site.xml
val tableName = "testTable"
tableName: String = testTable

But this one fails:

scala> val table = new HTable(conf, tableName)
java.io.IOException: java.lang.reflect.InvocationTargetException
  at
org.apache.hadoop.hbase.client.ConnectionFactory.createConnection(ConnectionFactory.java:240)
  at
org.apache.hadoop.hbase.client.ConnectionManager.createConnection(ConnectionManager.java:431)
  at
org.apache.hadoop.hbase.client.ConnectionManager.createConnection(ConnectionManager.java:424)
  at
org.apache.hadoop.hbase.client.ConnectionManager.getConnectionInternal(ConnectionManager.java:302)
  at org.apache.hadoop.hbase.client.HTable.<init>(HTable.java:185)
  at org.apache.hadoop.hbase.client.HTable.<init>(HTable.java:151)
  ... 52 elided
Caused by: java.lang.reflect.InvocationTargetException:
java.lang.NoClassDefFoundError: org/apache/htrace/Trace
  at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
  at
sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
  at
sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
  at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
  at
org.apache.hadoop.hbase.client.ConnectionFactory.createConnection(ConnectionFactory.java:238)
  ... 57 more
Caused by: java.lang.NoClassDefFoundError: org/apache/htrace/Trace
  at
org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper.exists(RecoverableZooKeeper.java:216)
  at org.apache.hadoop.hbase.zookeeper.ZKUtil.checkExists(ZKUtil.java:419)
  at
org.apache.hadoop.hbase.zookeeper.ZKClusterId.readClusterIdZNode(ZKClusterId.java:65)
  at
org.apache.hadoop.hbase.client.ZooKeeperRegistry.getClusterId(ZooKeeperRegistry.java:105)
  at
org.apache.hadoop.hbase.client.ConnectionManager$HConnectionImplementation.retrieveClusterId(ConnectionManager.java:905)
  at
org.apache.hadoop.hbase.client.ConnectionManager$HConnectionImplementation.<init>(ConnectionManager.java:648)
  ... 62 more
Caused by: java.lang.ClassNotFoundException: org.apache.htrace.Trace

I have all the jar files listed in spark-defaults.conf:

spark.driver.extraClassPath
/home/hduser/jars/ojdbc6.jar:/home/hduser/jars/jconn4.jar:/home/hduser/jars/hbase-client-1.2.3.jar:/home/hduser/jars/hbase-server-1.2.3.jar:/home/hduser/jars/hbase-common-1.2.3.jar:/home/hduser/jars/hbase-protocol-1.2.3.jar:/home/hduser/jars/htrace-core-3.0.4.jar:/home/hduser/jars/hive-hbase-handler-2.1.0.jar
spark.executor.extraClassPath
/home/hduser/jars/ojdbc6.jar:/home/hduser/jars/jconn4.jar:/home/hduser/jars/hbase-client-1.2.3.jar:/home/hduser/jars/hbase-server-1.2.3.jar:/home/hduser/jars/hbase-common-1.2.3.jar:/home/hduser/jars/hbase-protocol-1.2.3.jar:/home/hduser/jars/htrace-core-3.0.4.jar:/home/hduser/jars/hive-hbase-handler-2.1.0.jar


and also in Spark shell where I test the code

 --jars
/home/hduser/jars/hbase-client-1.2.3.jar,/home/hduser/jars/hbase-server-1.2.3.jar,/home/hduser/jars/hbase-common-1.2.3.jar,/home/hduser/jars/hbase-protocol-1.2.3.jar,/home/hduser/jars/htrace-core-3.0.4.jar,/home/hduser/jars/hive-hbase-handler-2.1.0.jar
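
A quick way to verify that the jar actually reaches the driver and the executors is to resolve the class by name from the same spark-shell session. This is only a minimal check, assuming the classpath settings above:

// Resolve the class the stack trace complains about; a ClassNotFoundException here
// means the htrace jar on the driver classpath does not expose org.apache.htrace.Trace
// (some htrace releases ship the Trace class under a different package name).
Class.forName("org.apache.htrace.Trace")

// The executor side can be checked the same way, inside a small job:
sc.parallelize(Seq(1)).map(_ => Class.forName("org.apache.htrace.Trace").getName).collect()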

So any ideas will be appreciated.

Thanks

Dr Mich Talebzadeh



LinkedIn  https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.
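
For context on where the question above is heading: once the HTable can be created, the HFile bulk-load flow that those imports point at typically looks roughly like the sketch below. This reuses conf, table and the imports from the message above; the sample data, staging path and column names are illustrative, not taken from the thread.

// Rough shape of the HFile bulk load once the connection works.
val job = Job.getInstance(conf)
HFileOutputFormat.configureIncrementalLoad(job, table)

// (row key, (column family, qualifier, value)) pairs; HFiles must be written in
// sorted row-key order, hence the sortByKey before converting to KeyValue.
val rdd = sc.parallelize(Seq(
  ("TSCO-1-Apr-08", ("stock_daily", "close", "405.25")),
  ("TSCO-1-Apr-09", ("stock_daily", "close", "333.30"))))

val kvs = rdd.sortByKey().map { case (key, (cf, qual, value)) =>
  val k = Bytes.toBytes(key)
  (new ImmutableBytesWritable(k),
   new KeyValue(k, Bytes.toBytes(cf), Bytes.toBytes(qual), Bytes.toBytes(value)))
}

val stagingDir = "hdfs://rhes564:9000/tmp/tsco_hfiles"   // illustrative staging path
kvs.saveAsNewAPIHadoopFile(stagingDir,
  classOf[ImmutableBytesWritable], classOf[KeyValue],
  classOf[HFileOutputFormat], job.getConfiguration)

// Hand the generated HFiles to the region servers.
new LoadIncrementalHFiles(conf).doBulkLoad(new Path(stagingDir), table)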

Re: Loading data into Hbase table throws NoClassDefFoundError: org/apache/htrace/Trace error

Posted by Benjamin Kim <bb...@gmail.com>.
It has been deemed production ready by the Kudu people as of version 1.0.0. As for stability, my trial runs haven’t encountered any problems with the current version. Before that, I ran into known issues during the beta period that were fixed later. Our test use case is, basically, bringing over events data from S3 using Spark Streaming to populate a table in Kudu. I let it run over the weekend to see how it would perform. I would not say this is a gauge of production stability though.

Cheers,
Ben

> On Oct 3, 2016, at 10:31 AM, Mich Talebzadeh <mi...@gmail.com> wrote:
> 
> Hi Benjamin,
> 
> How stable is Kudu?
> 
> Is it production ready?
> 
> Thanks
> 
> Dr Mich Talebzadeh
>  
> LinkedIn  https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw <https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>
>  
> http://talebzadehmich.wordpress.com <http://talebzadehmich.wordpress.com/>
> 
> Disclaimer: Use it at your own risk. Any and all responsibility for any loss, damage or destruction of data or any other property which may arise from relying on this email's technical content is explicitly disclaimed. The author will in no case be liable for any monetary damages arising from such loss, damage or destruction.
>  
> 
> On 3 October 2016 at 18:08, Benjamin Kim <bbuild11@gmail.com <ma...@gmail.com>> wrote:
> If you’re interested, here is the link to the development page for Kudu. It has the Spark code snippets using DataFrames.
> 
> http://kudu.apache.org/docs/developing.html <http://kudu.apache.org/docs/developing.html>
> 
> Cheers,
> Ben
> 
>> On Oct 3, 2016, at 9:56 AM, ayan guha <guha.ayan@gmail.com <ma...@gmail.com>> wrote:
>> 
>> That sounds interesting, would love to learn more about it. 
>> 
>> Mitch: looks good. Lastly, I would suggest you think about whether you really need multiple column families. 
>> 
>> On 4 Oct 2016 02:57, "Benjamin Kim" <bbuild11@gmail.com <ma...@gmail.com>> wrote:
>> Lately, I’ve been experimenting with Kudu. It has been a much better experience than with HBase. Using it is much simpler, even from spark-shell.
>> 
>> spark-shell --packages org.apache.kudu:kudu-spark_2.10:1.0.0
>> 
>> It’s like going back to rudimentary DB systems where tables have just a primary key and the columns. Additional benefits include a home-grown Spark package, fast upserts and table scans for analytics, time-series support just introduced, and (my favorite) simpler configuration and administration. It has just gone to version 1.0.0; so I’m waiting for 1.0.1+, to let some bugs shake out, before I propose it as our HBase replacement. All my performance tests have been stellar versus HBase, especially considering its simplicity.
>> 
>> Just a thought…
>> 
>> Cheers,
>> Ben
>> 
>> 
>>> On Oct 3, 2016, at 8:40 AM, Mich Talebzadeh <mich.talebzadeh@gmail.com <ma...@gmail.com>> wrote:
>>> 
>>> Hi,
>>> 
>>> I decided to create a composite key ticker-date from the csv file
>>> 
>>> I just did some manipulation on CSV file 
>>> 
>>> export IFS=",";sed -i 1d tsco.csv; cat tsco.csv | while read a b c d e f; do echo "TSCO-$a,TESCO PLC,TSCO,$a,$b,$c,$d,$e,$f"; done > temp; mv -f temp tsco.csv
>>> 
>>> This basically takes the csv file, tells the shell that the field separator is IFS=",", drops the header, reads every field in every line (a, b, c ...), creates the composite key TSCO-$a, and adds the stock name and ticker to each line of the csv file. The whole process can be automated and parameterised.
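
For reference, roughly the same transformation can be done from spark-shell instead of sed/awk. This is only a minimal sketch, assuming the raw tsco.csv still has its header and the six columns Date,Open,High,Low,Close,Volume; the input and output paths are illustrative:

// Read the raw CSV, drop the header, and prepend the composite key plus
// stock name and ticker, mirroring the shell pipeline above.
val raw = sc.textFile("hdfs://rhes564:9000/data/stocks/tsco_raw.csv")
val header = raw.first()
val keyed = raw.filter(_ != header).map { line =>
  val f = line.split(",")                      // Date,Open,High,Low,Close,Volume
  val date = f(0)
  (Seq(s"TSCO-$date", "TESCO PLC", "TSCO") ++ f).mkString(",")
}
keyed.saveAsTextFile("hdfs://rhes564:9000/data/stocks/tsco_keyed")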
>>> 
>>> Once the csv file is put into HDFS then, I run the following command
>>> 
>>> $HBASE_HOME/bin/hbase org.apache.hadoop.hbase.mapreduce.ImportTsv -Dimporttsv.separator=',' -Dimporttsv.columns="HBASE_ROW_KEY,stock_info:stock,stock_info:ticker,stock_daily:Date,stock_daily:open,stock_daily:high,stock_daily:low,stock_daily:close,stock_daily:volume" tsco hdfs://rhes564:9000/data/stocks/tsco.csv
>>> 
>>> The Hbase table is created as below
>>> 
>>> create 'tsco','stock_info','stock_daily'
>>> 
>>> and this is the data (2 rows, each with 2 column families and 8 attributes)
>>> 
>>> hbase(main):132:0> scan 'tsco', LIMIT => 2
>>> ROW                                                    COLUMN+CELL
>>>  TSCO-1-Apr-08                                         column=stock_daily:Date, timestamp=1475507091676, value=1-Apr-08
>>>  TSCO-1-Apr-08                                         column=stock_daily:close, timestamp=1475507091676, value=405.25
>>>  TSCO-1-Apr-08                                         column=stock_daily:high, timestamp=1475507091676, value=406.75
>>>  TSCO-1-Apr-08                                         column=stock_daily:low, timestamp=1475507091676, value=379.25
>>>  TSCO-1-Apr-08                                         column=stock_daily:open, timestamp=1475507091676, value=380.00
>>>  TSCO-1-Apr-08                                         column=stock_daily:volume, timestamp=1475507091676, value=49664486
>>>  TSCO-1-Apr-08                                         column=stock_info:stock, timestamp=1475507091676, value=TESCO PLC
>>>  TSCO-1-Apr-08                                         column=stock_info:ticker, timestamp=1475507091676, value=TSCO
>>>  
>>>  TSCO-1-Apr-09                                         column=stock_daily:Date, timestamp=1475507091676, value=1-Apr-09
>>>  TSCO-1-Apr-09                                         column=stock_daily:close, timestamp=1475507091676, value=333.30
>>>  TSCO-1-Apr-09                                         column=stock_daily:high, timestamp=1475507091676, value=334.60
>>>  TSCO-1-Apr-09                                         column=stock_daily:low, timestamp=1475507091676, value=326.50
>>>  TSCO-1-Apr-09                                         column=stock_daily:open, timestamp=1475507091676, value=331.10
>>>  TSCO-1-Apr-09                                         column=stock_daily:volume, timestamp=1475507091676, value=24877341
>>>  TSCO-1-Apr-09                                         column=stock_info:stock, timestamp=1475507091676, value=TESCO PLC
>>>  TSCO-1-Apr-09                                         column=stock_info:ticker, timestamp=1475507091676, value=TSCO
>>> 
>>> Any suggestions?
>>> 
>>> Thanks
>>> 
>>> Dr Mich Talebzadeh
>>>  
>>> LinkedIn  https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw <https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>
>>>  
>>> http://talebzadehmich.wordpress.com <http://talebzadehmich.wordpress.com/>
>>> 
>>> Disclaimer: Use it at your own risk. Any and all responsibility for any loss, damage or destruction of data or any other property which may arise from relying on this email's technical content is explicitly disclaimed. The author will in no case be liable for any monetary damages arising from such loss, damage or destruction. 
>>>  
>>> 
>>> On 3 October 2016 at 14:42, Mich Talebzadeh <mich.talebzadeh@gmail.com <ma...@gmail.com>> wrote:
>>> or may be add ticker+date like similar
>>> 
>>> 
>>> <image.png>
>>> 
>>> So the new row key would be TSCO-1-Apr-08 
>>> 
>>> and this will be added as row key. Both Date and ticker will stay as they are as column family attributes?
>>> 
>>> 
>>> 
>>> Dr Mich Talebzadeh
>>>  
>>> LinkedIn  https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw <https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>
>>>  
>>> http://talebzadehmich.wordpress.com <http://talebzadehmich.wordpress.com/>
>>> 
>>> Disclaimer: Use it at your own risk. Any and all responsibility for any loss, damage or destruction of data or any other property which may arise from relying on this email's technical content is explicitly disclaimed. The author will in no case be liable for any monetary damages arising from such loss, damage or destruction. 
>>>  
>>> 
>>> On 3 October 2016 at 14:32, Mich Talebzadeh <mich.talebzadeh@gmail.com <ma...@gmail.com>> wrote:
>>> with ticker+date I can create something like below for the row key
>>> 
>>> TSCO_1-Apr-08 
>>> 
>>> 
>>> or TSCO1-Apr-08
>>> 
>>> if I understood you correctly
>>>                     
>>> 
>>> Dr Mich Talebzadeh
>>>  
>>> LinkedIn  https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw <https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>
>>>  
>>> http://talebzadehmich.wordpress.com <http://talebzadehmich.wordpress.com/>
>>> 
>>> Disclaimer: Use it at your own risk. Any and all responsibility for any loss, damage or destruction of data or any other property which may arise from relying on this email's technical content is explicitly disclaimed. The author will in no case be liable for any monetary damages arising from such loss, damage or destruction. 
>>>  
>>> 
>>> On 3 October 2016 at 13:13, ayan guha <guha.ayan@gmail.com <ma...@gmail.com>> wrote:
>>> Hi
>>> 
>>> Looks like you are saving to new.csv but still loading tsco.csv? It's definitely the header.
>>> 
>>> Suggestion: ticker+date as row key has following benefits:
>>> 
>>> 1. Using ticker+date as the row key will enable you to hold multiple tickers in this single hbase table. (Think composite primary key.)
>>> 2. Using the date itself as the row key will lead to hotspots (look up hotspotting due to monotonically increasing row keys). To distribute the load, it is suggested to use salting. Ticker can be used as a natural salt in this case.
>>> 3. Also, you may want to hash the rowkey value to make it a little more flexible (think surrogate key); a sketch of such a key builder follows below.
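
A minimal sketch of that row-key scheme in Scala; the hash algorithm, prefix length and helper name are illustrative assumptions, not something specified in the thread:

import java.security.MessageDigest

// Build a ticker+date row key, optionally prefixed with a short hash so that
// writes spread across regions instead of hotspotting on recent dates.
def rowKey(ticker: String, tradeDate: String, hashed: Boolean = false): String = {
  val natural = s"$ticker-$tradeDate"                 // e.g. "TSCO-1-Apr-08"
  if (!hashed) natural
  else {
    val digest = MessageDigest.getInstance("MD5").digest(natural.getBytes("UTF-8"))
    val prefix = digest.take(4).map("%02x".format(_)).mkString
    s"$prefix-$natural"
  }
}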
>>> 
>>> 
>>> 
>>> On Mon, Oct 3, 2016 at 10:17 PM, Mich Talebzadeh <mich.talebzadeh@gmail.com <ma...@gmail.com>> wrote:
>>> Hi Ayan,
>>> 
>>> Sounds like the row key has to be unique, much like a primary key in an RDBMS
>>> 
>>> This is what I download as a csv for stock from Google Finance
>>> 
>>>   Date	Open	High	Low	Close	Volume
>>> 27-Sep-16	177.4	177.75	172.5	177.75	24117196
>>> 
>>> 
>>> So what I do is add the stock and ticker myself to the end of each row via a shell script and get rid of the header
>>> 
>>> sed -i 1d tsco.csv; cat tsco.csv|awk '{print $0,",TESCO PLC,TSCO"}' > new.csv
>>> 
>>> The New table has two column families: stock_price, stock_info and row key date (one row per date)
>>> 
>>> This creates a new csv file with two additional columns appended to the end of each line
>>> 
>>> Then I run the following command 
>>> 
>>> $HBASE_HOME/bin/hbase org.apache.hadoop.hbase.mapreduce.ImportTsv -Dimporttsv.separator=',' -Dimporttsv.columns="HBASE_ROW_KEY, stock_daily:open, stock_daily:high, stock_daily:low, stock_daily:close, stock_daily:volume, stock_info:stock, stock_info:ticker" tsco hdfs://rhes564:9000/data/stocks/tsco.csv
>>> 
>>> This is in Hbase table for a given day
>>> 
>>> hbase(main):090:0> scan 'tsco', LIMIT => 10
>>> ROW                                                    COLUMN+CELL
>>>  1-Apr-08                                              column=stock_daily:close, timestamp=1475492248665, value=405.25
>>>  1-Apr-08                                              column=stock_daily:high, timestamp=1475492248665, value=406.75
>>>  1-Apr-08                                              column=stock_daily:low, timestamp=1475492248665, value=379.25
>>>  1-Apr-08                                              column=stock_daily:open, timestamp=1475492248665, value=380.00
>>>  1-Apr-08                                              column=stock_daily:volume, timestamp=1475492248665, value=49664486
>>>  1-Apr-08                                              column=stock_info:stock, timestamp=1475492248665, value=TESCO PLC
>>>  1-Apr-08                                              column=stock_info:ticker, timestamp=1475492248665, value=TSCO
>>> 
>>>   
>>> But I also have this at the bottom
>>> 
>>>   Date                                                  column=stock_daily:close, timestamp=1475491189158, value=Close
>>>  Date                                                  column=stock_daily:high, timestamp=1475491189158, value=High
>>>  Date                                                  column=stock_daily:low, timestamp=1475491189158, value=Low
>>>  Date                                                  column=stock_daily:open, timestamp=1475491189158, value=Open
>>>  Date                                                  column=stock_daily:volume, timestamp=1475491189158, value=Volume
>>>  Date                                                  column=stock_info:stock, timestamp=1475491189158, value=TESCO PLC
>>>  Date                                                  column=stock_info:ticker, timestamp=1475491189158, value=TSCO
>>> 
>>> Sounds like the table header?
>>> 
>>> Dr Mich Talebzadeh
>>>  
>>> LinkedIn  https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw <https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>
>>>  
>>> http://talebzadehmich.wordpress.com <http://talebzadehmich.wordpress.com/>
>>> 
>>> Disclaimer: Use it at your own risk. Any and all responsibility for any loss, damage or destruction of data or any other property which may arise from relying on this email's technical content is explicitly disclaimed. The author will in no case be liable for any monetary damages arising from such loss, damage or destruction. 
>>>  
>>> 
>>> On 3 October 2016 at 11:24, ayan guha <guha.ayan@gmail.com <ma...@gmail.com>> wrote:
>>> I am not well versed with importtsv, but you can create a CSV file using a simple spark program to create the first column as ticker+tradedate. I remember doing a similar manipulation to create the row key format in Pig. 
>>> 
>>> On 3 Oct 2016 20:40, "Mich Talebzadeh" <mich.talebzadeh@gmail.com <ma...@gmail.com>> wrote:
>>> Thanks Ayan,
>>> 
>>> How do you specify ticker+rtrade as row key in the below
>>> 
>>> hbase org.apache.hadoop.hbase.mapreduce.ImportTsv -Dimporttsv.separator=',' -Dimporttsv.columns="HBASE_ROW_KEY, stock_daily:ticker, stock_daily:tradedate, stock_daily:open,stock_daily:high,stock_daily:low,stock_daily:close,stock_daily:volume" tsco hdfs://rhes564:9000/data/stocks/tsco.csv
>>> 
>>> I always thought that HBase would take the first column as the row key, so it takes stock as the row key, which is Tesco PLC for every row!
>>> 
>>> Does row key need to be unique?
>>> 
>>> cheers
>>> 
>>> 
>>> Dr Mich Talebzadeh
>>>  
>>> LinkedIn  https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw <https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>
>>>  
>>> http://talebzadehmich.wordpress.com <http://talebzadehmich.wordpress.com/>
>>> 
>>> Disclaimer: Use it at your own risk. Any and all responsibility for any loss, damage or destruction of data or any other property which may arise from relying on this email's technical content is explicitly disclaimed. The author will in no case be liable for any monetary damages arising from such loss, damage or destruction. 
>>>  
>>> 
>>> On 3 October 2016 at 10:30, ayan guha <guha.ayan@gmail.com <ma...@gmail.com>> wrote:
>>> Hi Mitch
>>> 
>>> It is more to do with hbase than spark.
>>> 
>>> Row key can be anything, yes, but essentially what you are doing is inserting into and updating the Tesco PLC row. Given your schema, ticker+trade date seems to be a good row key
>>> 
>>> On 3 Oct 2016 18:25, "Mich Talebzadeh" <mich.talebzadeh@gmail.com <ma...@gmail.com>> wrote:
>>> thanks again.
>>> 
>>> I added that jar file to the classpath and that part worked.
>>> 
>>> I was using spark-shell, so I have to use spark-submit for it to be able to interact with the map-reduce job.
>>> 
>>> BTW when I use the command-line utility ImportTsv to load a file into HBase with the following table format
>>> 
>>> describe 'marketDataHbase'
>>> Table marketDataHbase is ENABLED
>>> marketDataHbase
>>> COLUMN FAMILIES DESCRIPTION
>>> {NAME => 'price_info', BLOOMFILTER => 'ROW', VERSIONS => '1', IN_MEMORY => 'false', KEEP_DELETED_CELLS => 'FALSE', DATA_BLOCK_ENCODING => 'NONE', TTL => 'FOREVER', COMPRESSION => 'NONE', MIN_VERSIONS => '0', BLOCKC
>>> ACHE => 'true', BLOCKSIZE => '65536', REPLICATION_SCOPE => '0'}
>>> 1 row(s) in 0.0930 seconds
>>> 
>>> 
>>> hbase org.apache.hadoop.hbase.mapreduce.ImportTsv -Dimporttsv.separator=',' -Dimporttsv.columns="HBASE_ROW_KEY, stock_daily:ticker, stock_daily:tradedate, stock_daily:open,stock_daily:high,stock_daily:low,stock_daily:close,stock_daily:volume" tsco hdfs://rhes564:9000/data/stocks/tsco.csv
>>> 
>>> There are 1200 rows in the csv file, but it only loads the first row!
>>> 
>>> scan 'tsco'
>>> ROW                                                    COLUMN+CELL
>>>  Tesco PLC                                             column=stock_daily:close, timestamp=1475447365118, value=325.25
>>>  Tesco PLC                                             column=stock_daily:high, timestamp=1475447365118, value=332.00
>>>  Tesco PLC                                             column=stock_daily:low, timestamp=1475447365118, value=324.00
>>>  Tesco PLC                                             column=stock_daily:open, timestamp=1475447365118, value=331.75
>>>  Tesco PLC                                             column=stock_daily:ticker, timestamp=1475447365118, value=TSCO
>>>  Tesco PLC                                             column=stock_daily:tradedate, timestamp=1475447365118, value= 3-Jan-06
>>>  Tesco PLC                                             column=stock_daily:volume, timestamp=1475447365118, value=46935045
>>> 1 row(s) in 0.0390 seconds
>>> 
>>> Is this because the hbase_row_key --> Tesco PLC is the same for all? I thought that the row key can be anything.
>>> 
>>> 
>>> 
>>> 
>>> 
>>> Dr Mich Talebzadeh
>>>  
>>> LinkedIn  https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw <https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>
>>>  
>>> http://talebzadehmich.wordpress.com <http://talebzadehmich.wordpress.com/>
>>> 
>>> Disclaimer: Use it at your own risk. Any and all responsibility for any loss, damage or destruction of data or any other property which may arise from relying on this email's technical content is explicitly disclaimed. The author will in no case be liable for any monetary damages arising from such loss, damage or destruction. 
>>>  
>>> 
>>> On 3 October 2016 at 07:44, Benjamin Kim <bbuild11@gmail.com <ma...@gmail.com>> wrote:
>>> We installed Apache Spark 1.6.0 at the time alongside CDH 5.4.8 because Cloudera only had Spark 1.3.0 at the time, and we wanted to use Spark 1.6.0’s features. We borrowed the /etc/spark/conf/spark-env.sh file that Cloudera generated because it was customized to add jars first from paths listed in the file /etc/spark/conf/classpath.txt. So, we entered the path for the htrace jar into the /etc/spark/conf/classpath.txt file. Then, it worked. We could read/write to HBase. 
>>> 
>>>> On Oct 2, 2016, at 12:52 AM, Mich Talebzadeh <mich.talebzadeh@gmail.com <ma...@gmail.com>> wrote:
>>>> 
>>>> Thanks Ben
>>>> 
>>>> The thing is I am using Spark 2 and no stack from CDH!
>>>> 
>>>> Is this approach to reading/writing to Hbase specific to Cloudera?
>>>> 
>>>> 
>>>> 
>>>> 
>>>> 
>>>> Dr Mich Talebzadeh
>>>>  
>>>> LinkedIn  https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw <https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>
>>>>  
>>>> http://talebzadehmich.wordpress.com <http://talebzadehmich.wordpress.com/>
>>>> 
>>>> Disclaimer: Use it at your own risk. Any and all responsibility for any loss, damage or destruction of data or any other property which may arise from relying on this email's technical content is explicitly disclaimed. The author will in no case be liable for any monetary damages arising from such loss, damage or destruction. 
>>>>  
>>>> 
>>>> On 1 October 2016 at 23:39, Benjamin Kim <bbuild11@gmail.com <ma...@gmail.com>> wrote:
>>>> Mich,
>>>> 
>>>> I know up until CDH 5.4 we had to add the HTrace jar to the classpath to make it work using the command below. But after upgrading to CDH 5.7, it became unnecessary.
>>>> 
>>>> echo "/opt/cloudera/parcels/CDH/jars/htrace-core-3.2.0-incubating.jar" >> /etc/spark/conf/classpath.txt
>>>> 
>>>> Hope this helps.
>>>> 
>>>> Cheers,
>>>> Ben
>>>> 
>>>> 
>>>>> On Oct 1, 2016, at 3:22 PM, Mich Talebzadeh <mich.talebzadeh@gmail.com <ma...@gmail.com>> wrote:
>>>>> 
>>>>> Trying bulk load using Hfiles in Spark as below example:
>>>>> 
>>>>> import org.apache.spark._
>>>>> import org.apache.spark.rdd.NewHadoopRDD
>>>>> import org.apache.hadoop.hbase.{HBaseConfiguration, HTableDescriptor}
>>>>> import org.apache.hadoop.hbase.client.HBaseAdmin
>>>>> import org.apache.hadoop.hbase.mapreduce.TableInputFormat
>>>>> import org.apache.hadoop.fs.Path;
>>>>> import org.apache.hadoop.hbase.HColumnDescriptor
>>>>> import org.apache.hadoop.hbase.util.Bytes
>>>>> import org.apache.hadoop.hbase.client.Put;
>>>>> import org.apache.hadoop.hbase.client.HTable;
>>>>> import org.apache.hadoop.hbase.mapred.TableOutputFormat
>>>>> import org.apache.hadoop.mapred.JobConf
>>>>> import org.apache.hadoop.hbase.io.ImmutableBytesWritable
>>>>> import org.apache.hadoop.mapreduce.Job
>>>>> import org.apache.hadoop.mapreduce.lib.input.FileInputFormat
>>>>> import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat
>>>>> import org.apache.hadoop.hbase.KeyValue
>>>>> import org.apache.hadoop.hbase.mapreduce.HFileOutputFormat
>>>>> import org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles
>>>>> 
>>>>> So far no issues.
>>>>> 
>>>>> Then I do
>>>>> 
>>>>> val conf = HBaseConfiguration.create()
>>>>> conf: org.apache.hadoop.conf.Configuration = Configuration: core-default.xml, core-site.xml, mapred-default.xml, mapred-site.xml, yarn-default.xml, yarn-site.xml, hbase-default.xml, hbase-site.xml
>>>>> val tableName = "testTable"
>>>>> tableName: String = testTable
>> ...
> 
> 


Re: Loading data into Hbase table throws NoClassDefFoundError: org/apache/htrace/Trace error

Posted by Mich Talebzadeh <mi...@gmail.com>.
Hi Benjamin,

How stable is Kudu?

Is it production ready?

Thanks

Dr Mich Talebzadeh



LinkedIn  https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




Re: Loading data into Hbase table throws NoClassDefFoundError: org/apache/htrace/Trace error

Posted by Benjamin Kim <bb...@gmail.com>.
If you’re interested, here is the link to the development page for Kudu. It has the Spark code snippets using DataFrames.

http://kudu.apache.org/docs/developing.html <http://kudu.apache.org/docs/developing.html>

Cheers,
Ben
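
For anyone skimming the thread, the DataFrame pattern on that page looks roughly like the following for the Spark 1.6 / kudu-spark_2.10:1.0.0 combination mentioned earlier. This is only a sketch; the master address and table name are placeholders:

import org.apache.kudu.spark.kudu._

// Read a Kudu table into a DataFrame through the kudu-spark package
// (spark-shell started with: --packages org.apache.kudu:kudu-spark_2.10:1.0.0).
val df = sqlContext.read
  .options(Map(
    "kudu.master" -> "kudu-master.example.com:7051",   // placeholder master address
    "kudu.table"  -> "stock_daily"))                    // placeholder table name
  .kudu

df.show(5)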

> On Oct 3, 2016, at 9:56 AM, ayan guha <gu...@gmail.com> wrote:
> 
> That sounds interesting, would love to learn more about it. 
> 
> Mitch: looks good. Lastly I would suggest you to think if you really need multiple column families. 
> 
> On 4 Oct 2016 02:57, "Benjamin Kim" <bbuild11@gmail.com <ma...@gmail.com>> wrote:
> Lately, I’ve been experimenting with Kudu. It has been a much better experience than with HBase. Using it is much simpler, even from spark-shell.
> 
> spark-shell --packages org.apache.kudu:kudu-spark_2.10:1.0.0
> 
> It’s like going back to rudimentary DB systems where tables have just a primary key and the columns. Additional benefits include a home-grown spark package, fast upserts and table scans for analytics, time-series support just introduced, and (my favorite) simpler configuration and administration. It has just gone to version 1.0.0; so, I’m waiting for 1.0.1+ before I propose it as our HBase replacement for some bugs to shake out. All my performance tests have been stellar versus HBase especially with its simplicity.
> 
> Just a thought…
> 
> Cheers,
> Ben
> 
> 
>> On Oct 3, 2016, at 8:40 AM, Mich Talebzadeh <mich.talebzadeh@gmail.com <ma...@gmail.com>> wrote:
>> 
>> Hi,
>> 
>> I decided to create a composite key ticker-date from the csv file
>> 
>> I just did some manipulation on CSV file 
>> 
>> export IFS=",";sed -i 1d tsco.csv; cat tsco.csv | while read a b c d e f; do echo "TSCO-$a,TESCO PLC,TSCO,$a,$b,$c,$d,$e,$f"; done > temp; mv -f temp tsco.csv
>> 
>> Which basically takes the csv file, tells the shell that field separator IFS=",", drops the header, reads every field in every line (1,b,c ..), creates the composite key TSCO-$a, adds the stock name and ticker to the csv file. The whole process can be automated and parameterised.
>> 
>> Once the csv file is put into HDFS then, I run the following command
>> 
>> $HBASE_HOME/bin/hbase org.apache.hadoop.hbase.mapreduce.ImportTsv -Dimporttsv.separator=',' -Dimporttsv.columns="HBASE_ROW_KEY,stock_info:stock,stock_info:ticker,stock_daily:Date,stock_daily:open,stock_daily:high,stock_daily:low,stock_daily:close,stock_daily:volume" tsco hdfs://rhes564:9000/data/ <>stocks/tsco.csv
>> 
>> The Hbase table is created as below
>> 
>> create 'tsco','stock_info','stock_daily'
>> 
>> and this is the data (2 rows each 2 family and with 8 attributes)
>> 
>> hbase(main):132:0> scan 'tsco', LIMIT => 2
>> ROW                                                    COLUMN+CELL
>>  TSCO-1-Apr-08                                         column=stock_daily:Date, timestamp=1475507091676, value=1-Apr-08
>>  TSCO-1-Apr-08                                         column=stock_daily:close, timestamp=1475507091676, value=405.25
>>  TSCO-1-Apr-08                                         column=stock_daily:high, timestamp=1475507091676, value=406.75
>>  TSCO-1-Apr-08                                         column=stock_daily:low, timestamp=1475507091676, value=379.25
>>  TSCO-1-Apr-08                                         column=stock_daily:open, timestamp=1475507091676, value=380.00
>>  TSCO-1-Apr-08                                         column=stock_daily:volume, timestamp=1475507091676, value=49664486
>>  TSCO-1-Apr-08                                         column=stock_info:stock, timestamp=1475507091676, value=TESCO PLC
>>  TSCO-1-Apr-08                                         column=stock_info:ticker, timestamp=1475507091676, value=TSCO
>>  
>>  TSCO-1-Apr-09                                         column=stock_daily:Date, timestamp=1475507091676, value=1-Apr-09
>>  TSCO-1-Apr-09                                         column=stock_daily:close, timestamp=1475507091676, value=333.30
>>  TSCO-1-Apr-09                                         column=stock_daily:high, timestamp=1475507091676, value=334.60
>>  TSCO-1-Apr-09                                         column=stock_daily:low, timestamp=1475507091676, value=326.50
>>  TSCO-1-Apr-09                                         column=stock_daily:open, timestamp=1475507091676, value=331.10
>>  TSCO-1-Apr-09                                         column=stock_daily:volume, timestamp=1475507091676, value=24877341
>>  TSCO-1-Apr-09                                         column=stock_info:stock, timestamp=1475507091676, value=TESCO PLC
>>  TSCO-1-Apr-09                                         column=stock_info:ticker, timestamp=1475507091676, value=TSCO
>> 
>> Any suggestions
>> 
>> Thanks
>> 
>> Dr Mich Talebzadeh
>>  
>> LinkedIn  https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw <https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>
>>  
>> http://talebzadehmich.wordpress.com <http://talebzadehmich.wordpress.com/>
>> 
>> Disclaimer: Use it at your own risk. Any and all responsibility for any loss, damage or destruction of data or any other property which may arise from relying on this email's technical content is explicitly disclaimed. The author will in no case be liable for any monetary damages arising from such loss, damage or destruction. 
>>  
>> 
>> On 3 October 2016 at 14:42, Mich Talebzadeh <mich.talebzadeh@gmail.com <ma...@gmail.com>> wrote:
>> or may be add ticker+date like similar
>> 
>> 
>> <image.png>
>> 
>> So the new row key would be TSCO-1-Apr-08 
>> 
>> and this will be added as row key. Both Date and ticker will stay as they are as column family attributes?
>> 
>> 
>> 
>> Dr Mich Talebzadeh
>>  
>> LinkedIn  https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw <https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>
>>  
>> http://talebzadehmich.wordpress.com <http://talebzadehmich.wordpress.com/>
>> 
>> Disclaimer: Use it at your own risk. Any and all responsibility for any loss, damage or destruction of data or any other property which may arise from relying on this email's technical content is explicitly disclaimed. The author will in no case be liable for any monetary damages arising from such loss, damage or destruction. 
>>  
>> 
>> On 3 October 2016 at 14:32, Mich Talebzadeh <mich.talebzadeh@gmail.com <ma...@gmail.com>> wrote:
>> with ticker+date I can c reate something like below for row key
>> 
>> TSCO_1-Apr-08 
>> 
>> 
>> or TSCO1-Apr-08
>> 
>> if I understood you correctly
>>                     
>> 
>> Dr Mich Talebzadeh
>>  
>> LinkedIn  https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw <https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>
>>  
>> http://talebzadehmich.wordpress.com <http://talebzadehmich.wordpress.com/>
>> 
>> Disclaimer: Use it at your own risk. Any and all responsibility for any loss, damage or destruction of data or any other property which may arise from relying on this email's technical content is explicitly disclaimed. The author will in no case be liable for any monetary damages arising from such loss, damage or destruction. 
>>  
>> 
>> On 3 October 2016 at 13:13, ayan guha <guha.ayan@gmail.com <ma...@gmail.com>> wrote:
>> Hi
>> 
>> Looks like you are saving to new.csv but still loading tsco.csv? Its definitely the header.
>> 
>> Suggestion: ticker+date as row key has following benefits:
>> 
>> 1. using ticker+date as row key will enable you to hold multiple ticker in this single hbase table. (Think composite primary key)
>> 2. Using the date itself as the row key will lead to hotspots (look up hotspotting due to monotonically increasing row keys). To distribute the load, it is suggested to use salting. Ticker can be used as a natural salt in this case.
>> 3. Also, you may want to hash the rowkey value to make it a little more flexible (think surrogate key).
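A minimal Scala sketch of what such a composite (and optionally salted or hashed) row key could look like; the ticker and trade date values here are purely illustrative:

val ticker    = "TSCO"
val tradeDate = "1-Apr-08"
val rowKey    = s"$ticker-$tradeDate"            // composite key, e.g. TSCO-1-Apr-08
// optional: prefix a small hash-derived bucket to spread writes across regions
val bucket    = math.abs(rowKey.hashCode % 10)   // 0..9
val saltedKey = s"$bucket-$rowKey"               // e.g. 7-TSCO-1-Apr-08

Note the trade-off: a salted key spreads writes across regions but makes straight range scans by ticker and date harder.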
>> 
>> 
>> 
>> On Mon, Oct 3, 2016 at 10:17 PM, Mich Talebzadeh <mich.talebzadeh@gmail.com <ma...@gmail.com>> wrote:
>> Hi Ayan,
>> 
>> Sounds like the row key has to be unique much like a primary key in RDBMS
>> 
>> This is what I download as a csv for stock from Google Finance
>> 
>>   Date	Open	High	Low	Close	Volume
>> 27-Sep-16	177.4	177.75	172.5	177.75	24117196
>> 
>> 
>> So what I do is add the stock name and ticker myself to the end of each row via a shell script and get rid of the header
>> 
>> sed -i 1d tsco.csv; cat tsco.csv|awk '{print $0,",TESCO PLC,TSCO"}' > new.csv
>> 
>> The New table has two column families: stock_price, stock_info and row key date (one row per date)
>> 
>> This creates a new csv file with two additional columns appended to the end of each line
>> 
>> Then I run the following command 
>> 
>> $HBASE_HOME/bin/hbase org.apache.hadoop.hbase.mapreduce.ImportTsv -Dimporttsv.separator=',' -Dimporttsv.columns="HBASE_ROW_KEY, stock_daily:open, stock_daily:high, stock_daily:low, stock_daily:close, stock_daily:volume, stock_info:stock, stock_info:ticker" tsco hdfs://rhes564:9000/data/stocks/tsco.csv
>> 
>> This is in Hbase table for a given day
>> 
>> hbase(main):090:0> scan 'tsco', LIMIT => 10
>> ROW                                                    COLUMN+CELL
>>  1-Apr-08                                              column=stock_daily:close, timestamp=1475492248665, value=405.25
>>  1-Apr-08                                              column=stock_daily:high, timestamp=1475492248665, value=406.75
>>  1-Apr-08                                              column=stock_daily:low, timestamp=1475492248665, value=379.25
>>  1-Apr-08                                              column=stock_daily:open, timestamp=1475492248665, value=380.00
>>  1-Apr-08                                              column=stock_daily:volume, timestamp=1475492248665, value=49664486
>>  1-Apr-08                                              column=stock_info:stock, timestamp=1475492248665, value=TESCO PLC
>>  1-Apr-08                                              column=stock_info:ticker, timestamp=1475492248665, value=TSCO
>> 
>>   
>> But I also have this at the bottom
>> 
>>   Date                                                  column=stock_daily:close, timestamp=1475491189158, value=Close
>>  Date                                                  column=stock_daily:high, timestamp=1475491189158, value=High
>>  Date                                                  column=stock_daily:low, timestamp=1475491189158, value=Low
>>  Date                                                  column=stock_daily:open, timestamp=1475491189158, value=Open
>>  Date                                                  column=stock_daily:volume, timestamp=1475491189158, value=Volume
>>  Date                                                  column=stock_info:stock, timestamp=1475491189158, value=TESCO PLC
>>  Date                                                  column=stock_info:ticker, timestamp=1475491189158, value=TSCO
>> 
>> Sounds like the table header?
>> 
>> 
>> 
>> 
>> 
>> 
>> 
>> 
>> 
>> 
>> 
>> 
>> 
>> Dr Mich Talebzadeh
>>  
>> LinkedIn  https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw <https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>
>>  
>> http://talebzadehmich.wordpress.com <http://talebzadehmich.wordpress.com/>
>> 
>> Disclaimer: Use it at your own risk. Any and all responsibility for any loss, damage or destruction of data or any other property which may arise from relying on this email's technical content is explicitly disclaimed. The author will in no case be liable for any monetary damages arising from such loss, damage or destruction. 
>>  
>> 
>> On 3 October 2016 at 11:24, ayan guha <guha.ayan@gmail.com <ma...@gmail.com>> wrote:
>> I am not well versed with importtsv, but you can create a CSV file using a simple Spark program that makes the first column ticker+tradedate. I remember doing a similar manipulation to create the row key format in Pig.
>> 
>> On 3 Oct 2016 20:40, "Mich Talebzadeh" <mich.talebzadeh@gmail.com <ma...@gmail.com>> wrote:
>> Thanks Ayan,
>> 
>> How do you specify ticker+rtrade as row key in the below
>> 
>> hbase org.apache.hadoop.hbase.mapreduce.ImportTsv -Dimporttsv.separator=',' -Dimporttsv.columns="HBASE_ROW_KEY, stock_daily:ticker, stock_daily:tradedate, stock_daily:open,stock_daily:high,stock_daily:low,stock_daily:close,stock_daily:volume" tsco hdfs://rhes564:9000/data/stocks/tsco.csv
>> 
>> I always thought that Hbase would take the first column as the row key, so it takes stock as the row key, which is Tesco PLC for every row!
>> 
>> Does row key need to be unique?
>> 
>> cheers
>> 
>> 
>> Dr Mich Talebzadeh
>>  
>> LinkedIn  https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw <https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>
>>  
>> http://talebzadehmich.wordpress.com <http://talebzadehmich.wordpress.com/>
>> 
>> Disclaimer: Use it at your own risk. Any and all responsibility for any loss, damage or destruction of data or any other property which may arise from relying on this email's technical content is explicitly disclaimed. The author will in no case be liable for any monetary damages arising from such loss, damage or destruction. 
>>  
>> 
>> On 3 October 2016 at 10:30, ayan guha <guha.ayan@gmail.com <ma...@gmail.com>> wrote:
>> Hi Mitch
>> 
>> It is more to do with hbase than spark.
>> 
>> Row key can be anything, yes, but essentially what you are doing is inserting into and updating the Tesco PLC row. Given your schema, ticker+trade date seems to be a good row key
>> 
>> On 3 Oct 2016 18:25, "Mich Talebzadeh" <mich.talebzadeh@gmail.com <ma...@gmail.com>> wrote:
>> thanks again.
>> 
>> I added that jar file to the classpath and that part worked.
>> 
>> I was using the Spark shell, so I have to use spark-submit for it to be able to interact with the map-reduce job.
>> 
>> BTW when I use the command line utility ImportTsv  to load a file into Hbase with the following table format
>> 
>> describe 'marketDataHbase'
>> Table marketDataHbase is ENABLED
>> marketDataHbase
>> COLUMN FAMILIES DESCRIPTION
>> {NAME => 'price_info', BLOOMFILTER => 'ROW', VERSIONS => '1', IN_MEMORY => 'false', KEEP_DELETED_CELLS => 'FALSE', DATA_BLOCK_ENCODING => 'NONE', TTL => 'FOREVER', COMPRESSION => 'NONE', MIN_VERSIONS => '0', BLOCKC
>> ACHE => 'true', BLOCKSIZE => '65536', REPLICATION_SCOPE => '0'}
>> 1 row(s) in 0.0930 seconds
>> 
>> 
>> hbase org.apache.hadoop.hbase.mapreduce.ImportTsv -Dimporttsv.separator=',' -Dimporttsv.columns="HBASE_ROW_KEY, stock_daily:ticker, stock_daily:tradedate, stock_daily:open,stock_daily:high,stock_daily:low,stock_daily:close,stock_daily:volume" tsco hdfs://rhes564:9000/data/stocks/tsco.csv
>> 
>> There are 1200 rows in the csv file, but it only loads the first row!
>> 
>> scan 'tsco'
>> ROW                                                    COLUMN+CELL
>>  Tesco PLC                                             column=stock_daily:close, timestamp=1475447365118, value=325.25
>>  Tesco PLC                                             column=stock_daily:high, timestamp=1475447365118, value=332.00
>>  Tesco PLC                                             column=stock_daily:low, timestamp=1475447365118, value=324.00
>>  Tesco PLC                                             column=stock_daily:open, timestamp=1475447365118, value=331.75
>>  Tesco PLC                                             column=stock_daily:ticker, timestamp=1475447365118, value=TSCO
>>  Tesco PLC                                             column=stock_daily:tradedate, timestamp=1475447365118, value= 3-Jan-06
>>  Tesco PLC                                             column=stock_daily:volume, timestamp=1475447365118, value=46935045
>> 1 row(s) in 0.0390 seconds
>> 
>> Is this because the hbase_row_key --> Tesco PLC is the same for all? I thought that the row key can be anything.
>> 
>> 
>> 
>> 
>> 
>> Dr Mich Talebzadeh
>>  
>> LinkedIn  https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw <https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>
>>  
>> http://talebzadehmich.wordpress.com <http://talebzadehmich.wordpress.com/>
>> 
>> Disclaimer: Use it at your own risk. Any and all responsibility for any loss, damage or destruction of data or any other property which may arise from relying on this email's technical content is explicitly disclaimed. The author will in no case be liable for any monetary damages arising from such loss, damage or destruction. 
>>  
>> 
>> On 3 October 2016 at 07:44, Benjamin Kim <bbuild11@gmail.com <ma...@gmail.com>> wrote:
>> We installed Apache Spark 1.6.0 at the time alongside CDH 5.4.8 because Cloudera only had Spark 1.3.0 at the time, and we wanted to use Spark 1.6.0’s features. We borrowed the /etc/spark/conf/spark-env.sh file that Cloudera generated because it was customized to add jars first from paths listed in the file /etc/spark/conf/classpath.txt. So, we entered the path for the htrace jar into the /etc/spark/conf/classpath.txt file. Then, it worked. We could read/write to HBase. 
>> 
>>> On Oct 2, 2016, at 12:52 AM, Mich Talebzadeh <mich.talebzadeh@gmail.com <ma...@gmail.com>> wrote:
>>> 
>>> Thanks Ben
>>> 
>>> The thing is I am using Spark 2 and no stack from CDH!
>>> 
>>> Is this approach to reading/writing to Hbase specific to Cloudera?
>>> 
>>> 
>>> 
>>> 
>>> 
>>> Dr Mich Talebzadeh
>>>  
>>> LinkedIn  https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw <https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>
>>>  
>>> http://talebzadehmich.wordpress.com <http://talebzadehmich.wordpress.com/>
>>> 
>>> Disclaimer: Use it at your own risk. Any and all responsibility for any loss, damage or destruction of data or any other property which may arise from relying on this email's technical content is explicitly disclaimed. The author will in no case be liable for any monetary damages arising from such loss, damage or destruction. 
>>>  
>>> 
>>> On 1 October 2016 at 23:39, Benjamin Kim <bbuild11@gmail.com <ma...@gmail.com>> wrote:
>>> Mich,
>>> 
>>> I know up until CDH 5.4 we had to add the HTrace jar to the classpath to make it work using the command below. But after upgrading to CDH 5.7, it became unnecessary.
>>> 
>>> echo "/opt/cloudera/parcels/CDH/jars/htrace-core-3.2.0-incubating.jar" >> /etc/spark/conf/classpath.txt
>>> 
>>> Hope this helps.
>>> 
>>> Cheers,
>>> Ben
>>> 
>>> 
>>>> On Oct 1, 2016, at 3:22 PM, Mich Talebzadeh <mich.talebzadeh@gmail.com <ma...@gmail.com>> wrote:
>>>> 
>>>> Trying bulk load using Hfiles in Spark as below example:
>>>> 
>>>> import org.apache.spark._
>>>> import org.apache.spark.rdd.NewHadoopRDD
>>>> import org.apache.hadoop.hbase.{HBaseConfiguration, HTableDescriptor}
>>>> import org.apache.hadoop.hbase.client.HBaseAdmin
>>>> import org.apache.hadoop.hbase.mapreduce.TableInputFormat
>>>> import org.apache.hadoop.fs.Path;
>>>> import org.apache.hadoop.hbase.HColumnDescriptor
>>>> import org.apache.hadoop.hbase.util.Bytes
>>>> import org.apache.hadoop.hbase.client.Put;
>>>> import org.apache.hadoop.hbase.client.HTable;
>>>> import org.apache.hadoop.hbase.mapred.TableOutputFormat
>>>> import org.apache.hadoop.mapred.JobConf
>>>> import org.apache.hadoop.hbase.io.ImmutableBytesWritable
>>>> import org.apache.hadoop.mapreduce.Job
>>>> import org.apache.hadoop.mapreduce.lib.input.FileInputFormat
>>>> import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat
>>>> import org.apache.hadoop.hbase.KeyValue
>>>> import org.apache.hadoop.hbase.mapreduce.HFileOutputFormat
>>>> import org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles
>>>> 
>>>> So far no issues.
>>>> 
>>>> Then I do
>>>> 
>>>> val conf = HBaseConfiguration.create()
>>>> conf: org.apache.hadoop.conf.Configuration = Configuration: core-default.xml, core-site.xml, mapred-default.xml, mapred-site.xml, yarn-default.xml, yarn-site.xml, hbase-default.xml, hbase-site.xml
>>>> val tableName = "testTable"
>>>> tableName: String = testTable
> ...


Re: Loading data into Hbase table throws NoClassDefFoundError: org/apache/htrace/Trace error

Posted by ayan guha <gu...@gmail.com>.
That sounds interesting, would love to learn more about it.

Mitch: looks good. Lastly, I would suggest you think about whether you really need multiple column families.
On 4 Oct 2016 02:57, "Benjamin Kim" <bb...@gmail.com> wrote:

> Lately, I’ve been experimenting with Kudu. It has been a much better
> experience than with HBase. Using it is much simpler, even from spark-shell.
>
> spark-shell --packages org.apache.kudu:kudu-spark_2.10:1.0.0
>
> It’s like going back to rudimentary DB systems where tables have just a
> primary key and the columns. Additional benefits include a home-grown spark
> package, fast upserts and table scans for analytics, time-series support
> just introduced, and (my favorite) simpler configuration and
> administration. It has just gone to version 1.0.0; so, I’m waiting for
> 1.0.1+ before I propose it as our HBase replacement for some bugs to shake
> out. All my performance tests have been stellar versus HBase especially
> with its simplicity.
>
> Just a thought…
>
> Cheers,
> Ben
>
>
> On Oct 3, 2016, at 8:40 AM, Mich Talebzadeh <mi...@gmail.com>
> wrote:
>
> Hi,
>
> I decided to create a composite key *ticker-date* from the csv file
>
> I just did some manipulation on CSV file
>
> export IFS=",";sed -i 1d tsco.csv; cat tsco.csv | while read a b c d e f; do echo "TSCO-$a,TESCO PLC,TSCO,$a,$b,$c,$d,$e,$f"; done > temp; mv -f temp tsco.csv
>
> Which basically takes the csv file, tells the shell that field separator
> IFS=",", drops the header, reads every field in every line (1,b,c ..),
> creates the composite key TSCO-$a, adds the stock name and ticker to the
> csv file. The whole process can be automated and parameterised.
>
> Once the csv file is put into HDFS then, I run the following command
>
> $HBASE_HOME/bin/hbase org.apache.hadoop.hbase.mapreduce.ImportTsv -Dimporttsv.separator=',' -Dimporttsv.columns="HBASE_ROW_KEY,stock_info:stock,stock_info:ticker,stock_daily:Date,stock_daily:open,stock_daily:high,stock_daily:low,stock_daily:close,stock_daily:volume" tsco hdfs://rhes564:9000/data/stocks/tsco.csv
>
> The Hbase table is created as below
>
> create 'tsco','stock_info','stock_daily'
>
> and this is the data (2 rows each 2 family and with 8 attributes)
>
> hbase(main):132:0> scan 'tsco', LIMIT => 2
> ROW                                                    COLUMN+CELL
>  TSCO-1-Apr-08                                        column=stock_daily:Date, timestamp=1475507091676, value=1-Apr-08
>  TSCO-1-Apr-08                                        column=stock_daily:close, timestamp=1475507091676, value=405.25
>  TSCO-1-Apr-08                                        column=stock_daily:high, timestamp=1475507091676, value=406.75
>  TSCO-1-Apr-08                                        column=stock_daily:low, timestamp=1475507091676, value=379.25
>  TSCO-1-Apr-08                                        column=stock_daily:open, timestamp=1475507091676, value=380.00
>  TSCO-1-Apr-08                                        column=stock_daily:volume, timestamp=1475507091676, value=49664486
>  TSCO-1-Apr-08                                        column=stock_info:stock, timestamp=1475507091676, value=TESCO PLC
>  TSCO-1-Apr-08                                        column=stock_info:ticker, timestamp=1475507091676, value=TSCO
>
>  TSCO-1-Apr-09                                        column=stock_daily:Date, timestamp=1475507091676, value=1-Apr-09
>  TSCO-1-Apr-09                                        column=stock_daily:close, timestamp=1475507091676, value=333.30
>  TSCO-1-Apr-09                                        column=stock_daily:high, timestamp=1475507091676, value=334.60
>  TSCO-1-Apr-09                                        column=stock_daily:low, timestamp=1475507091676, value=326.50
>  TSCO-1-Apr-09                                        column=stock_daily:open, timestamp=1475507091676, value=331.10
>  TSCO-1-Apr-09                                        column=stock_daily:volume, timestamp=1475507091676, value=24877341
>  TSCO-1-Apr-09                                        column=stock_info:stock, timestamp=1475507091676, value=TESCO PLC
>  TSCO-1-Apr-09                                        column=stock_info:ticker, timestamp=1475507091676, value=TSCO
>
> Any suggestions
>
> Thanks
>
> Dr Mich Talebzadeh
>
>
> LinkedIn * https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> <https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>
>
> http://talebzadehmich.wordpress.com
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
> On 3 October 2016 at 14:42, Mich Talebzadeh <mi...@gmail.com>
> wrote:
>
>> or maybe add ticker+date, similar to the image below
>>
>>
>> <image.png>
>>
>> So the new row key would be TSCO-1-Apr-08
>>
>> and this will be added as row key. Both Date and ticker will stay as they
>> are as column family attributes?
>>
>>
>>
>> Dr Mich Talebzadeh
>>
>>
>> LinkedIn * https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>> <https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>>
>>
>> http://talebzadehmich.wordpress.com
>>
>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>> any loss, damage or destruction of data or any other property which may
>> arise from relying on this email's technical content is explicitly
>> disclaimed. The author will in no case be liable for any monetary damages
>> arising from such loss, damage or destruction.
>>
>>
>>
>> On 3 October 2016 at 14:32, Mich Talebzadeh <mi...@gmail.com>
>> wrote:
>>
>>> with ticker+date I can create something like below for the row key
>>>
>>> TSCO_1-Apr-08
>>>
>>>
>>> or TSCO1-Apr-08
>>>
>>> if I understood you correctly
>>>
>>>
>>> Dr Mich Talebzadeh
>>>
>>>
>>> LinkedIn * https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>> <https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>>>
>>>
>>> http://talebzadehmich.wordpress.com
>>>
>>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>>> any loss, damage or destruction of data or any other property which may
>>> arise from relying on this email's technical content is explicitly
>>> disclaimed. The author will in no case be liable for any monetary damages
>>> arising from such loss, damage or destruction.
>>>
>>>
>>>
>>> On 3 October 2016 at 13:13, ayan guha <gu...@gmail.com> wrote:
>>>
>>>> Hi
>>>>
>>>> Looks like you are saving to new.csv but still loading tsco.csv? Its
>>>> definitely the header.
>>>>
>>>> Suggestion: ticker+date as row key has following benefits:
>>>>
>>>> 1. using ticker+date as row key will enable you to hold multiple ticker
>>>> in this single hbase table. (Think composite primary key)
>>>> 2. Using date itself as row key will lead to hotspots (Look up
>>>> hotspoting due to monotonically increasing row key). To distribute the
>>>> load, it is suggested to use a salting. Ticker can be used as a natural
>>>> salt in this case.
>>>> 3. Also, you may want to hash the rowkey value to give it little more
>>>> flexible (Think surrogate key).
>>>>
>>>>
>>>>
>>>> On Mon, Oct 3, 2016 at 10:17 PM, Mich Talebzadeh <
>>>> mich.talebzadeh@gmail.com> wrote:
>>>>
>>>>> Hi Ayan,
>>>>>
>>>>> Sounds like the row key has to be unique much like a primary key in
>>>>> RDBMS
>>>>>
>>>>> This is what I download as a csv for stock from Google Finance
>>>>>
>>>>>   Date Open High Low Close Volume
>>>>> 27-Sep-16 177.4 177.75 172.5 177.75 24117196
>>>>>
>>>>>
>>>>> So What I do I add the stock and ticker myself to end of the row via
>>>>> shell script and get rid of header
>>>>>
>>>>> sed -i 1d tsco.csv; cat tsco.csv|awk '{print $0,",TESCO PLC,TSCO"}' >
>>>>> new.csv
>>>>>
>>>>> The New table has two column families: stock_price, stock_info and row
>>>>> key date (one row per date)
>>>>>
>>>>> This creates a new csv file with two additional columns appended to
>>>>> the end of each line
>>>>>
>>>>> Then I run the following command
>>>>>
>>>>> $HBASE_HOME/bin/hbase org.apache.hadoop.hbase.mapreduce.ImportTsv
>>>>> -Dimporttsv.separator=',' -Dimporttsv.columns="HBASE_ROW_KEY,
>>>>> stock_daily:open, stock_daily:high, stock_daily:low, stock_daily:close,
>>>>> stock_daily:volume, stock_info:stock, stock_info:ticker" tsco
>>>>> hdfs://rhes564:9000/data/stocks/tsco.csv
>>>>>
>>>>> This is in Hbase table for a given day
>>>>>
>>>>> hbase(main):090:0> scan 'tsco', LIMIT => 10
>>>>> ROW                                                    COLUMN+CELL
>>>>>  1-Apr-08
>>>>> column=stock_daily:close, timestamp=1475492248665, value=405.25
>>>>>  1-Apr-08
>>>>> column=stock_daily:high, timestamp=1475492248665, value=406.75
>>>>>  1-Apr-08
>>>>> column=stock_daily:low, timestamp=1475492248665, value=379.25
>>>>>  1-Apr-08
>>>>> column=stock_daily:open, timestamp=1475492248665, value=380.00
>>>>>  1-Apr-08
>>>>> column=stock_daily:volume, timestamp=1475492248665, value=49664486
>>>>>  1-Apr-08
>>>>> column=stock_info:stock, timestamp=1475492248665, value=TESCO PLC
>>>>>  1-Apr-08
>>>>> column=stock_info:ticker, timestamp=1475492248665, value=TSCO
>>>>>
>>>>>
>>>>> But I also have this at the bottom
>>>>>
>>>>>   Date
>>>>> column=stock_daily:close, timestamp=1475491189158, value=Close
>>>>>  Date
>>>>> column=stock_daily:high, timestamp=1475491189158, value=High
>>>>>  Date
>>>>> column=stock_daily:low, timestamp=1475491189158, value=Low
>>>>>  Date
>>>>> column=stock_daily:open, timestamp=1475491189158, value=Open
>>>>>  Date
>>>>> column=stock_daily:volume, timestamp=1475491189158, value=Volume
>>>>>  Date
>>>>> column=stock_info:stock, timestamp=1475491189158, value=TESCO PLC
>>>>>  Date
>>>>> column=stock_info:ticker, timestamp=1475491189158, value=TSCO
>>>>>
>>>>> Sounds like the table header?
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> Dr Mich Talebzadeh
>>>>>
>>>>>
>>>>> LinkedIn * https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>>>> <https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>>>>>
>>>>>
>>>>> http://talebzadehmich.wordpress.com
>>>>>
>>>>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>>>>> any loss, damage or destruction of data or any other property which may
>>>>> arise from relying on this email's technical content is explicitly
>>>>> disclaimed. The author will in no case be liable for any monetary damages
>>>>> arising from such loss, damage or destruction.
>>>>>
>>>>>
>>>>>
>>>>> On 3 October 2016 at 11:24, ayan guha <gu...@gmail.com> wrote:
>>>>>
>>>>>> I am not well versed with importtsv, but you can create a CSV file
>>>>>> using a simple spark program to create first column as ticker+tradedate. I
>>>>>> remember doing similar manipulation to create row key format in pig.
>>>>>> On 3 Oct 2016 20:40, "Mich Talebzadeh" <mi...@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Thanks Ayan,
>>>>>>>
>>>>>>> How do you specify ticker+rtrade as row key in the below
>>>>>>>
>>>>>>> hbase org.apache.hadoop.hbase.mapreduce.ImportTsv
>>>>>>> -Dimporttsv.separator=',' -Dimporttsv.columns="HBASE_ROW_KEY,
>>>>>>> stock_daily:ticker, stock_daily:tradedate, stock_daily:open,stock_daily:h
>>>>>>> igh,stock_daily:low,stock_daily:close,stock_daily:volume" tsco
>>>>>>> hdfs://rhes564:9000/data/stocks/tsco.csv
>>>>>>>
>>>>>>> I always thought that Hbase will take the first column as row key so
>>>>>>> it takes stock as the row key which is tsco plc for every row!
>>>>>>>
>>>>>>> Does row key need to be unique?
>>>>>>>
>>>>>>> cheers
>>>>>>>
>>>>>>>
>>>>>>> Dr Mich Talebzadeh
>>>>>>>
>>>>>>>
>>>>>>> LinkedIn * https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>>>>>> <https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>>>>>>>
>>>>>>>
>>>>>>> http://talebzadehmich.wordpress.com
>>>>>>>
>>>>>>> *Disclaimer:* Use it at your own risk. Any and all responsibility
>>>>>>> for any loss, damage or destruction of data or any other property which may
>>>>>>> arise from relying on this email's technical content is explicitly
>>>>>>> disclaimed. The author will in no case be liable for any monetary damages
>>>>>>> arising from such loss, damage or destruction.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On 3 October 2016 at 10:30, ayan guha <gu...@gmail.com> wrote:
>>>>>>>
>>>>>>>> Hi Mitch
>>>>>>>>
>>>>>>>> It is more to do with hbase than spark.
>>>>>>>>
>>>>>>>> Row key can be anything, yes but essentially what you are doing is
>>>>>>>> insert and update tesco PLC row. Given your schema, ticker+trade date seems
>>>>>>>> to be a good row key
>>>>>>>> On 3 Oct 2016 18:25, "Mich Talebzadeh" <mi...@gmail.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> thanks again.
>>>>>>>>>
>>>>>>>>> I added that jar file to the classpath and that part worked.
>>>>>>>>>
>>>>>>>>> I was using spark shell so I have to use spark-submit for it to be
>>>>>>>>> able to interact with map-reduce job.
>>>>>>>>>
>>>>>>>>> BTW when I use the command line utility ImportTsv  to load a file
>>>>>>>>> into Hbase with the following table format
>>>>>>>>>
>>>>>>>>> describe 'marketDataHbase'
>>>>>>>>> Table marketDataHbase is ENABLED
>>>>>>>>> marketDataHbase
>>>>>>>>> COLUMN FAMILIES DESCRIPTION
>>>>>>>>> {NAME => 'price_info', BLOOMFILTER => 'ROW', VERSIONS => '1',
>>>>>>>>> IN_MEMORY => 'false', KEEP_DELETED_CELLS => 'FALSE', DATA_BLOCK_ENCODING =>
>>>>>>>>> 'NONE', TTL => 'FOREVER', COMPRESSION => 'NONE', MIN_VERSIONS => '0', BLOCKC
>>>>>>>>> ACHE => 'true', BLOCKSIZE => '65536', REPLICATION_SCOPE => '0'}
>>>>>>>>> 1 row(s) in 0.0930 seconds
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> hbase org.apache.hadoop.hbase.mapreduce.ImportTsv
>>>>>>>>> -Dimporttsv.separator=',' -Dimporttsv.columns="HBASE_ROW_KEY,
>>>>>>>>> stock_daily:ticker, stock_daily:tradedate, stock_daily:open,stock_daily:h
>>>>>>>>> igh,stock_daily:low,stock_daily:close,stock_daily:volume" tsco
>>>>>>>>> hdfs://rhes564:9000/data/stocks/tsco.csv
>>>>>>>>>
>>>>>>>>> There are with 1200 rows in the csv file,* but it only loads the
>>>>>>>>> first row!*
>>>>>>>>>
>>>>>>>>> scan 'tsco'
>>>>>>>>> ROW                                                    COLUMN+CELL
>>>>>>>>>  Tesco PLC
>>>>>>>>> column=stock_daily:close, timestamp=1475447365118, value=325.25
>>>>>>>>>  Tesco PLC
>>>>>>>>> column=stock_daily:high, timestamp=1475447365118, value=332.00
>>>>>>>>>  Tesco PLC
>>>>>>>>> column=stock_daily:low, timestamp=1475447365118, value=324.00
>>>>>>>>>  Tesco PLC
>>>>>>>>> column=stock_daily:open, timestamp=1475447365118, value=331.75
>>>>>>>>>  Tesco PLC
>>>>>>>>> column=stock_daily:ticker, timestamp=1475447365118, value=TSCO
>>>>>>>>>  Tesco PLC
>>>>>>>>> column=stock_daily:tradedate, timestamp=1475447365118, value= 3-Jan-06
>>>>>>>>>  Tesco PLC
>>>>>>>>> column=stock_daily:volume, timestamp=1475447365118, value=46935045
>>>>>>>>> 1 row(s) in 0.0390 seconds
>>>>>>>>>
>>>>>>>>> Is this because the hbase_row_key --> Tesco PLC is the same for
>>>>>>>>> all? I thought that the row key can be anything.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Dr Mich Talebzadeh
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> LinkedIn * https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>>>>>>>> <https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> http://talebzadehmich.wordpress.com
>>>>>>>>>
>>>>>>>>> *Disclaimer:* Use it at your own risk. Any and all responsibility
>>>>>>>>> for any loss, damage or destruction of data or any other property which may
>>>>>>>>> arise from relying on this email's technical content is explicitly
>>>>>>>>> disclaimed. The author will in no case be liable for any monetary damages
>>>>>>>>> arising from such loss, damage or destruction.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On 3 October 2016 at 07:44, Benjamin Kim <bb...@gmail.com>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> We installed Apache Spark 1.6.0 at the time alongside CDH 5.4.8
>>>>>>>>>> because Cloudera only had Spark 1.3.0 at the time, and we wanted to use
>>>>>>>>>> Spark 1.6.0’s features. We borrowed the /etc/spark/conf/spark-env.sh file
>>>>>>>>>> that Cloudera generated because it was customized to add jars first from
>>>>>>>>>> paths listed in the file /etc/spark/conf/classpath.txt. So, we entered the
>>>>>>>>>> path for the htrace jar into the /etc/spark/conf/classpath.txt file. Then,
>>>>>>>>>> it worked. We could read/write to HBase.
>>>>>>>>>>
>>>>>>>>>> On Oct 2, 2016, at 12:52 AM, Mich Talebzadeh <
>>>>>>>>>> mich.talebzadeh@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>> Thanks Ben
>>>>>>>>>>
>>>>>>>>>> The thing is I am using Spark 2 and no stack from CDH!
>>>>>>>>>>
>>>>>>>>>> Is this approach to reading/writing to Hbase specific to Cloudera?
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Dr Mich Talebzadeh
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> LinkedIn * https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>>>>>>>>> <https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> http://talebzadehmich.wordpress.com
>>>>>>>>>>
>>>>>>>>>> *Disclaimer:* Use it at your own risk. Any and all
>>>>>>>>>> responsibility for any loss, damage or destruction of data or any other
>>>>>>>>>> property which may arise from relying on this email's technical content is
>>>>>>>>>> explicitly disclaimed. The author will in no case be liable for any
>>>>>>>>>> monetary damages arising from such loss, damage or destruction.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On 1 October 2016 at 23:39, Benjamin Kim <bb...@gmail.com>
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>>> Mich,
>>>>>>>>>>>
>>>>>>>>>>> I know up until CDH 5.4 we had to add the HTrace jar to the
>>>>>>>>>>> classpath to make it work using the command below. But after upgrading to
>>>>>>>>>>> CDH 5.7, it became unnecessary.
>>>>>>>>>>>
>>>>>>>>>>> echo "/opt/cloudera/parcels/CDH/jar
>>>>>>>>>>> s/htrace-core-3.2.0-incubating.jar" >>
>>>>>>>>>>> /etc/spark/conf/classpath.txt
>>>>>>>>>>>
>>>>>>>>>>> Hope this helps.
>>>>>>>>>>>
>>>>>>>>>>> Cheers,
>>>>>>>>>>> Ben
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On Oct 1, 2016, at 3:22 PM, Mich Talebzadeh <
>>>>>>>>>>> mich.talebzadeh@gmail.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>> Trying bulk load using Hfiles in Spark as below example:
>>>>>>>>>>>
>>>>>>>>>>> import org.apache.spark._
>>>>>>>>>>> import org.apache.spark.rdd.NewHadoopRDD
>>>>>>>>>>> import org.apache.hadoop.hbase.{HBaseConfiguration,
>>>>>>>>>>> HTableDescriptor}
>>>>>>>>>>> import org.apache.hadoop.hbase.client.HBaseAdmin
>>>>>>>>>>> import org.apache.hadoop.hbase.mapreduce.TableInputFormat
>>>>>>>>>>> import org.apache.hadoop.fs.Path;
>>>>>>>>>>> import org.apache.hadoop.hbase.HColumnDescriptor
>>>>>>>>>>> import org.apache.hadoop.hbase.util.Bytes
>>>>>>>>>>> import org.apache.hadoop.hbase.client.Put;
>>>>>>>>>>> import org.apache.hadoop.hbase.client.HTable;
>>>>>>>>>>> import org.apache.hadoop.hbase.mapred.TableOutputFormat
>>>>>>>>>>> import org.apache.hadoop.mapred.JobConf
>>>>>>>>>>> import org.apache.hadoop.hbase.io.ImmutableBytesWritable
>>>>>>>>>>> import org.apache.hadoop.mapreduce.Job
>>>>>>>>>>> import org.apache.hadoop.mapreduce.lib.input.FileInputFormat
>>>>>>>>>>> import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat
>>>>>>>>>>> import org.apache.hadoop.hbase.KeyValue
>>>>>>>>>>> import org.apache.hadoop.hbase.mapreduce.HFileOutputFormat
>>>>>>>>>>> import org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles
>>>>>>>>>>>
>>>>>>>>>>> So far no issues.
>>>>>>>>>>>
>>>>>>>>>>> Then I do
>>>>>>>>>>>
>>>>>>>>>>> val conf = HBaseConfiguration.create()
>>>>>>>>>>> conf: org.apache.hadoop.conf.Configuration = Configuration:
>>>>>>>>>>> core-default.xml, core-site.xml, mapred-default.xml, mapred-site.xml,
>>>>>>>>>>> yarn-default.xml, yarn-site.xml, hbase-default.xml, hbase-site.xml
>>>>>>>>>>> val tableName = "testTable"
>>>>>>>>>>> tableName: String = testTable
>>>>>>>>>>>
>>>>>>>>>>> ...

Re: Loading data into Hbase table throws NoClassDefFoundError: org/apache/htrace/Trace error

Posted by Benjamin Kim <bb...@gmail.com>.
Lately, I’ve been experimenting with Kudu. It has been a much better experience than with HBase. Using it is much simpler, even from spark-shell.

spark-shell --packages org.apache.kudu:kudu-spark_2.10:1.0.0

It’s like going back to rudimentary DB systems where tables have just a primary key and the columns. Additional benefits include a home-grown Spark package, fast upserts and table scans for analytics, time-series support just introduced, and (my favorite) simpler configuration and administration. It has just gone to version 1.0.0, so I’m waiting for 1.0.1+, to let some bugs shake out, before I propose it as our HBase replacement. All my performance tests have been stellar versus HBase, especially given its simplicity.
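As a rough illustration only (the master address and table name below are made up, not taken from this thread), a Kudu table can then be read as a DataFrame from that same spark-shell session via the kudu-spark data source:

val stocks = sqlContext.read.
  format("org.apache.kudu.spark.kudu").
  option("kudu.master", "kudu-master:7051").   // hypothetical Kudu master address
  option("kudu.table", "stocks").              // hypothetical existing Kudu table
  load()
stocks.show(5)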

Just a thought…

Cheers,
Ben


> On Oct 3, 2016, at 8:40 AM, Mich Talebzadeh <mi...@gmail.com> wrote:
> 
> Hi,
> 
> I decided to create a composite key ticker-date from the csv file
> 
> I just did some manipulation on CSV file
> 
> export IFS=",";sed -i 1d tsco.csv; cat tsco.csv | while read a b c d e f; do echo "TSCO-$a,TESCO PLC,TSCO,$a,$b,$c,$d,$e,$f"; done > temp; mv -f temp tsco.csv
> 
> Which basically takes the csv file, tells the shell that field separator IFS=",", drops the header, reads every field in every line (1,b,c ..), creates the composite key TSCO-$a, adds the stock name and ticker to the csv file. The whole process can be automated and parameterised.
> 
> Once the csv file is put into HDFS then, I run the following command
> 
> $HBASE_HOME/bin/hbase org.apache.hadoop.hbase.mapreduce.ImportTsv -Dimporttsv.separator=',' -Dimporttsv.columns="HBASE_ROW_KEY,stock_info:stock,stock_info:ticker,stock_daily:Date,stock_daily:open,stock_daily:high,stock_daily:low,stock_daily:close,stock_daily:volume" tsco hdfs://rhes564:9000/data/stocks/tsco.csv
> 
> The Hbase table is created as below
> 
> create 'tsco','stock_info','stock_daily'
> 
> and this is the data (2 rows each 2 family and with 8 attributes)
> 
> hbase(main):132:0> scan 'tsco', LIMIT => 2
> ROW                                                    COLUMN+CELL
>  TSCO-1-Apr-08                                         column=stock_daily:Date, timestamp=1475507091676, value=1-Apr-08
>  TSCO-1-Apr-08                                         column=stock_daily:close, timestamp=1475507091676, value=405.25
>  TSCO-1-Apr-08                                         column=stock_daily:high, timestamp=1475507091676, value=406.75
>  TSCO-1-Apr-08                                         column=stock_daily:low, timestamp=1475507091676, value=379.25
>  TSCO-1-Apr-08                                         column=stock_daily:open, timestamp=1475507091676, value=380.00
>  TSCO-1-Apr-08                                         column=stock_daily:volume, timestamp=1475507091676, value=49664486
>  TSCO-1-Apr-08                                         column=stock_info:stock, timestamp=1475507091676, value=TESCO PLC
>  TSCO-1-Apr-08                                         column=stock_info:ticker, timestamp=1475507091676, value=TSCO
>  
>  TSCO-1-Apr-09                                         column=stock_daily:Date, timestamp=1475507091676, value=1-Apr-09
>  TSCO-1-Apr-09                                         column=stock_daily:close, timestamp=1475507091676, value=333.30
>  TSCO-1-Apr-09                                         column=stock_daily:high, timestamp=1475507091676, value=334.60
>  TSCO-1-Apr-09                                         column=stock_daily:low, timestamp=1475507091676, value=326.50
>  TSCO-1-Apr-09                                         column=stock_daily:open, timestamp=1475507091676, value=331.10
>  TSCO-1-Apr-09                                         column=stock_daily:volume, timestamp=1475507091676, value=24877341
>  TSCO-1-Apr-09                                         column=stock_info:stock, timestamp=1475507091676, value=TESCO PLC
>  TSCO-1-Apr-09                                         column=stock_info:ticker, timestamp=1475507091676, value=TSCO
> 
> Any suggestions
> 
> Thanks
> 
> Dr Mich Talebzadeh
>  
> LinkedIn  https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw <https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>
>  
> http://talebzadehmich.wordpress.com <http://talebzadehmich.wordpress.com/>
> 
> Disclaimer: Use it at your own risk. Any and all responsibility for any loss, damage or destruction of data or any other property which may arise from relying on this email's technical content is explicitly disclaimed. The author will in no case be liable for any monetary damages arising from such loss, damage or destruction.
>  
> 
> On 3 October 2016 at 14:42, Mich Talebzadeh <mich.talebzadeh@gmail.com <ma...@gmail.com>> wrote:
> or maybe add ticker+date, similar to the image below
> 
> 
> <image.png>
> 
> So the new row key would be TSCO-1-Apr-08 
> 
> and this will be added as row key. Both Date and ticker will stay as they are as column family attributes?
> 
> 
> 
> Dr Mich Talebzadeh
>  
> LinkedIn  https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw <https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>
>  
> http://talebzadehmich.wordpress.com <http://talebzadehmich.wordpress.com/>
> 
> Disclaimer: Use it at your own risk. Any and all responsibility for any loss, damage or destruction of data or any other property which may arise from relying on this email's technical content is explicitly disclaimed. The author will in no case be liable for any monetary damages arising from such loss, damage or destruction.
>  
> 
> On 3 October 2016 at 14:32, Mich Talebzadeh <mich.talebzadeh@gmail.com <ma...@gmail.com>> wrote:
> with ticker+date I can create something like below for the row key
> 
> TSCO_1-Apr-08 
> 
> 
> or TSCO1-Apr-08
> 
> if I understood you correctly
>                     
> 
> Dr Mich Talebzadeh
>  
> LinkedIn  https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw <https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>
>  
> http://talebzadehmich.wordpress.com <http://talebzadehmich.wordpress.com/>
> 
> Disclaimer: Use it at your own risk. Any and all responsibility for any loss, damage or destruction of data or any other property which may arise from relying on this email's technical content is explicitly disclaimed. The author will in no case be liable for any monetary damages arising from such loss, damage or destruction.
>  
> 
> On 3 October 2016 at 13:13, ayan guha <guha.ayan@gmail.com <ma...@gmail.com>> wrote:
> Hi
> 
> Looks like you are saving to new.csv but still loading tsco.csv? It's definitely the header.
> 
> Suggestion: ticker+date as row key has following benefits:
> 
> 1. using ticker+date as row key will enable you to hold multiple ticker in this single hbase table. (Think composite primary key)
> 2. Using the date itself as the row key will lead to hotspots (look up hotspotting due to monotonically increasing row keys). To distribute the load, it is suggested to use salting. Ticker can be used as a natural salt in this case.
> 3. Also, you may want to hash the rowkey value to make it a little more flexible (think surrogate key).
> 
> 
> 
> On Mon, Oct 3, 2016 at 10:17 PM, Mich Talebzadeh <mich.talebzadeh@gmail.com <ma...@gmail.com>> wrote:
> Hi Ayan,
> 
> Sounds like the row key has to be unique much like a primary key in RDBMS
> 
> This is what I download as a csv for stock from Google Finance
> 
>   Date	Open	High	Low	Close	Volume
> 27-Sep-16	177.4	177.75	172.5	177.75	24117196
> 
> 
> So what I do is add the stock name and ticker myself to the end of each row via a shell script and get rid of the header
> 
> sed -i 1d tsco.csv; cat tsco.csv|awk '{print $0,",TESCO PLC,TSCO"}' > new.csv
> 
> The New table has two column families: stock_price, stock_info and row key date (one row per date)
> 
> This creates a new csv file with two additional columns appended to the end of each line
> 
> Then I run the following command
> 
> $HBASE_HOME/bin/hbase org.apache.hadoop.hbase.mapreduce.ImportTsv -Dimporttsv.separator=',' -Dimporttsv.columns="HBASE_ROW_KEY, stock_daily:open, stock_daily:high, stock_daily:low, stock_daily:close, stock_daily:volume, stock_info:stock, stock_info:ticker" tsco hdfs://rhes564:9000/data/stocks/tsco.csv
> 
> This is in Hbase table for a given day
> 
> hbase(main):090:0> scan 'tsco', LIMIT => 10
> ROW                                                    COLUMN+CELL
>  1-Apr-08                                              column=stock_daily:close, timestamp=1475492248665, value=405.25
>  1-Apr-08                                              column=stock_daily:high, timestamp=1475492248665, value=406.75
>  1-Apr-08                                              column=stock_daily:low, timestamp=1475492248665, value=379.25
>  1-Apr-08                                              column=stock_daily:open, timestamp=1475492248665, value=380.00
>  1-Apr-08                                              column=stock_daily:volume, timestamp=1475492248665, value=49664486
>  1-Apr-08                                              column=stock_info:stock, timestamp=1475492248665, value=TESCO PLC
>  1-Apr-08                                              column=stock_info:ticker, timestamp=1475492248665, value=TSCO
> 
>   
> But I also have this at the bottom
> 
>   Date                                                  column=stock_daily:close, timestamp=1475491189158, value=Close
>  Date                                                  column=stock_daily:high, timestamp=1475491189158, value=High
>  Date                                                  column=stock_daily:low, timestamp=1475491189158, value=Low
>  Date                                                  column=stock_daily:open, timestamp=1475491189158, value=Open
>  Date                                                  column=stock_daily:volume, timestamp=1475491189158, value=Volume
>  Date                                                  column=stock_info:stock, timestamp=1475491189158, value=TESCO PLC
>  Date                                                  column=stock_info:ticker, timestamp=1475491189158, value=TSCO
> 
> Sounds like the table header?
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> Dr Mich Talebzadeh
>  
> LinkedIn  https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw <https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>
>  
> http://talebzadehmich.wordpress.com <http://talebzadehmich.wordpress.com/>
> 
> Disclaimer: Use it at your own risk. Any and all responsibility for any loss, damage or destruction of data or any other property which may arise from relying on this email's technical content is explicitly disclaimed. The author will in no case be liable for any monetary damages arising from such loss, damage or destruction.
>  
> 
> On 3 October 2016 at 11:24, ayan guha <guha.ayan@gmail.com <ma...@gmail.com>> wrote:
> I am not well versed with importtsv, but you can create a CSV file using a simple Spark program that makes the first column ticker+tradedate. I remember doing a similar manipulation to create the row key format in Pig.
> 
> On 3 Oct 2016 20:40, "Mich Talebzadeh" <mich.talebzadeh@gmail.com <ma...@gmail.com>> wrote:
> Thanks Ayan,
> 
> How do you specify ticker+rtrade as row key in the below
> 
> hbase org.apache.hadoop.hbase.mapreduce.ImportTsv -Dimporttsv.separator=',' -Dimporttsv.columns="HBASE_ROW_KEY, stock_daily:ticker, stock_daily:tradedate, stock_daily:open,stock_daily:high,stock_daily:low,stock_daily:close,stock_daily:volume" tsco hdfs://rhes564:9000/data/stocks/tsco.csv
> 
> I always thought that Hbase would take the first column as the row key, so it takes stock as the row key, which is Tesco PLC for every row!
> 
> Does row key need to be unique?
> 
> cheers
> 
> 
> Dr Mich Talebzadeh
>  
> LinkedIn  https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw <https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>
>  
> http://talebzadehmich.wordpress.com <http://talebzadehmich.wordpress.com/>
> 
> Disclaimer: Use it at your own risk. Any and all responsibility for any loss, damage or destruction of data or any other property which may arise from relying on this email's technical content is explicitly disclaimed. The author will in no case be liable for any monetary damages arising from such loss, damage or destruction.
>  
> 
> On 3 October 2016 at 10:30, ayan guha <guha.ayan@gmail.com <ma...@gmail.com>> wrote:
> Hi Mitch
> 
> It is more to do with hbase than spark.
> 
> Row key can be anything, yes, but essentially what you are doing is inserting into and updating the Tesco PLC row. Given your schema, ticker+trade date seems to be a good row key
> 
> On 3 Oct 2016 18:25, "Mich Talebzadeh" <mich.talebzadeh@gmail.com <ma...@gmail.com>> wrote:
> thanks again.
> 
> I added that jar file to the classpath and that part worked.
> 
> I was using the Spark shell, so I have to use spark-submit for it to be able to interact with the map-reduce job.
> 
> BTW when I use the command line utility ImportTsv  to load a file into Hbase with the following table format
> 
> describe 'marketDataHbase'
> Table marketDataHbase is ENABLED
> marketDataHbase
> COLUMN FAMILIES DESCRIPTION
> {NAME => 'price_info', BLOOMFILTER => 'ROW', VERSIONS => '1', IN_MEMORY => 'false', KEEP_DELETED_CELLS => 'FALSE', DATA_BLOCK_ENCODING => 'NONE', TTL => 'FOREVER', COMPRESSION => 'NONE', MIN_VERSIONS => '0', BLOCKC
> ACHE => 'true', BLOCKSIZE => '65536', REPLICATION_SCOPE => '0'}
> 1 row(s) in 0.0930 seconds
> 
> 
> hbase org.apache.hadoop.hbase.mapreduce.ImportTsv -Dimporttsv.separator=',' -Dimporttsv.columns="HBASE_ROW_KEY, stock_daily:ticker, stock_daily:tradedate, stock_daily:open,stock_daily:high,stock_daily:low,stock_daily:close,stock_daily:volume" tsco hdfs://rhes564:9000/data/stocks/tsco.csv
> 
> There are 1200 rows in the csv file, but it only loads the first row!
> 
> scan 'tsco'
> ROW                                                    COLUMN+CELL
>  Tesco PLC                                             column=stock_daily:close, timestamp=1475447365118, value=325.25
>  Tesco PLC                                             column=stock_daily:high, timestamp=1475447365118, value=332.00
>  Tesco PLC                                             column=stock_daily:low, timestamp=1475447365118, value=324.00
>  Tesco PLC                                             column=stock_daily:open, timestamp=1475447365118, value=331.75
>  Tesco PLC                                             column=stock_daily:ticker, timestamp=1475447365118, value=TSCO
>  Tesco PLC                                             column=stock_daily:tradedate, timestamp=1475447365118, value= 3-Jan-06
>  Tesco PLC                                             column=stock_daily:volume, timestamp=1475447365118, value=46935045
> 1 row(s) in 0.0390 seconds
> 
> Is this because the hbase_row_key --> Tesco PLC is the same for all? I thought that the row key can be anything.
> 
> 
> 
> 
> 
> Dr Mich Talebzadeh
>  
> LinkedIn  https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw <https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>
>  
> http://talebzadehmich.wordpress.com <http://talebzadehmich.wordpress.com/>
> 
> Disclaimer: Use it at your own risk. Any and all responsibility for any loss, damage or destruction of data or any other property which may arise from relying on this email's technical content is explicitly disclaimed. The author will in no case be liable for any monetary damages arising from such loss, damage or destruction.
>  
> 
> On 3 October 2016 at 07:44, Benjamin Kim <bbuild11@gmail.com <ma...@gmail.com>> wrote:
> We installed Apache Spark 1.6.0 at the time alongside CDH 5.4.8 because Cloudera only had Spark 1.3.0 at the time, and we wanted to use Spark 1.6.0’s features. We borrowed the /etc/spark/conf/spark-env.sh file that Cloudera generated because it was customized to add jars first from paths listed in the file /etc/spark/conf/classpath.txt. So, we entered the path for the htrace jar into the /etc/spark/conf/classpath.txt file. Then, it worked. We could read/write to HBase. 
> 
>> On Oct 2, 2016, at 12:52 AM, Mich Talebzadeh <mich.talebzadeh@gmail.com <ma...@gmail.com>> wrote:
>> 
>> Thanks Ben
>> 
>> The thing is I am using Spark 2 and no stack from CDH!
>> 
>> Is this approach to reading/writing to Hbase specific to Cloudera?
>> 
>> 
>> 
>> 
>> 
>> Dr Mich Talebzadeh
>>  
>> LinkedIn  https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw <https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>
>>  
>> http://talebzadehmich.wordpress.com <http://talebzadehmich.wordpress.com/>
>> 
>> Disclaimer: Use it at your own risk. Any and all responsibility for any loss, damage or destruction of data or any other property which may arise from relying on this email's technical content is explicitly disclaimed. The author will in no case be liable for any monetary damages arising from such loss, damage or destruction.
>>  
>> 
>> On 1 October 2016 at 23:39, Benjamin Kim <bbuild11@gmail.com <ma...@gmail.com>> wrote:
>> Mich,
>> 
>> I know up until CDH 5.4 we had to add the HTrace jar to the classpath to make it work using the command below. But after upgrading to CDH 5.7, it became unnecessary.
>> 
>> echo "/opt/cloudera/parcels/CDH/jars/htrace-core-3.2.0-incubating.jar" >> /etc/spark/conf/classpath.txt
>> 
>> Hope this helps.
>> 
>> Cheers,
>> Ben
>> 
>> 
>>> On Oct 1, 2016, at 3:22 PM, Mich Talebzadeh <mich.talebzadeh@gmail.com <ma...@gmail.com>> wrote:
>>> 
>>> Trying bulk load using Hfiles in Spark as below example:
>>> 
>>> import org.apache.spark._
>>> import org.apache.spark.rdd.NewHadoopRDD
>>> import org.apache.hadoop.hbase.{HBaseConfiguration, HTableDescriptor}
>>> import org.apache.hadoop.hbase.client.HBaseAdmin
>>> import org.apache.hadoop.hbase.mapreduce.TableInputFormat
>>> import org.apache.hadoop.fs.Path;
>>> import org.apache.hadoop.hbase.HColumnDescriptor
>>> import org.apache.hadoop.hbase.util.Bytes
>>> import org.apache.hadoop.hbase.client.Put;
>>> import org.apache.hadoop.hbase.client.HTable;
>>> import org.apache.hadoop.hbase.mapred.TableOutputFormat
>>> import org.apache.hadoop.mapred.JobConf
>>> import org.apache.hadoop.hbase.io.ImmutableBytesWritable
>>> import org.apache.hadoop.mapreduce.Job
>>> import org.apache.hadoop.mapreduce.lib.input.FileInputFormat
>>> import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat
>>> import org.apache.hadoop.hbase.KeyValue
>>> import org.apache.hadoop.hbase.mapreduce.HFileOutputFormat
>>> import org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles
>>> 
>>> So far no issues.
>>> 
>>> Then I do
>>> 
>>> val conf = HBaseConfiguration.create()
>>> conf: org.apache.hadoop.conf.Configuration = Configuration: core-default.xml, core-site.xml, mapred-default.xml, mapred-site.xml, yarn-default.xml, yarn-site.xml, hbase-default.xml, hbase-site.xml
>>> val tableName = "testTable"
>>> tableName: String = testTable
>>> 
>>> But this one fails:
>>> 
>>> scala> val table = new HTable(conf, tableName)
>>> java.io.IOException: java.lang.reflect.InvocationTargetException
>>>   at org.apache.hadoop.hbase.client.ConnectionFactory.createConnection(ConnectionFactory.java:240)
>>>   at org.apache.hadoop.hbase.client.ConnectionManager.createConnection(ConnectionManager.java:431)
>>>   at org.apache.hadoop.hbase.client.ConnectionManager.createConnection(ConnectionManager.java:424)
>>>   at org.apache.hadoop.hbase.client.ConnectionManager.getConnectionInternal(ConnectionManager.java:302)
>>>   at org.apache.hadoop.hbase.client.HTable.<init>(HTable.java:185)
>>>   at org.apache.hadoop.hbase.client.HTable.<init>(HTable.java:151)
>>>   ... 52 elided
>>> Caused by: java.lang.reflect.InvocationTargetException: java.lang.NoClassDefFoundError: org/apache/htrace/Trace
>>>   at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
>>>   at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
>>>   at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
>>>   at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
>>>   at org.apache.hadoop.hbase.client.ConnectionFactory.createConnection(ConnectionFactory.java:238)
>>>   ... 57 more
>>> Caused by: java.lang.NoClassDefFoundError: org/apache/htrace/Trace
>>>   at org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper.exists(RecoverableZooKeeper.java:216)
>>>   at org.apache.hadoop.hbase.zookeeper.ZKUtil.checkExists(ZKUtil.java:419)
>>>   at org.apache.hadoop.hbase.zookeeper.ZKClusterId.readClusterIdZNode(ZKClusterId.java:65)
>>>   at org.apache.hadoop.hbase.client.ZooKeeperRegistry.getClusterId(ZooKeeperRegistry.java:105)
>>>   at org.apache.hadoop.hbase.client.ConnectionManager$HConnectionImplementation.retrieveClusterId(ConnectionManager.java:905)
>>>   at org.apache.hadoop.hbase.client.ConnectionManager$HConnectionImplementation.<init>(ConnectionManager.java:648)
>>>   ... 62 more
>>> Caused by: java.lang.ClassNotFoundException: org.apache.htrace.Trace
>>> 
>>> I have got all the jar files in spark-defaults.conf
>>> 
>>> spark.driver.extraClassPath      /home/hduser/jars/ojdbc6.jar:/home/hduser/jars/jconn4.jar:/home/hduser/jars/hbase-client-1.2.3.jar:/home/hduser/jars/hbase-server-1.2.3.jar:/home/hduser/jars/hbase-common-1.2.3.jar:/home/hduser/jars/hbase-protocol-1.2.3.jar:/home/hduser/jars/htrace-core-3.0.4.jar:/home/hduser/jars/hive-hbase-handler-2.1.0.jar
>>> spark.executor.extraClassPath    /home/hduser/jars/ojdbc6.jar:/home/hduser/jars/jconn4.jar:/home/hduser/jars/hbase-client-1.2.3.jar:/home/hduser/jars/hbase-server-1.2.3.jar:/home/hduser/jars/hbase-common-1.2.3.jar:/home/hduser/jars/hbase-protocol-1.2.3.jar:/home/hduser/jars/htrace-core-3.0.4.jar:/home/hduser/jars/hive-hbase-handler-2.1.0.jar
>>> 
>>> 
>>> and also in Spark shell where I test the code
>>> 
>>>  --jars /home/hduser/jars/hbase-client-1.2.3.jar,/home/hduser/jars/hbase-server-1.2.3.jar,/home/hduser/jars/hbase-common-1.2.3.jar,/home/hduser/jars/hbase-protocol-1.2.3.jar,/home/hduser/jars/htrace-core-3.0.4.jar,/home/hduser/jars/hive-hbase-handler-2.1.0.jar'
>>> 
>>> So any ideas will be appreciated.
>>> 
>>> Thanks
>>> 
>>> Dr Mich Talebzadeh
>>>  
>>> LinkedIn  https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw <https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>
>>>  
>>> http://talebzadehmich.wordpress.com <http://talebzadehmich.wordpress.com/>
>>> 
>>> Disclaimer: Use it at your own risk. Any and all responsibility for any loss, damage or destruction of data or any other property which may arise from relying on this email's technical content is explicitly disclaimed. The author will in no case be liable for any monetary damages arising from such loss, damage or destruction.
>>>  
>> 
>> 
> 
> 
> 
> 
> 
> 
> 
> -- 
> Best Regards,
> Ayan Guha
> 
> 
> 


Re: Loading data into Hbase table throws NoClassDefFoundError: org/apache/htrace/Trace error

Posted by Mich Talebzadeh <mi...@gmail.com>.
Hi,

I decided to create a composite key *ticker-date* from the csv file

I just did some manipulation on CSV file

export IFS=",";sed -i 1d tsco.csv; cat tsco.csv | while read a b c d e f; do echo "TSCO-$a,TESCO PLC,TSCO,$a,$b,$c,$d,$e,$f"; done > temp; mv -f temp tsco.csv

This basically takes the csv file, tells the shell that the field separator is IFS=",", drops the header, reads every field in every line (a, b, c, ...), creates the composite key TSCO-$a, and adds the stock name and ticker to the csv file. The whole process can be automated and parameterised.
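The same composite-key file could also be produced with a few lines of Spark, along the lines ayan suggested. This is only a sketch, assuming the raw csv (with its header row) is already in HDFS; the output path is illustrative:

val raw    = sc.textFile("hdfs://rhes564:9000/data/stocks/tsco.csv")
val header = raw.first()                                // "Date,Open,High,Low,Close,Volume"
val keyed  = raw.filter(_ != header).map { line =>
  val date = line.split(",")(0)                         // first field is the trade date
  s"TSCO-$date,TESCO PLC,TSCO,$line"                    // composite key + stock + ticker + original fields
}
keyed.saveAsTextFile("hdfs://rhes564:9000/data/stocks/tsco_keyed")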

Once the csv file is put into HDFS then, I run the following command

$HBASE_HOME/bin/hbase org.apache.hadoop.hbase.mapreduce.ImportTsv -Dimporttsv.separator=',' -Dimporttsv.columns="HBASE_ROW_KEY,stock_info:stock,stock_info:ticker,stock_daily:Date,stock_daily:open,stock_daily:high,stock_daily:low,stock_daily:close,stock_daily:volume" tsco hdfs://rhes564:9000/data/stocks/tsco.csv

The Hbase table is created as below

create 'tsco','stock_info','stock_daily'

and this is the data (2 rows, each with 2 column families and 8 attributes):

hbase(main):132:0> scan 'tsco', LIMIT => 2
ROW              COLUMN+CELL
 TSCO-1-Apr-08   column=stock_daily:Date, timestamp=1475507091676, value=1-Apr-08
 TSCO-1-Apr-08   column=stock_daily:close, timestamp=1475507091676, value=405.25
 TSCO-1-Apr-08   column=stock_daily:high, timestamp=1475507091676, value=406.75
 TSCO-1-Apr-08   column=stock_daily:low, timestamp=1475507091676, value=379.25
 TSCO-1-Apr-08   column=stock_daily:open, timestamp=1475507091676, value=380.00
 TSCO-1-Apr-08   column=stock_daily:volume, timestamp=1475507091676, value=49664486
 TSCO-1-Apr-08   column=stock_info:stock, timestamp=1475507091676, value=TESCO PLC
 TSCO-1-Apr-08   column=stock_info:ticker, timestamp=1475507091676, value=TSCO
 TSCO-1-Apr-09   column=stock_daily:Date, timestamp=1475507091676, value=1-Apr-09
 TSCO-1-Apr-09   column=stock_daily:close, timestamp=1475507091676, value=333.30
 TSCO-1-Apr-09   column=stock_daily:high, timestamp=1475507091676, value=334.60
 TSCO-1-Apr-09   column=stock_daily:low, timestamp=1475507091676, value=326.50
 TSCO-1-Apr-09   column=stock_daily:open, timestamp=1475507091676, value=331.10
 TSCO-1-Apr-09   column=stock_daily:volume, timestamp=1475507091676, value=24877341
 TSCO-1-Apr-09   column=stock_info:stock, timestamp=1475507091676, value=TESCO PLC
 TSCO-1-Apr-09   column=stock_info:ticker, timestamp=1475507091676, value=TSCO

Any suggestions?

Thanks

Dr Mich Talebzadeh



LinkedIn * https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
<https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.



On 3 October 2016 at 14:42, Mich Talebzadeh <mi...@gmail.com>
wrote:

> or may be add ticker+date like similar
>
>
> [image: Inline images 1]
>
> So the new row key would be TSCO-1-Apr-08
>
> and this will be added as row key. Both Date and ticker will stay as they
> are as column family attributes?
>
>
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> <https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>
>
>
> http://talebzadehmich.wordpress.com
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
> On 3 October 2016 at 14:32, Mich Talebzadeh <mi...@gmail.com>
> wrote:
>
>> with ticker+date I can c reate something like below for row key
>>
>> TSCO_1-Apr-08
>>
>>
>> or TSCO1-Apr-08
>>
>> if I understood you correctly
>>
>>
>> Dr Mich Talebzadeh
>>
>>
>>
>> LinkedIn * https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>> <https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>>
>>
>>
>> http://talebzadehmich.wordpress.com
>>
>>
>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>> any loss, damage or destruction of data or any other property which may
>> arise from relying on this email's technical content is explicitly
>> disclaimed. The author will in no case be liable for any monetary damages
>> arising from such loss, damage or destruction.
>>
>>
>>
>> On 3 October 2016 at 13:13, ayan guha <gu...@gmail.com> wrote:
>>
>>> Hi
>>>
>>> Looks like you are saving to new.csv but still loading tsco.csv? Its
>>> definitely the header.
>>>
>>> Suggestion: ticker+date as row key has following benefits:
>>>
>>> 1. using ticker+date as row key will enable you to hold multiple ticker
>>> in this single hbase table. (Think composite primary key)
>>> 2. Using date itself as row key will lead to hotspots (Look up
>>> hotspoting due to monotonically increasing row key). To distribute the
>>> load, it is suggested to use a salting. Ticker can be used as a natural
>>> salt in this case.
>>> 3. Also, you may want to hash the rowkey value to give it little more
>>> flexible (Think surrogate key).
>>>
>>>
>>>
>>> On Mon, Oct 3, 2016 at 10:17 PM, Mich Talebzadeh <
>>> mich.talebzadeh@gmail.com> wrote:
>>>
>>>> Hi Ayan,
>>>>
>>>> Sounds like the row key has to be unique much like a primary key in
>>>> RDBMS
>>>>
>>>> This is what I download as a csv for stock from Google Finance
>>>>
>>>>   Date Open High Low Close Volume
>>>> 27-Sep-16 177.4 177.75 172.5 177.75 24117196
>>>>
>>>>
>>>> So What I do I add the stock and ticker myself to end of the row via
>>>> shell script and get rid of header
>>>>
>>>> sed -i 1d tsco.csv; cat tsco.csv|awk '{print $0,",TESCO PLC,TSCO"}' >
>>>> new.csv
>>>>
>>>> The New table has two column families: stock_price, stock_info and row
>>>> key date (one row per date)
>>>>
>>>> This creates a new csv file with two additional columns appended to the
>>>> end of each line
>>>>
>>>> Then I run the following command
>>>>
>>>> $HBASE_HOME/bin/hbase org.apache.hadoop.hbase.mapreduce.ImportTsv
>>>> -Dimporttsv.separator=',' -Dimporttsv.columns="HBASE_ROW_KEY,
>>>> stock_daily:open, stock_daily:high, stock_daily:low, stock_daily:close,
>>>> stock_daily:volume, stock_info:stock, stock_info:ticker" tsco
>>>> hdfs://rhes564:9000/data/stocks/tsco.csv
>>>>
>>>> This is in Hbase table for a given day
>>>>
>>>> hbase(main):090:0> scan 'tsco', LIMIT => 10
>>>> ROW                                                    COLUMN+CELL
>>>>  1-Apr-08
>>>> column=stock_daily:close, timestamp=1475492248665, value=405.25
>>>>  1-Apr-08
>>>> column=stock_daily:high, timestamp=1475492248665, value=406.75
>>>>  1-Apr-08
>>>> column=stock_daily:low, timestamp=1475492248665, value=379.25
>>>>  1-Apr-08
>>>> column=stock_daily:open, timestamp=1475492248665, value=380.00
>>>>  1-Apr-08
>>>> column=stock_daily:volume, timestamp=1475492248665, value=49664486
>>>>  1-Apr-08
>>>> column=stock_info:stock, timestamp=1475492248665, value=TESCO PLC
>>>>  1-Apr-08
>>>> column=stock_info:ticker, timestamp=1475492248665, value=TSCO
>>>>
>>>>
>>>> But I also have this at the bottom
>>>>
>>>>   Date
>>>> column=stock_daily:close, timestamp=1475491189158, value=Close
>>>>  Date
>>>> column=stock_daily:high, timestamp=1475491189158, value=High
>>>>  Date
>>>> column=stock_daily:low, timestamp=1475491189158, value=Low
>>>>  Date
>>>> column=stock_daily:open, timestamp=1475491189158, value=Open
>>>>  Date
>>>> column=stock_daily:volume, timestamp=1475491189158, value=Volume
>>>>  Date
>>>> column=stock_info:stock, timestamp=1475491189158, value=TESCO PLC
>>>>  Date
>>>> column=stock_info:ticker, timestamp=1475491189158, value=TSCO
>>>>
>>>> Sounds like the table header?
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> Dr Mich Talebzadeh
>>>>
>>>>
>>>>
>>>> LinkedIn * https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>>> <https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>>>>
>>>>
>>>>
>>>> http://talebzadehmich.wordpress.com
>>>>
>>>>
>>>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>>>> any loss, damage or destruction of data or any other property which may
>>>> arise from relying on this email's technical content is explicitly
>>>> disclaimed. The author will in no case be liable for any monetary damages
>>>> arising from such loss, damage or destruction.
>>>>
>>>>
>>>>
>>>> On 3 October 2016 at 11:24, ayan guha <gu...@gmail.com> wrote:
>>>>
>>>>> I am not well versed with importtsv, but you can create a CSV file
>>>>> using a simple spark program to create first column as ticker+tradedate. I
>>>>> remember doing similar manipulation to create row key format in pig.
>>>>> On 3 Oct 2016 20:40, "Mich Talebzadeh" <mi...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Thanks Ayan,
>>>>>>
>>>>>> How do you specify ticker+rtrade as row key in the below
>>>>>>
>>>>>> hbase org.apache.hadoop.hbase.mapreduce.ImportTsv
>>>>>> -Dimporttsv.separator=',' -Dimporttsv.columns="HBASE_ROW_KEY,
>>>>>> stock_daily:ticker, stock_daily:tradedate, stock_daily:open,stock_daily:h
>>>>>> igh,stock_daily:low,stock_daily:close,stock_daily:volume" tsco
>>>>>> hdfs://rhes564:9000/data/stocks/tsco.csv
>>>>>>
>>>>>> I always thought that Hbase will take the first column as row key so
>>>>>> it takes stock as the row key which is tsco plc for every row!
>>>>>>
>>>>>> Does row key need to be unique?
>>>>>>
>>>>>> cheers
>>>>>>
>>>>>>
>>>>>> Dr Mich Talebzadeh
>>>>>>
>>>>>>
>>>>>>
>>>>>> LinkedIn * https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>>>>> <https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>>>>>>
>>>>>>
>>>>>>
>>>>>> http://talebzadehmich.wordpress.com
>>>>>>
>>>>>>
>>>>>> *Disclaimer:* Use it at your own risk. Any and all responsibility
>>>>>> for any loss, damage or destruction of data or any other property which may
>>>>>> arise from relying on this email's technical content is explicitly
>>>>>> disclaimed. The author will in no case be liable for any monetary damages
>>>>>> arising from such loss, damage or destruction.
>>>>>>
>>>>>>
>>>>>>
>>>>>> On 3 October 2016 at 10:30, ayan guha <gu...@gmail.com> wrote:
>>>>>>
>>>>>>> Hi Mitch
>>>>>>>
>>>>>>> It is more to do with hbase than spark.
>>>>>>>
>>>>>>> Row key can be anything, yes but essentially what you are doing is
>>>>>>> insert and update tesco PLC row. Given your schema, ticker+trade date seems
>>>>>>> to be a good row key
>>>>>>> On 3 Oct 2016 18:25, "Mich Talebzadeh" <mi...@gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> thanks again.
>>>>>>>>
>>>>>>>> I added that jar file to the classpath and that part worked.
>>>>>>>>
>>>>>>>> I was using spark shell so I have to use spark-submit for it to be
>>>>>>>> able to interact with map-reduce job.
>>>>>>>>
>>>>>>>> BTW when I use the command line utility ImportTsv  to load a file
>>>>>>>> into Hbase with the following table format
>>>>>>>>
>>>>>>>> describe 'marketDataHbase'
>>>>>>>> Table marketDataHbase is ENABLED
>>>>>>>> marketDataHbase
>>>>>>>> COLUMN FAMILIES DESCRIPTION
>>>>>>>> {NAME => 'price_info', BLOOMFILTER => 'ROW', VERSIONS => '1',
>>>>>>>> IN_MEMORY => 'false', KEEP_DELETED_CELLS => 'FALSE', DATA_BLOCK_ENCODING =>
>>>>>>>> 'NONE', TTL => 'FOREVER', COMPRESSION => 'NONE', MIN_VERSIONS => '0', BLOCKC
>>>>>>>> ACHE => 'true', BLOCKSIZE => '65536', REPLICATION_SCOPE => '0'}
>>>>>>>> 1 row(s) in 0.0930 seconds
>>>>>>>>
>>>>>>>>
>>>>>>>> hbase org.apache.hadoop.hbase.mapreduce.ImportTsv
>>>>>>>> -Dimporttsv.separator=',' -Dimporttsv.columns="HBASE_ROW_KEY,
>>>>>>>> stock_daily:ticker, stock_daily:tradedate, stock_daily:open,stock_daily:h
>>>>>>>> igh,stock_daily:low,stock_daily:close,stock_daily:volume" tsco
>>>>>>>> hdfs://rhes564:9000/data/stocks/tsco.csv
>>>>>>>>
>>>>>>>> There are with 1200 rows in the csv file,* but it only loads the
>>>>>>>> first row!*
>>>>>>>>
>>>>>>>> scan 'tsco'
>>>>>>>> ROW                                                    COLUMN+CELL
>>>>>>>>  Tesco PLC
>>>>>>>> column=stock_daily:close, timestamp=1475447365118, value=325.25
>>>>>>>>  Tesco PLC
>>>>>>>> column=stock_daily:high, timestamp=1475447365118, value=332.00
>>>>>>>>  Tesco PLC
>>>>>>>> column=stock_daily:low, timestamp=1475447365118, value=324.00
>>>>>>>>  Tesco PLC
>>>>>>>> column=stock_daily:open, timestamp=1475447365118, value=331.75
>>>>>>>>  Tesco PLC
>>>>>>>> column=stock_daily:ticker, timestamp=1475447365118, value=TSCO
>>>>>>>>  Tesco PLC
>>>>>>>> column=stock_daily:tradedate, timestamp=1475447365118, value= 3-Jan-06
>>>>>>>>  Tesco PLC
>>>>>>>> column=stock_daily:volume, timestamp=1475447365118, value=46935045
>>>>>>>> 1 row(s) in 0.0390 seconds
>>>>>>>>
>>>>>>>> Is this because the hbase_row_key --> Tesco PLC is the same for
>>>>>>>> all? I thought that the row key can be anything.
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> Dr Mich Talebzadeh
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> LinkedIn * https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>>>>>>> <https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> http://talebzadehmich.wordpress.com
>>>>>>>>
>>>>>>>>
>>>>>>>> *Disclaimer:* Use it at your own risk. Any and all responsibility
>>>>>>>> for any loss, damage or destruction of data or any other property which may
>>>>>>>> arise from relying on this email's technical content is explicitly
>>>>>>>> disclaimed. The author will in no case be liable for any monetary damages
>>>>>>>> arising from such loss, damage or destruction.
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On 3 October 2016 at 07:44, Benjamin Kim <bb...@gmail.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> We installed Apache Spark 1.6.0 at the time alongside CDH 5.4.8
>>>>>>>>> because Cloudera only had Spark 1.3.0 at the time, and we wanted to use
>>>>>>>>> Spark 1.6.0’s features. We borrowed the /etc/spark/conf/spark-env.sh file
>>>>>>>>> that Cloudera generated because it was customized to add jars first from
>>>>>>>>> paths listed in the file /etc/spark/conf/classpath.txt. So, we entered the
>>>>>>>>> path for the htrace jar into the /etc/spark/conf/classpath.txt file. Then,
>>>>>>>>> it worked. We could read/write to HBase.
>>>>>>>>>
>>>>>>>>> On Oct 2, 2016, at 12:52 AM, Mich Talebzadeh <
>>>>>>>>> mich.talebzadeh@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>> Thanks Ben
>>>>>>>>>
>>>>>>>>> The thing is I am using Spark 2 and no stack from CDH!
>>>>>>>>>
>>>>>>>>> Is this approach to reading/writing to Hbase specific to Cloudera?
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Dr Mich Talebzadeh
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> LinkedIn * https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>>>>>>>> <https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> http://talebzadehmich.wordpress.com
>>>>>>>>>
>>>>>>>>> *Disclaimer:* Use it at your own risk. Any and all responsibility
>>>>>>>>> for any loss, damage or destruction of data or any other property which may
>>>>>>>>> arise from relying on this email's technical content is explicitly
>>>>>>>>> disclaimed. The author will in no case be liable for any monetary damages
>>>>>>>>> arising from such loss, damage or destruction.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On 1 October 2016 at 23:39, Benjamin Kim <bb...@gmail.com>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> Mich,
>>>>>>>>>>
>>>>>>>>>> I know up until CDH 5.4 we had to add the HTrace jar to the
>>>>>>>>>> classpath to make it work using the command below. But after upgrading to
>>>>>>>>>> CDH 5.7, it became unnecessary.
>>>>>>>>>>
>>>>>>>>>> echo "/opt/cloudera/parcels/CDH/jars/htrace-core-3.2.0-incubating.jar"
>>>>>>>>>> >> /etc/spark/conf/classpath.txt
>>>>>>>>>>
>>>>>>>>>> Hope this helps.
>>>>>>>>>>
>>>>>>>>>> Cheers,
>>>>>>>>>> Ben
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Oct 1, 2016, at 3:22 PM, Mich Talebzadeh <
>>>>>>>>>> mich.talebzadeh@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>> Trying bulk load using Hfiles in Spark as below example:
>>>>>>>>>>
>>>>>>>>>> import org.apache.spark._
>>>>>>>>>> import org.apache.spark.rdd.NewHadoopRDD
>>>>>>>>>> import org.apache.hadoop.hbase.{HBaseConfiguration,
>>>>>>>>>> HTableDescriptor}
>>>>>>>>>> import org.apache.hadoop.hbase.client.HBaseAdmin
>>>>>>>>>> import org.apache.hadoop.hbase.mapreduce.TableInputFormat
>>>>>>>>>> import org.apache.hadoop.fs.Path;
>>>>>>>>>> import org.apache.hadoop.hbase.HColumnDescriptor
>>>>>>>>>> import org.apache.hadoop.hbase.util.Bytes
>>>>>>>>>> import org.apache.hadoop.hbase.client.Put;
>>>>>>>>>> import org.apache.hadoop.hbase.client.HTable;
>>>>>>>>>> import org.apache.hadoop.hbase.mapred.TableOutputFormat
>>>>>>>>>> import org.apache.hadoop.mapred.JobConf
>>>>>>>>>> import org.apache.hadoop.hbase.io.ImmutableBytesWritable
>>>>>>>>>> import org.apache.hadoop.mapreduce.Job
>>>>>>>>>> import org.apache.hadoop.mapreduce.lib.input.FileInputFormat
>>>>>>>>>> import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat
>>>>>>>>>> import org.apache.hadoop.hbase.KeyValue
>>>>>>>>>> import org.apache.hadoop.hbase.mapreduce.HFileOutputFormat
>>>>>>>>>> import org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles
>>>>>>>>>>
>>>>>>>>>> So far no issues.
>>>>>>>>>>
>>>>>>>>>> Then I do
>>>>>>>>>>
>>>>>>>>>> val conf = HBaseConfiguration.create()
>>>>>>>>>> conf: org.apache.hadoop.conf.Configuration = Configuration:
>>>>>>>>>> core-default.xml, core-site.xml, mapred-default.xml, mapred-site.xml,
>>>>>>>>>> yarn-default.xml, yarn-site.xml, hbase-default.xml, hbase-site.xml
>>>>>>>>>> val tableName = "testTable"
>>>>>>>>>> tableName: String = testTable
>>>>>>>>>>
>>>>>>>>>> But this one fails:
>>>>>>>>>>
>>>>>>>>>> scala> val table = new HTable(conf, tableName)
>>>>>>>>>> java.io.IOException: java.lang.reflect.InvocationTargetException
>>>>>>>>>>   at org.apache.hadoop.hbase.client.ConnectionFactory.createConne
>>>>>>>>>> ction(ConnectionFactory.java:240)
>>>>>>>>>>   at org.apache.hadoop.hbase.client.ConnectionManager.createConne
>>>>>>>>>> ction(ConnectionManager.java:431)
>>>>>>>>>>   at org.apache.hadoop.hbase.client.ConnectionManager.createConne
>>>>>>>>>> ction(ConnectionManager.java:424)
>>>>>>>>>>   at org.apache.hadoop.hbase.client.ConnectionManager.getConnecti
>>>>>>>>>> onInternal(ConnectionManager.java:302)
>>>>>>>>>>   at org.apache.hadoop.hbase.client.HTable.<init>(HTable.java:185
>>>>>>>>>> )
>>>>>>>>>>   at org.apache.hadoop.hbase.client.HTable.<init>(HTable.java:151
>>>>>>>>>> )
>>>>>>>>>>   ... 52 elided
>>>>>>>>>> Caused by: java.lang.reflect.InvocationTargetException:
>>>>>>>>>> java.lang.NoClassDefFoundError: org/apache/htrace/Trace
>>>>>>>>>>   at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native
>>>>>>>>>> Method)
>>>>>>>>>>   at sun.reflect.NativeConstructorAccessorImpl.newInstance(Native
>>>>>>>>>> ConstructorAccessorImpl.java:62)
>>>>>>>>>>   at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(De
>>>>>>>>>> legatingConstructorAccessorImpl.java:45)
>>>>>>>>>>   at java.lang.reflect.Constructor.newInstance(Constructor.java:4
>>>>>>>>>> 23)
>>>>>>>>>>   at org.apache.hadoop.hbase.client.ConnectionFactory.createConne
>>>>>>>>>> ction(ConnectionFactory.java:238)
>>>>>>>>>>   ... 57 more
>>>>>>>>>> Caused by: java.lang.NoClassDefFoundError:
>>>>>>>>>> org/apache/htrace/Trace
>>>>>>>>>>   at org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper.exist
>>>>>>>>>> s(RecoverableZooKeeper.java:216)
>>>>>>>>>>   at org.apache.hadoop.hbase.zookeeper.ZKUtil.checkExists(ZKUtil.
>>>>>>>>>> java:419)
>>>>>>>>>>   at org.apache.hadoop.hbase.zookeeper.ZKClusterId.readClusterIdZ
>>>>>>>>>> Node(ZKClusterId.java:65)
>>>>>>>>>>   at org.apache.hadoop.hbase.client.ZooKeeperRegistry.getClusterI
>>>>>>>>>> d(ZooKeeperRegistry.java:105)
>>>>>>>>>>   at org.apache.hadoop.hbase.client.ConnectionManager$HConnection
>>>>>>>>>> Implementation.retrieveClusterId(ConnectionManager.java:905)
>>>>>>>>>>   at org.apache.hadoop.hbase.client.ConnectionManager$HConnection
>>>>>>>>>> Implementation.<init>(ConnectionManager.java:648)
>>>>>>>>>>   ... 62 more
>>>>>>>>>> Caused by: java.lang.ClassNotFoundException:
>>>>>>>>>> org.apache.htrace.Trace
>>>>>>>>>>
>>>>>>>>>> I have got all the jar files in spark-defaults.conf
>>>>>>>>>>
>>>>>>>>>> spark.driver.extraClassPath      /home/hduser/jars/ojdbc6.jar:/
>>>>>>>>>> home/hduser/jars/jconn4.jar:/home/hduser/jars/hbase-client-1
>>>>>>>>>> .2.3.jar:/home/hduser/jars/hbase-server-1.2.3.jar:/home/hdus
>>>>>>>>>> er/jars/hbase-common-1.2.3.jar:/home/hduser/jars/hbase-proto
>>>>>>>>>> col-1.2.3.jar:/home/hduser/jars/htrace-core-3.0.4.jar:/home/
>>>>>>>>>> hduser/jars/hive-hbase-handler-2.1.0.jar
>>>>>>>>>> spark.executor.extraClassPath    /home/hduser/jars/ojdbc6.jar:/
>>>>>>>>>> home/hduser/jars/jconn4.jar:/home/hduser/jars/hbase-client-1
>>>>>>>>>> .2.3.jar:/home/hduser/jars/hbase-server-1.2.3.jar:/home/hdus
>>>>>>>>>> er/jars/hbase-common-1.2.3.jar:/home/hduser/jars/hbase-proto
>>>>>>>>>> col-1.2.3.jar:/home/hduser/jars/htrace-core-3.0.4.jar:/home/
>>>>>>>>>> hduser/jars/hive-hbase-handler-2.1.0.jar
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> and also in Spark shell where I test the code
>>>>>>>>>>
>>>>>>>>>>  --jars /home/hduser/jars/hbase-client
>>>>>>>>>> -1.2.3.jar,/home/hduser/jars/hbase-server-1.2.3.jar,/home/hd
>>>>>>>>>> user/jars/hbase-common-1.2.3.jar,/home/hduser/jars/hbase-pro
>>>>>>>>>> tocol-1.2.3.jar,/home/hduser/jars/htrace-core-3.0.4.jar,/hom
>>>>>>>>>> e/hduser/jars/hive-hbase-handler-2.1.0.jar'
>>>>>>>>>>
>>>>>>>>>> So any ideas will be appreciated.
>>>>>>>>>>
>>>>>>>>>> Thanks
>>>>>>>>>>
>>>>>>>>>> Dr Mich Talebzadeh
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> LinkedIn * https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>>>>>>>>> <https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> http://talebzadehmich.wordpress.com
>>>>>>>>>>
>>>>>>>>>> *Disclaimer:* Use it at your own risk. Any and all
>>>>>>>>>> responsibility for any loss, damage or destruction of data or any other
>>>>>>>>>> property which may arise from relying on this email's technical content is
>>>>>>>>>> explicitly disclaimed. The author will in no case be liable for any
>>>>>>>>>> monetary damages arising from such loss, damage or destruction.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>
>>>>
>>>
>>>
>>> --
>>> Best Regards,
>>> Ayan Guha
>>>
>>
>>
>

Re: Loading data into Hbase table throws NoClassDefFoundError: org/apache/htrace/Trace error

Posted by Mich Talebzadeh <mi...@gmail.com>.
Or maybe add ticker+date, similar to this:


[image: Inline images 1]

So the new row key would be TSCO-1-Apr-08.

Both Date and ticker will stay as they are, as column family attributes?
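
Just to make the intent concrete, a minimal sketch of what a single put with
that composite row key might look like through the HBase client API (the
table name tsco and the cell values are simply lifted from the scan output
elsewhere in this thread; this is illustration only, not the loading code):

import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
import org.apache.hadoop.hbase.client.{ConnectionFactory, Put}
import org.apache.hadoop.hbase.util.Bytes

val connection = ConnectionFactory.createConnection(HBaseConfiguration.create())
val table = connection.getTable(TableName.valueOf("tsco"))

// composite row key ticker-date; Date and ticker remain ordinary cells in their families
val put = new Put(Bytes.toBytes("TSCO-1-Apr-08"))
put.addColumn(Bytes.toBytes("stock_info"), Bytes.toBytes("ticker"), Bytes.toBytes("TSCO"))
put.addColumn(Bytes.toBytes("stock_info"), Bytes.toBytes("stock"), Bytes.toBytes("TESCO PLC"))
put.addColumn(Bytes.toBytes("stock_daily"), Bytes.toBytes("Date"), Bytes.toBytes("1-Apr-08"))
put.addColumn(Bytes.toBytes("stock_daily"), Bytes.toBytes("close"), Bytes.toBytes("405.25"))
table.put(put)

table.close()
connection.close()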



Dr Mich Talebzadeh



LinkedIn * https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
<https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.



On 3 October 2016 at 14:32, Mich Talebzadeh <mi...@gmail.com>
wrote:

> with ticker+date I can c reate something like below for row key
>
> TSCO_1-Apr-08
>
>
> or TSCO1-Apr-08
>
> if I understood you correctly
>
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> <https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>
>
>
> http://talebzadehmich.wordpress.com
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
> On 3 October 2016 at 13:13, ayan guha <gu...@gmail.com> wrote:
>
>> Hi
>>
>> Looks like you are saving to new.csv but still loading tsco.csv? Its
>> definitely the header.
>>
>> Suggestion: ticker+date as row key has following benefits:
>>
>> 1. using ticker+date as row key will enable you to hold multiple ticker
>> in this single hbase table. (Think composite primary key)
>> 2. Using date itself as row key will lead to hotspots (Look up hotspoting
>> due to monotonically increasing row key). To distribute the load, it is
>> suggested to use a salting. Ticker can be used as a natural salt in this
>> case.
>> 3. Also, you may want to hash the rowkey value to give it little more
>> flexible (Think surrogate key).
>>
>>
>>
>> On Mon, Oct 3, 2016 at 10:17 PM, Mich Talebzadeh <
>> mich.talebzadeh@gmail.com> wrote:
>>
>>> Hi Ayan,
>>>
>>> Sounds like the row key has to be unique much like a primary key in RDBMS
>>>
>>> This is what I download as a csv for stock from Google Finance
>>>
>>>   Date Open High Low Close Volume
>>> 27-Sep-16 177.4 177.75 172.5 177.75 24117196
>>>
>>>
>>> So What I do I add the stock and ticker myself to end of the row via
>>> shell script and get rid of header
>>>
>>> sed -i 1d tsco.csv; cat tsco.csv|awk '{print $0,",TESCO PLC,TSCO"}' >
>>> new.csv
>>>
>>> The New table has two column families: stock_price, stock_info and row
>>> key date (one row per date)
>>>
>>> This creates a new csv file with two additional columns appended to the
>>> end of each line
>>>
>>> Then I run the following command
>>>
>>> $HBASE_HOME/bin/hbase org.apache.hadoop.hbase.mapreduce.ImportTsv
>>> -Dimporttsv.separator=',' -Dimporttsv.columns="HBASE_ROW_KEY,
>>> stock_daily:open, stock_daily:high, stock_daily:low, stock_daily:close,
>>> stock_daily:volume, stock_info:stock, stock_info:ticker" tsco
>>> hdfs://rhes564:9000/data/stocks/tsco.csv
>>>
>>> This is in Hbase table for a given day
>>>
>>> hbase(main):090:0> scan 'tsco', LIMIT => 10
>>> ROW                                                    COLUMN+CELL
>>>  1-Apr-08
>>> column=stock_daily:close, timestamp=1475492248665, value=405.25
>>>  1-Apr-08
>>> column=stock_daily:high, timestamp=1475492248665, value=406.75
>>>  1-Apr-08
>>> column=stock_daily:low, timestamp=1475492248665, value=379.25
>>>  1-Apr-08
>>> column=stock_daily:open, timestamp=1475492248665, value=380.00
>>>  1-Apr-08
>>> column=stock_daily:volume, timestamp=1475492248665, value=49664486
>>>  1-Apr-08
>>> column=stock_info:stock, timestamp=1475492248665, value=TESCO PLC
>>>  1-Apr-08
>>> column=stock_info:ticker, timestamp=1475492248665, value=TSCO
>>>
>>>
>>> But I also have this at the bottom
>>>
>>>   Date
>>> column=stock_daily:close, timestamp=1475491189158, value=Close
>>>  Date
>>> column=stock_daily:high, timestamp=1475491189158, value=High
>>>  Date
>>> column=stock_daily:low, timestamp=1475491189158, value=Low
>>>  Date
>>> column=stock_daily:open, timestamp=1475491189158, value=Open
>>>  Date
>>> column=stock_daily:volume, timestamp=1475491189158, value=Volume
>>>  Date
>>> column=stock_info:stock, timestamp=1475491189158, value=TESCO PLC
>>>  Date
>>> column=stock_info:ticker, timestamp=1475491189158, value=TSCO
>>>
>>> Sounds like the table header?
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>> Dr Mich Talebzadeh
>>>
>>>
>>>
>>> LinkedIn * https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>> <https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>>>
>>>
>>>
>>> http://talebzadehmich.wordpress.com
>>>
>>>
>>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>>> any loss, damage or destruction of data or any other property which may
>>> arise from relying on this email's technical content is explicitly
>>> disclaimed. The author will in no case be liable for any monetary damages
>>> arising from such loss, damage or destruction.
>>>
>>>
>>>
>>> On 3 October 2016 at 11:24, ayan guha <gu...@gmail.com> wrote:
>>>
>>>> I am not well versed with importtsv, but you can create a CSV file
>>>> using a simple spark program to create first column as ticker+tradedate. I
>>>> remember doing similar manipulation to create row key format in pig.
>>>> On 3 Oct 2016 20:40, "Mich Talebzadeh" <mi...@gmail.com>
>>>> wrote:
>>>>
>>>>> Thanks Ayan,
>>>>>
>>>>> How do you specify ticker+rtrade as row key in the below
>>>>>
>>>>> hbase org.apache.hadoop.hbase.mapreduce.ImportTsv
>>>>> -Dimporttsv.separator=',' -Dimporttsv.columns="HBASE_ROW_KEY,
>>>>> stock_daily:ticker, stock_daily:tradedate, stock_daily:open,stock_daily:h
>>>>> igh,stock_daily:low,stock_daily:close,stock_daily:volume" tsco
>>>>> hdfs://rhes564:9000/data/stocks/tsco.csv
>>>>>
>>>>> I always thought that Hbase will take the first column as row key so
>>>>> it takes stock as the row key which is tsco plc for every row!
>>>>>
>>>>> Does row key need to be unique?
>>>>>
>>>>> cheers
>>>>>
>>>>>
>>>>> Dr Mich Talebzadeh
>>>>>
>>>>>
>>>>>
>>>>> LinkedIn * https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>>>> <https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>>>>>
>>>>>
>>>>>
>>>>> http://talebzadehmich.wordpress.com
>>>>>
>>>>>
>>>>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>>>>> any loss, damage or destruction of data or any other property which may
>>>>> arise from relying on this email's technical content is explicitly
>>>>> disclaimed. The author will in no case be liable for any monetary damages
>>>>> arising from such loss, damage or destruction.
>>>>>
>>>>>
>>>>>
>>>>> On 3 October 2016 at 10:30, ayan guha <gu...@gmail.com> wrote:
>>>>>
>>>>>> Hi Mitch
>>>>>>
>>>>>> It is more to do with hbase than spark.
>>>>>>
>>>>>> Row key can be anything, yes but essentially what you are doing is
>>>>>> insert and update tesco PLC row. Given your schema, ticker+trade date seems
>>>>>> to be a good row key
>>>>>> On 3 Oct 2016 18:25, "Mich Talebzadeh" <mi...@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> thanks again.
>>>>>>>
>>>>>>> I added that jar file to the classpath and that part worked.
>>>>>>>
>>>>>>> I was using spark shell so I have to use spark-submit for it to be
>>>>>>> able to interact with map-reduce job.
>>>>>>>
>>>>>>> BTW when I use the command line utility ImportTsv  to load a file
>>>>>>> into Hbase with the following table format
>>>>>>>
>>>>>>> describe 'marketDataHbase'
>>>>>>> Table marketDataHbase is ENABLED
>>>>>>> marketDataHbase
>>>>>>> COLUMN FAMILIES DESCRIPTION
>>>>>>> {NAME => 'price_info', BLOOMFILTER => 'ROW', VERSIONS => '1',
>>>>>>> IN_MEMORY => 'false', KEEP_DELETED_CELLS => 'FALSE', DATA_BLOCK_ENCODING =>
>>>>>>> 'NONE', TTL => 'FOREVER', COMPRESSION => 'NONE', MIN_VERSIONS => '0', BLOCKC
>>>>>>> ACHE => 'true', BLOCKSIZE => '65536', REPLICATION_SCOPE => '0'}
>>>>>>> 1 row(s) in 0.0930 seconds
>>>>>>>
>>>>>>>
>>>>>>> hbase org.apache.hadoop.hbase.mapreduce.ImportTsv
>>>>>>> -Dimporttsv.separator=',' -Dimporttsv.columns="HBASE_ROW_KEY,
>>>>>>> stock_daily:ticker, stock_daily:tradedate, stock_daily:open,stock_daily:h
>>>>>>> igh,stock_daily:low,stock_daily:close,stock_daily:volume" tsco
>>>>>>> hdfs://rhes564:9000/data/stocks/tsco.csv
>>>>>>>
>>>>>>> There are with 1200 rows in the csv file,* but it only loads the
>>>>>>> first row!*
>>>>>>>
>>>>>>> scan 'tsco'
>>>>>>> ROW                                                    COLUMN+CELL
>>>>>>>  Tesco PLC
>>>>>>> column=stock_daily:close, timestamp=1475447365118, value=325.25
>>>>>>>  Tesco PLC
>>>>>>> column=stock_daily:high, timestamp=1475447365118, value=332.00
>>>>>>>  Tesco PLC
>>>>>>> column=stock_daily:low, timestamp=1475447365118, value=324.00
>>>>>>>  Tesco PLC
>>>>>>> column=stock_daily:open, timestamp=1475447365118, value=331.75
>>>>>>>  Tesco PLC
>>>>>>> column=stock_daily:ticker, timestamp=1475447365118, value=TSCO
>>>>>>>  Tesco PLC
>>>>>>> column=stock_daily:tradedate, timestamp=1475447365118, value= 3-Jan-06
>>>>>>>  Tesco PLC
>>>>>>> column=stock_daily:volume, timestamp=1475447365118, value=46935045
>>>>>>> 1 row(s) in 0.0390 seconds
>>>>>>>
>>>>>>> Is this because the hbase_row_key --> Tesco PLC is the same for all?
>>>>>>> I thought that the row key can be anything.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> Dr Mich Talebzadeh
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> LinkedIn * https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>>>>>> <https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> http://talebzadehmich.wordpress.com
>>>>>>>
>>>>>>>
>>>>>>> *Disclaimer:* Use it at your own risk. Any and all responsibility
>>>>>>> for any loss, damage or destruction of data or any other property which may
>>>>>>> arise from relying on this email's technical content is explicitly
>>>>>>> disclaimed. The author will in no case be liable for any monetary damages
>>>>>>> arising from such loss, damage or destruction.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On 3 October 2016 at 07:44, Benjamin Kim <bb...@gmail.com> wrote:
>>>>>>>
>>>>>>>> We installed Apache Spark 1.6.0 at the time alongside CDH 5.4.8
>>>>>>>> because Cloudera only had Spark 1.3.0 at the time, and we wanted to use
>>>>>>>> Spark 1.6.0’s features. We borrowed the /etc/spark/conf/spark-env.sh file
>>>>>>>> that Cloudera generated because it was customized to add jars first from
>>>>>>>> paths listed in the file /etc/spark/conf/classpath.txt. So, we entered the
>>>>>>>> path for the htrace jar into the /etc/spark/conf/classpath.txt file. Then,
>>>>>>>> it worked. We could read/write to HBase.
>>>>>>>>
>>>>>>>> On Oct 2, 2016, at 12:52 AM, Mich Talebzadeh <
>>>>>>>> mich.talebzadeh@gmail.com> wrote:
>>>>>>>>
>>>>>>>> Thanks Ben
>>>>>>>>
>>>>>>>> The thing is I am using Spark 2 and no stack from CDH!
>>>>>>>>
>>>>>>>> Is this approach to reading/writing to Hbase specific to Cloudera?
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> Dr Mich Talebzadeh
>>>>>>>>
>>>>>>>>
>>>>>>>> LinkedIn * https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>>>>>>> <https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>>>>>>>>
>>>>>>>>
>>>>>>>> http://talebzadehmich.wordpress.com
>>>>>>>>
>>>>>>>> *Disclaimer:* Use it at your own risk. Any and all responsibility
>>>>>>>> for any loss, damage or destruction of data or any other property which may
>>>>>>>> arise from relying on this email's technical content is explicitly
>>>>>>>> disclaimed. The author will in no case be liable for any monetary damages
>>>>>>>> arising from such loss, damage or destruction.
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On 1 October 2016 at 23:39, Benjamin Kim <bb...@gmail.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> Mich,
>>>>>>>>>
>>>>>>>>> I know up until CDH 5.4 we had to add the HTrace jar to the
>>>>>>>>> classpath to make it work using the command below. But after upgrading to
>>>>>>>>> CDH 5.7, it became unnecessary.
>>>>>>>>>
>>>>>>>>> echo "/opt/cloudera/parcels/CDH/jars/htrace-core-3.2.0-incubating.jar"
>>>>>>>>> >> /etc/spark/conf/classpath.txt
>>>>>>>>>
>>>>>>>>> Hope this helps.
>>>>>>>>>
>>>>>>>>> Cheers,
>>>>>>>>> Ben
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Oct 1, 2016, at 3:22 PM, Mich Talebzadeh <
>>>>>>>>> mich.talebzadeh@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>> Trying bulk load using Hfiles in Spark as below example:
>>>>>>>>>
>>>>>>>>> import org.apache.spark._
>>>>>>>>> import org.apache.spark.rdd.NewHadoopRDD
>>>>>>>>> import org.apache.hadoop.hbase.{HBaseConfiguration,
>>>>>>>>> HTableDescriptor}
>>>>>>>>> import org.apache.hadoop.hbase.client.HBaseAdmin
>>>>>>>>> import org.apache.hadoop.hbase.mapreduce.TableInputFormat
>>>>>>>>> import org.apache.hadoop.fs.Path;
>>>>>>>>> import org.apache.hadoop.hbase.HColumnDescriptor
>>>>>>>>> import org.apache.hadoop.hbase.util.Bytes
>>>>>>>>> import org.apache.hadoop.hbase.client.Put;
>>>>>>>>> import org.apache.hadoop.hbase.client.HTable;
>>>>>>>>> import org.apache.hadoop.hbase.mapred.TableOutputFormat
>>>>>>>>> import org.apache.hadoop.mapred.JobConf
>>>>>>>>> import org.apache.hadoop.hbase.io.ImmutableBytesWritable
>>>>>>>>> import org.apache.hadoop.mapreduce.Job
>>>>>>>>> import org.apache.hadoop.mapreduce.lib.input.FileInputFormat
>>>>>>>>> import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat
>>>>>>>>> import org.apache.hadoop.hbase.KeyValue
>>>>>>>>> import org.apache.hadoop.hbase.mapreduce.HFileOutputFormat
>>>>>>>>> import org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles
>>>>>>>>>
>>>>>>>>> So far no issues.
>>>>>>>>>
>>>>>>>>> Then I do
>>>>>>>>>
>>>>>>>>> val conf = HBaseConfiguration.create()
>>>>>>>>> conf: org.apache.hadoop.conf.Configuration = Configuration:
>>>>>>>>> core-default.xml, core-site.xml, mapred-default.xml, mapred-site.xml,
>>>>>>>>> yarn-default.xml, yarn-site.xml, hbase-default.xml, hbase-site.xml
>>>>>>>>> val tableName = "testTable"
>>>>>>>>> tableName: String = testTable
>>>>>>>>>
>>>>>>>>> But this one fails:
>>>>>>>>>
>>>>>>>>> scala> val table = new HTable(conf, tableName)
>>>>>>>>> java.io.IOException: java.lang.reflect.InvocationTargetException
>>>>>>>>>   at org.apache.hadoop.hbase.client.ConnectionFactory.createConne
>>>>>>>>> ction(ConnectionFactory.java:240)
>>>>>>>>>   at org.apache.hadoop.hbase.client.ConnectionManager.createConne
>>>>>>>>> ction(ConnectionManager.java:431)
>>>>>>>>>   at org.apache.hadoop.hbase.client.ConnectionManager.createConne
>>>>>>>>> ction(ConnectionManager.java:424)
>>>>>>>>>   at org.apache.hadoop.hbase.client.ConnectionManager.getConnecti
>>>>>>>>> onInternal(ConnectionManager.java:302)
>>>>>>>>>   at org.apache.hadoop.hbase.client.HTable.<init>(HTable.java:185)
>>>>>>>>>   at org.apache.hadoop.hbase.client.HTable.<init>(HTable.java:151)
>>>>>>>>>   ... 52 elided
>>>>>>>>> Caused by: java.lang.reflect.InvocationTargetException:
>>>>>>>>> java.lang.NoClassDefFoundError: org/apache/htrace/Trace
>>>>>>>>>   at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native
>>>>>>>>> Method)
>>>>>>>>>   at sun.reflect.NativeConstructorAccessorImpl.newInstance(Native
>>>>>>>>> ConstructorAccessorImpl.java:62)
>>>>>>>>>   at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(De
>>>>>>>>> legatingConstructorAccessorImpl.java:45)
>>>>>>>>>   at java.lang.reflect.Constructor.newInstance(Constructor.java:4
>>>>>>>>> 23)
>>>>>>>>>   at org.apache.hadoop.hbase.client.ConnectionFactory.createConne
>>>>>>>>> ction(ConnectionFactory.java:238)
>>>>>>>>>   ... 57 more
>>>>>>>>> Caused by: java.lang.NoClassDefFoundError: org/apache/htrace/Trace
>>>>>>>>>   at org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper.exist
>>>>>>>>> s(RecoverableZooKeeper.java:216)
>>>>>>>>>   at org.apache.hadoop.hbase.zookeeper.ZKUtil.checkExists(ZKUtil.
>>>>>>>>> java:419)
>>>>>>>>>   at org.apache.hadoop.hbase.zookeeper.ZKClusterId.readClusterIdZ
>>>>>>>>> Node(ZKClusterId.java:65)
>>>>>>>>>   at org.apache.hadoop.hbase.client.ZooKeeperRegistry.getClusterI
>>>>>>>>> d(ZooKeeperRegistry.java:105)
>>>>>>>>>   at org.apache.hadoop.hbase.client.ConnectionManager$HConnection
>>>>>>>>> Implementation.retrieveClusterId(ConnectionManager.java:905)
>>>>>>>>>   at org.apache.hadoop.hbase.client.ConnectionManager$HConnection
>>>>>>>>> Implementation.<init>(ConnectionManager.java:648)
>>>>>>>>>   ... 62 more
>>>>>>>>> Caused by: java.lang.ClassNotFoundException:
>>>>>>>>> org.apache.htrace.Trace
>>>>>>>>>
>>>>>>>>> I have got all the jar files in spark-defaults.conf
>>>>>>>>>
>>>>>>>>> spark.driver.extraClassPath      /home/hduser/jars/ojdbc6.jar:/
>>>>>>>>> home/hduser/jars/jconn4.jar:/home/hduser/jars/hbase-client-1
>>>>>>>>> .2.3.jar:/home/hduser/jars/hbase-server-1.2.3.jar:/home/hdus
>>>>>>>>> er/jars/hbase-common-1.2.3.jar:/home/hduser/jars/hbase-proto
>>>>>>>>> col-1.2.3.jar:/home/hduser/jars/htrace-core-3.0.4.jar:/home/
>>>>>>>>> hduser/jars/hive-hbase-handler-2.1.0.jar
>>>>>>>>> spark.executor.extraClassPath    /home/hduser/jars/ojdbc6.jar:/
>>>>>>>>> home/hduser/jars/jconn4.jar:/home/hduser/jars/hbase-client-1
>>>>>>>>> .2.3.jar:/home/hduser/jars/hbase-server-1.2.3.jar:/home/hdus
>>>>>>>>> er/jars/hbase-common-1.2.3.jar:/home/hduser/jars/hbase-proto
>>>>>>>>> col-1.2.3.jar:/home/hduser/jars/htrace-core-3.0.4.jar:/home/
>>>>>>>>> hduser/jars/hive-hbase-handler-2.1.0.jar
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> and also in Spark shell where I test the code
>>>>>>>>>
>>>>>>>>>  --jars /home/hduser/jars/hbase-client
>>>>>>>>> -1.2.3.jar,/home/hduser/jars/hbase-server-1.2.3.jar,/home/hd
>>>>>>>>> user/jars/hbase-common-1.2.3.jar,/home/hduser/jars/hbase-pro
>>>>>>>>> tocol-1.2.3.jar,/home/hduser/jars/htrace-core-3.0.4.jar,/hom
>>>>>>>>> e/hduser/jars/hive-hbase-handler-2.1.0.jar'
>>>>>>>>>
>>>>>>>>> So any ideas will be appreciated.
>>>>>>>>>
>>>>>>>>> Thanks
>>>>>>>>>
>>>>>>>>> Dr Mich Talebzadeh
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> LinkedIn * https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>>>>>>>> <https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> http://talebzadehmich.wordpress.com
>>>>>>>>>
>>>>>>>>> *Disclaimer:* Use it at your own risk. Any and all responsibility
>>>>>>>>> for any loss, damage or destruction of data or any other property which may
>>>>>>>>> arise from relying on this email's technical content is explicitly
>>>>>>>>> disclaimed. The author will in no case be liable for any monetary damages
>>>>>>>>> arising from such loss, damage or destruction.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>
>>>
>>
>>
>> --
>> Best Regards,
>> Ayan Guha
>>
>
>

Re: Loading data into Hbase table throws NoClassDefFoundError: org/apache/htrace/Trace error

Posted by Mich Talebzadeh <mi...@gmail.com>.
With ticker+date I can create something like below for the row key:

TSCO_1-Apr-08


or TSCO1-Apr-08

if I understood you correctly


Dr Mich Talebzadeh



LinkedIn * https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
<https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.



On 3 October 2016 at 13:13, ayan guha <gu...@gmail.com> wrote:

> Hi
>
> Looks like you are saving to new.csv but still loading tsco.csv? Its
> definitely the header.
>
> Suggestion: ticker+date as row key has following benefits:
>
> 1. using ticker+date as row key will enable you to hold multiple ticker in
> this single hbase table. (Think composite primary key)
> 2. Using date itself as row key will lead to hotspots (Look up hotspoting
> due to monotonically increasing row key). To distribute the load, it is
> suggested to use a salting. Ticker can be used as a natural salt in this
> case.
> 3. Also, you may want to hash the rowkey value to give it little more
> flexible (Think surrogate key).
>
>
>
> On Mon, Oct 3, 2016 at 10:17 PM, Mich Talebzadeh <
> mich.talebzadeh@gmail.com> wrote:
>
>> Hi Ayan,
>>
>> Sounds like the row key has to be unique much like a primary key in RDBMS
>>
>> This is what I download as a csv for stock from Google Finance
>>
>>   Date Open High Low Close Volume
>> 27-Sep-16 177.4 177.75 172.5 177.75 24117196
>>
>>
>> So What I do I add the stock and ticker myself to end of the row via
>> shell script and get rid of header
>>
>> sed -i 1d tsco.csv; cat tsco.csv|awk '{print $0,",TESCO PLC,TSCO"}' >
>> new.csv
>>
>> The New table has two column families: stock_price, stock_info and row
>> key date (one row per date)
>>
>> This creates a new csv file with two additional columns appended to the
>> end of each line
>>
>> Then I run the following command
>>
>> $HBASE_HOME/bin/hbase org.apache.hadoop.hbase.mapreduce.ImportTsv
>> -Dimporttsv.separator=',' -Dimporttsv.columns="HBASE_ROW_KEY,
>> stock_daily:open, stock_daily:high, stock_daily:low, stock_daily:close,
>> stock_daily:volume, stock_info:stock, stock_info:ticker" tsco
>> hdfs://rhes564:9000/data/stocks/tsco.csv
>>
>> This is in Hbase table for a given day
>>
>> hbase(main):090:0> scan 'tsco', LIMIT => 10
>> ROW                                                    COLUMN+CELL
>>  1-Apr-08
>> column=stock_daily:close, timestamp=1475492248665, value=405.25
>>  1-Apr-08
>> column=stock_daily:high, timestamp=1475492248665, value=406.75
>>  1-Apr-08
>> column=stock_daily:low, timestamp=1475492248665, value=379.25
>>  1-Apr-08
>> column=stock_daily:open, timestamp=1475492248665, value=380.00
>>  1-Apr-08
>> column=stock_daily:volume, timestamp=1475492248665, value=49664486
>>  1-Apr-08
>> column=stock_info:stock, timestamp=1475492248665, value=TESCO PLC
>>  1-Apr-08
>> column=stock_info:ticker, timestamp=1475492248665, value=TSCO
>>
>>
>> But I also have this at the bottom
>>
>>   Date
>> column=stock_daily:close, timestamp=1475491189158, value=Close
>>  Date
>> column=stock_daily:high, timestamp=1475491189158, value=High
>>  Date
>> column=stock_daily:low, timestamp=1475491189158, value=Low
>>  Date
>> column=stock_daily:open, timestamp=1475491189158, value=Open
>>  Date
>> column=stock_daily:volume, timestamp=1475491189158, value=Volume
>>  Date
>> column=stock_info:stock, timestamp=1475491189158, value=TESCO PLC
>>  Date
>> column=stock_info:ticker, timestamp=1475491189158, value=TSCO
>>
>> Sounds like the table header?
>>
>>
>>
>>
>>
>>
>>
>>
>>
>> Dr Mich Talebzadeh
>>
>>
>>
>> LinkedIn * https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>> <https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>>
>>
>>
>> http://talebzadehmich.wordpress.com
>>
>>
>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>> any loss, damage or destruction of data or any other property which may
>> arise from relying on this email's technical content is explicitly
>> disclaimed. The author will in no case be liable for any monetary damages
>> arising from such loss, damage or destruction.
>>
>>
>>
>> On 3 October 2016 at 11:24, ayan guha <gu...@gmail.com> wrote:
>>
>>> I am not well versed with importtsv, but you can create a CSV file using
>>> a simple spark program to create first column as ticker+tradedate. I
>>> remember doing similar manipulation to create row key format in pig.
>>> On 3 Oct 2016 20:40, "Mich Talebzadeh" <mi...@gmail.com>
>>> wrote:
>>>
>>>> Thanks Ayan,
>>>>
>>>> How do you specify ticker+rtrade as row key in the below
>>>>
>>>> hbase org.apache.hadoop.hbase.mapreduce.ImportTsv
>>>> -Dimporttsv.separator=',' -Dimporttsv.columns="HBASE_ROW_KEY,
>>>> stock_daily:ticker, stock_daily:tradedate, stock_daily:open,stock_daily:h
>>>> igh,stock_daily:low,stock_daily:close,stock_daily:volume" tsco
>>>> hdfs://rhes564:9000/data/stocks/tsco.csv
>>>>
>>>> I always thought that Hbase will take the first column as row key so it
>>>> takes stock as the row key which is tsco plc for every row!
>>>>
>>>> Does row key need to be unique?
>>>>
>>>> cheers
>>>>
>>>>
>>>> Dr Mich Talebzadeh
>>>>
>>>>
>>>>
>>>> LinkedIn * https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>>> <https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>>>>
>>>>
>>>>
>>>> http://talebzadehmich.wordpress.com
>>>>
>>>>
>>>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>>>> any loss, damage or destruction of data or any other property which may
>>>> arise from relying on this email's technical content is explicitly
>>>> disclaimed. The author will in no case be liable for any monetary damages
>>>> arising from such loss, damage or destruction.
>>>>
>>>>
>>>>
>>>> On 3 October 2016 at 10:30, ayan guha <gu...@gmail.com> wrote:
>>>>
>>>>> Hi Mitch
>>>>>
>>>>> It is more to do with hbase than spark.
>>>>>
>>>>> Row key can be anything, yes but essentially what you are doing is
>>>>> insert and update tesco PLC row. Given your schema, ticker+trade date seems
>>>>> to be a good row key
>>>>> On 3 Oct 2016 18:25, "Mich Talebzadeh" <mi...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> thanks again.
>>>>>>
>>>>>> I added that jar file to the classpath and that part worked.
>>>>>>
>>>>>> I was using spark shell so I have to use spark-submit for it to be
>>>>>> able to interact with map-reduce job.
>>>>>>
>>>>>> BTW when I use the command line utility ImportTsv  to load a file
>>>>>> into Hbase with the following table format
>>>>>>
>>>>>> describe 'marketDataHbase'
>>>>>> Table marketDataHbase is ENABLED
>>>>>> marketDataHbase
>>>>>> COLUMN FAMILIES DESCRIPTION
>>>>>> {NAME => 'price_info', BLOOMFILTER => 'ROW', VERSIONS => '1',
>>>>>> IN_MEMORY => 'false', KEEP_DELETED_CELLS => 'FALSE', DATA_BLOCK_ENCODING =>
>>>>>> 'NONE', TTL => 'FOREVER', COMPRESSION => 'NONE', MIN_VERSIONS => '0', BLOCKC
>>>>>> ACHE => 'true', BLOCKSIZE => '65536', REPLICATION_SCOPE => '0'}
>>>>>> 1 row(s) in 0.0930 seconds
>>>>>>
>>>>>>
>>>>>> hbase org.apache.hadoop.hbase.mapreduce.ImportTsv
>>>>>> -Dimporttsv.separator=',' -Dimporttsv.columns="HBASE_ROW_KEY,
>>>>>> stock_daily:ticker, stock_daily:tradedate, stock_daily:open,stock_daily:h
>>>>>> igh,stock_daily:low,stock_daily:close,stock_daily:volume" tsco
>>>>>> hdfs://rhes564:9000/data/stocks/tsco.csv
>>>>>>
>>>>>> There are with 1200 rows in the csv file,* but it only loads the
>>>>>> first row!*
>>>>>>
>>>>>> scan 'tsco'
>>>>>> ROW                                                    COLUMN+CELL
>>>>>>  Tesco PLC
>>>>>> column=stock_daily:close, timestamp=1475447365118, value=325.25
>>>>>>  Tesco PLC
>>>>>> column=stock_daily:high, timestamp=1475447365118, value=332.00
>>>>>>  Tesco PLC
>>>>>> column=stock_daily:low, timestamp=1475447365118, value=324.00
>>>>>>  Tesco PLC
>>>>>> column=stock_daily:open, timestamp=1475447365118, value=331.75
>>>>>>  Tesco PLC
>>>>>> column=stock_daily:ticker, timestamp=1475447365118, value=TSCO
>>>>>>  Tesco PLC
>>>>>> column=stock_daily:tradedate, timestamp=1475447365118, value= 3-Jan-06
>>>>>>  Tesco PLC
>>>>>> column=stock_daily:volume, timestamp=1475447365118, value=46935045
>>>>>> 1 row(s) in 0.0390 seconds
>>>>>>
>>>>>> Is this because the hbase_row_key --> Tesco PLC is the same for all?
>>>>>> I thought that the row key can be anything.
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> Dr Mich Talebzadeh
>>>>>>
>>>>>>
>>>>>>
>>>>>> LinkedIn * https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>>>>> <https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>>>>>>
>>>>>>
>>>>>>
>>>>>> http://talebzadehmich.wordpress.com
>>>>>>
>>>>>>
>>>>>> *Disclaimer:* Use it at your own risk. Any and all responsibility
>>>>>> for any loss, damage or destruction of data or any other property which may
>>>>>> arise from relying on this email's technical content is explicitly
>>>>>> disclaimed. The author will in no case be liable for any monetary damages
>>>>>> arising from such loss, damage or destruction.
>>>>>>
>>>>>>
>>>>>>
>>>>>> On 3 October 2016 at 07:44, Benjamin Kim <bb...@gmail.com> wrote:
>>>>>>
>>>>>>> We installed Apache Spark 1.6.0 at the time alongside CDH 5.4.8
>>>>>>> because Cloudera only had Spark 1.3.0 at the time, and we wanted to use
>>>>>>> Spark 1.6.0’s features. We borrowed the /etc/spark/conf/spark-env.sh file
>>>>>>> that Cloudera generated because it was customized to add jars first from
>>>>>>> paths listed in the file /etc/spark/conf/classpath.txt. So, we entered the
>>>>>>> path for the htrace jar into the /etc/spark/conf/classpath.txt file. Then,
>>>>>>> it worked. We could read/write to HBase.
>>>>>>>
>>>>>>> On Oct 2, 2016, at 12:52 AM, Mich Talebzadeh <
>>>>>>> mich.talebzadeh@gmail.com> wrote:
>>>>>>>
>>>>>>> Thanks Ben
>>>>>>>
>>>>>>> The thing is I am using Spark 2 and no stack from CDH!
>>>>>>>
>>>>>>> Is this approach to reading/writing to Hbase specific to Cloudera?
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> Dr Mich Talebzadeh
>>>>>>>
>>>>>>>
>>>>>>> LinkedIn * https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>>>>>> <https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>>>>>>>
>>>>>>>
>>>>>>> http://talebzadehmich.wordpress.com
>>>>>>>
>>>>>>> *Disclaimer:* Use it at your own risk. Any and all responsibility
>>>>>>> for any loss, damage or destruction of data or any other property which may
>>>>>>> arise from relying on this email's technical content is explicitly
>>>>>>> disclaimed. The author will in no case be liable for any monetary damages
>>>>>>> arising from such loss, damage or destruction.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On 1 October 2016 at 23:39, Benjamin Kim <bb...@gmail.com> wrote:
>>>>>>>
>>>>>>>> Mich,
>>>>>>>>
>>>>>>>> I know up until CDH 5.4 we had to add the HTrace jar to the
>>>>>>>> classpath to make it work using the command below. But after upgrading to
>>>>>>>> CDH 5.7, it became unnecessary.
>>>>>>>>
>>>>>>>> echo "/opt/cloudera/parcels/CDH/jars/htrace-core-3.2.0-incubating.jar"
>>>>>>>> >> /etc/spark/conf/classpath.txt
>>>>>>>>
>>>>>>>> Hope this helps.
>>>>>>>>
>>>>>>>> Cheers,
>>>>>>>> Ben
>>>>>>>>
>>>>>>>>
> --
> Best Regards,
> Ayan Guha
>

Re: Loading data into Hbase table throws NoClassDefFoundError: org/apache/htrace/Trace error

Posted by ayan guha <gu...@gmail.com>.
Hi

Looks like you are saving to new.csv but still loading tsco.csv? And yes, that
extra row is definitely the header.

Suggestion: using ticker+date as the row key has the following benefits:

1. Using ticker+date as the row key lets you hold multiple tickers in this
single HBase table (think composite primary key).
2. Using the date alone as the row key will lead to hotspots (look up
hotspotting caused by monotonically increasing row keys). To distribute the
load, it is suggested to use salting; the ticker can serve as a natural salt
in this case.
3. Also, you may want to hash the row key value to make it a little more
flexible (think surrogate key). A minimal sketch of these ideas follows below.
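
For illustration only, a minimal Scala sketch of such a row key builder. The
helper names buildRowKey and saltedKey and the 4-character hash prefix are my
own assumptions for this sketch, not anything taken from the thread:

import java.security.MessageDigest

// Composite key: ticker + trade date, e.g. "TSCO_27-Sep-16" (the composite primary key idea)
def buildRowKey(ticker: String, tradeDate: String): String =
  s"${ticker}_${tradeDate}"

// Optional salting/hashing: prefix a short hash of the ticker so writes spread across regions
def saltedKey(ticker: String, tradeDate: String): String = {
  val digest = MessageDigest.getInstance("MD5")
    .digest(ticker.getBytes("UTF-8"))
    .map("%02x".format(_))
    .mkString
  s"${digest.take(4)}_${buildRowKey(ticker, tradeDate)}"
}

// buildRowKey("TSCO", "27-Sep-16")   // => "TSCO_27-Sep-16"
// saltedKey("TSCO", "27-Sep-16")     // => "<4 hex chars>_TSCO_27-Sep-16"

The hash prefix keeps the key non-monotonic while the readable ticker+date part
still lets you scan a single stock's history as a contiguous range.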





-- 
Best Regards,
Ayan Guha

Re: Loading data into Hbase table throws NoClassDefFoundError: org/apache/htrace/Trace error

Posted by Mich Talebzadeh <mi...@gmail.com>.
Hi Ayan,

Sounds like the row key has to be unique, much like a primary key in an RDBMS.

This is what I download as a CSV for a stock from Google Finance:

  Date Open High Low Close Volume
27-Sep-16 177.4 177.75 172.5 177.75 24117196


So what I do is add the stock name and ticker myself to the end of each row
via a shell script and get rid of the header:

sed -i 1d tsco.csv; cat tsco.csv | awk '{print $0 ",TESCO PLC,TSCO"}' > new.csv

The new table has two column families, stock_daily and stock_info, and the row
key is the date (one row per date).

This creates a new csv file with two additional columns appended to the end
of each line
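
For comparison, a minimal Spark alternative to that sed/awk step (a sketch
only; it assumes the file is comma-separated with the header
"Date,Open,High,Low,Close,Volume", and the output directory name is
illustrative, not something used in this thread):

// drop the header row and append the stock name and ticker to every line
val raw = sc.textFile("hdfs://rhes564:9000/data/stocks/tsco.csv")
val header = raw.first()
val enriched = raw.filter(_ != header).map(line => s"$line,TESCO PLC,TSCO")
enriched.saveAsTextFile("hdfs://rhes564:9000/data/stocks/tsco_enriched")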

Then I run the following command

$HBASE_HOME/bin/hbase org.apache.hadoop.hbase.mapreduce.ImportTsv
-Dimporttsv.separator=',' -Dimporttsv.columns="HBASE_ROW_KEY,
stock_daily:open, stock_daily:high, stock_daily:low, stock_daily:close,
stock_daily:volume, stock_info:stock, stock_info:ticker" tsco
hdfs://rhes564:9000/data/stocks/tsco.csv

This is what is in the HBase table for a given day:

hbase(main):090:0> scan 'tsco', LIMIT => 10
ROW             COLUMN+CELL
 1-Apr-08       column=stock_daily:close, timestamp=1475492248665, value=405.25
 1-Apr-08       column=stock_daily:high, timestamp=1475492248665, value=406.75
 1-Apr-08       column=stock_daily:low, timestamp=1475492248665, value=379.25
 1-Apr-08       column=stock_daily:open, timestamp=1475492248665, value=380.00
 1-Apr-08       column=stock_daily:volume, timestamp=1475492248665, value=49664486
 1-Apr-08       column=stock_info:stock, timestamp=1475492248665, value=TESCO PLC
 1-Apr-08       column=stock_info:ticker, timestamp=1475492248665, value=TSCO


But I also have this at the bottom

 Date           column=stock_daily:close, timestamp=1475491189158, value=Close
 Date           column=stock_daily:high, timestamp=1475491189158, value=High
 Date           column=stock_daily:low, timestamp=1475491189158, value=Low
 Date           column=stock_daily:open, timestamp=1475491189158, value=Open
 Date           column=stock_daily:volume, timestamp=1475491189158, value=Volume
 Date           column=stock_info:stock, timestamp=1475491189158, value=TESCO PLC
 Date           column=stock_info:ticker, timestamp=1475491189158, value=TSCO

Looks like the CSV header row, loaded as data under the row key "Date"?
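
If that stray row is indeed the CSV header, a minimal sketch to remove it from
the Scala side, assuming the row key is literally the string "Date" as the scan
suggests (this was not actually run in the thread):

import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
import org.apache.hadoop.hbase.client.{ConnectionFactory, Delete}
import org.apache.hadoop.hbase.util.Bytes

val hconf = HBaseConfiguration.create()
val conn  = ConnectionFactory.createConnection(hconf)
val tsco  = conn.getTable(TableName.valueOf("tsco"))
tsco.delete(new Delete(Bytes.toBytes("Date")))   // removes every cell stored under row key "Date"
tsco.close()
conn.close()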









Dr Mich Talebzadeh



LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw



http://talebzadehmich.wordpress.com


Disclaimer: Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




Re: Loading data into Hbase table throws NoClassDefFoundError: org/apache/htrace/Trace error

Posted by ayan guha <gu...@gmail.com>.
I am not well versed with ImportTsv, but you can create a CSV file with a
simple Spark program that puts ticker+tradedate in the first column. I remember
doing similar manipulation to build the row key format in Pig; a sketch is below.
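
A minimal sketch of that kind of Spark program (the input and output paths are
taken from earlier in the thread, the output directory name and the hard-coded
TSCO ticker are assumptions for illustration):

val in  = sc.textFile("hdfs://rhes564:9000/data/stocks/tsco.csv")
val hdr = in.first()
// prepend "TSCO_<date>" so the first field, mapped to HBASE_ROW_KEY, becomes ticker+tradedate
val keyed = in.filter(_ != hdr).map { line =>
  val tradeDate = line.split(",")(0).trim
  s"TSCO_$tradeDate,$line"
}
keyed.saveAsTextFile("hdfs://rhes564:9000/data/stocks/tsco_keyed")

You would then list HBASE_ROW_KEY first in -Dimporttsv.columns, followed by the
remaining columns, since ImportTsv maps the listed columns to the file's fields
positionally.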

Re: Loading data into Hbase table throws NoClassDefFoundError: org/apache/htrace/Trace error

Posted by Mich Talebzadeh <mi...@gmail.com>.
Thanks Ayan,

How do you specify ticker+tradedate as the row key in the command below?

hbase org.apache.hadoop.hbase.mapreduce.ImportTsv -Dimporttsv.separator=','
-Dimporttsv.columns="HBASE_ROW_KEY, stock_daily:ticker,
stock_daily:tradedate,
stock_daily:open,stock_daily:high,stock_daily:low,stock_daily:close,stock_daily:volume"
tsco hdfs://rhes564:9000/data/stocks/tsco.csv

I always thought that HBase takes the first column as the row key, so it takes
the stock name as the row key, which is Tesco PLC for every row!

Does the row key need to be unique?

cheers


Dr Mich Talebzadeh



LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw



http://talebzadehmich.wordpress.com


Disclaimer: Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




Re: Loading data into Hbase table throws NoClassDefFoundError: org/apache/htrace/Trace error

Posted by ayan guha <gu...@gmail.com>.
Hi Mich

It is more to do with HBase than Spark.

The row key can be anything, yes, but essentially what you are doing is
inserting and then repeatedly updating the single "Tesco PLC" row. Given your
schema, ticker + trade date seems to be a good row key.
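
For illustration only, here is a minimal sketch (mine, not from the original
post) of building such a composite row key for a Put in the Spark shell,
assuming the ticker and trade date have already been parsed out of a csv line:

import org.apache.hadoop.hbase.client.Put
import org.apache.hadoop.hbase.util.Bytes

// hypothetical values taken from one csv line
val ticker = "TSCO"
val tradeDate = "3-Jan-06"

// composite row key: ticker + trade date, so every csv line gets its own row
val rowKey = Bytes.toBytes(s"${ticker}_${tradeDate}")

val put = new Put(rowKey)
put.addColumn(Bytes.toBytes("stock_daily"), Bytes.toBytes("close"), Bytes.toBytes("325.25"))
// table.put(put) then creates one row per ticker/date combination instead of
// overwriting the same "Tesco PLC" row on every line
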
On 3 Oct 2016 18:25, "Mich Talebzadeh" <mi...@gmail.com> wrote:

> [Mich's reply of 3 October and the rest of the earlier thread, quoted in full; trimmed]

Re: Loading data into Hbase table throws NoClassDefFoundError: org/apache/htrace/Trace error

Posted by Mich Talebzadeh <mi...@gmail.com>.
Thanks again.

I added that jar file to the classpath and that part worked.

I was using the Spark shell, so I will have to use spark-submit for it to be
able to interact with the map-reduce job.

BTW, when I use the command line utility ImportTsv to load a file into
Hbase with the following table format:

describe 'marketDataHbase'
Table marketDataHbase is ENABLED
marketDataHbase
COLUMN FAMILIES DESCRIPTION
{NAME => 'price_info', BLOOMFILTER => 'ROW', VERSIONS => '1', IN_MEMORY =>
'false', KEEP_DELETED_CELLS => 'FALSE', DATA_BLOCK_ENCODING => 'NONE', TTL
=> 'FOREVER', COMPRESSION => 'NONE', MIN_VERSIONS => '0', BLOCKCACHE =>
'true', BLOCKSIZE => '65536', REPLICATION_SCOPE => '0'}
1 row(s) in 0.0930 seconds


hbase org.apache.hadoop.hbase.mapreduce.ImportTsv -Dimporttsv.separator=','
-Dimporttsv.columns="HBASE_ROW_KEY, stock_daily:ticker,
stock_daily:tradedate,
stock_daily:open,stock_daily:high,stock_daily:low,stock_daily:close,stock_daily:volume"
tsco hdfs://rhes564:9000/data/stocks/tsco.csv

There are 1200 rows in the csv file, *but it only loads the first row!*

scan 'tsco'
ROW                                                    COLUMN+CELL
 Tesco PLC
column=stock_daily:close, timestamp=1475447365118, value=325.25
 Tesco PLC
column=stock_daily:high, timestamp=1475447365118, value=332.00
 Tesco PLC
column=stock_daily:low, timestamp=1475447365118, value=324.00
 Tesco PLC
column=stock_daily:open, timestamp=1475447365118, value=331.75
 Tesco PLC
column=stock_daily:ticker, timestamp=1475447365118, value=TSCO
 Tesco PLC
column=stock_daily:tradedate, timestamp=1475447365118, value= 3-Jan-06
 Tesco PLC
column=stock_daily:volume, timestamp=1475447365118, value=46935045
1 row(s) in 0.0390 seconds

Is this because the hbase_row_key --> "Tesco PLC" is the same for all rows? I
thought that the row key could be anything.
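
If that is the cause, it should be easy to reproduce from the Spark shell.
A small sketch of my own, assuming the table 'tsco' with column family
'stock_daily' already exists and the classpath issue above is sorted:

import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.{HTable, Put}
import org.apache.hadoop.hbase.util.Bytes

val conf = HBaseConfiguration.create()
val table = new HTable(conf, "tsco")

// two puts with the same row key "Tesco PLC": the second one simply replaces
// the value the first one wrote, so a subsequent scan still shows one row
val p1 = new Put(Bytes.toBytes("Tesco PLC"))
p1.addColumn(Bytes.toBytes("stock_daily"), Bytes.toBytes("close"), Bytes.toBytes("325.25"))
val p2 = new Put(Bytes.toBytes("Tesco PLC"))
p2.addColumn(Bytes.toBytes("stock_daily"), Bytes.toBytes("close"), Bytes.toBytes("326.00"))
table.put(p1)
table.put(p2)
table.close()

// count 'tsco' in the hbase shell still reports 1 row(s)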





Dr Mich Talebzadeh



LinkedIn * https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
<https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.



On 3 October 2016 at 07:44, Benjamin Kim <bb...@gmail.com> wrote:

> [Benjamin Kim's reply of 3 October and the rest of the earlier thread, quoted in full; trimmed]

Re: Loading data into Hbase table throws NoClassDefFoundError: org/apache/htrace/Trace error

Posted by Benjamin Kim <bb...@gmail.com>.
We installed Apache Spark 1.6.0 at the time alongside CDH 5.4.8 because Cloudera only had Spark 1.3.0 at the time, and we wanted to use Spark 1.6.0’s features. We borrowed the /etc/spark/conf/spark-env.sh file that Cloudera generated because it was customized to add jars first from paths listed in the file /etc/spark/conf/classpath.txt. So, we entered the path for the htrace jar into the /etc/spark/conf/classpath.txt file. Then, it worked. We could read/write to HBase. 
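
For a vanilla (non-CDH) install the equivalent is simply to make sure the
htrace jar that your HBase version ships with (it is in the HBase lib
directory) is on both the driver and executor classpaths, for example via
spark-defaults.conf or --jars. A sketch with a hypothetical path and jar
version:

spark.driver.extraClassPath      /home/hduser/jars/htrace-core-3.1.0-incubating.jar
spark.executor.extraClassPath    /home/hduser/jars/htrace-core-3.1.0-incubating.jar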

> On Oct 2, 2016, at 12:52 AM, Mich Talebzadeh <mi...@gmail.com> wrote:
> 
> [Mich's reply of 2 October and the rest of the earlier thread, quoted in full; trimmed]


Re: Loading data into Hbase table throws NoClassDefFoundError: org/apache/htrace/Trace error

Posted by Mich Talebzadeh <mi...@gmail.com>.
Thanks Ben

The thing is, I am using Spark 2 and no stack from CDH!

Is this approach to reading/writing to Hbase specific to Cloudera?





Dr Mich Talebzadeh



LinkedIn * https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
<https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.



On 1 October 2016 at 23:39, Benjamin Kim <bb...@gmail.com> wrote:

> [Benjamin Kim's reply of 1 October and the original message, quoted in full; trimmed]

Re: Loading data into Hbase table throws NoClassDefFoundError: org/apache/htrace/Trace error

Posted by Benjamin Kim <bb...@gmail.com>.
Mich,

I know up until CDH 5.4 we had to add the HTrace jar to the classpath to make it work using the command below. But after upgrading to CDH 5.7, it became unnecessary.

echo "/opt/cloudera/parcels/CDH/jars/htrace-core-3.2.0-incubating.jar" >> /etc/spark/conf/classpath.txt

Hope this helps.

Cheers,
Ben


> On Oct 1, 2016, at 3:22 PM, Mich Talebzadeh <mi...@gmail.com> wrote:
> 
> [original message quoted in full; trimmed]