Posted to user@hbase.apache.org by tgh <gu...@ia.ac.cn> on 2012/12/06 03:51:40 UTC

how to store 100billion short text messages with hbase

Hi
	I am trying to use HBase to store 100 billion short text messages. Each
message has fewer than 1000 characters plus some other fields; that is,
each message has fewer than 10 fields.
	The whole data set is a stream covering about one year, and I want to
create multiple tables to store it. I have two ideas: one is to store one
hour of data per table, giving 365*24 tables for a year; the other is to
store one day of data per table, giving 365 tables.

	I have about 15 compute nodes to handle this data, and I want to know
which approach is better: 365*24 tables, 365 tables, or some other design
entirely.

	I am really confused about HBase; it is powerful yet a bit complex for
me. Could you give me some advice on an HBase data schema and related
matters?
	Could you help me?


Thank you
---------------------------------
Tian Guanhua
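As a side note on the schema question above: instead of one table per hour or per day, a single table whose row keys begin with an hour bucket makes one hour of messages a contiguous scan range. A minimal sketch of such a key layout (the key format and the message-id field are invented for illustration, not taken from any actual schema):

```python
from datetime import datetime, timezone

def row_key(ts: datetime, msg_id: str) -> str:
    """Prefix the key with an hour bucket so that one hour of messages
    forms a contiguous range a single scan can cover."""
    return f"{ts.strftime('%Y%m%d%H')}-{msg_id}"

def hour_range(ts: datetime) -> tuple[str, str]:
    """Start/stop keys covering every message in the given hour."""
    bucket = ts.strftime('%Y%m%d%H')
    # ASCII '.' (46) sorts just after '-' (45), so bucket + '.' is an
    # exclusive stop key past every key that starts with bucket + '-'.
    return bucket + "-", bucket + "."

ts = datetime(2012, 12, 6, 3, 51, tzinfo=timezone.utc)
print(row_key(ts, "msg00042"))   # 2012120603-msg00042
print(hour_range(ts))            # ('2012120603-', '2012120603.')
```

One caveat: purely time-prefixed keys send all current writes to one region; in practice a salt or hash prefix is often added to spread the load, at the cost of fanning out the per-hour scan.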







Re: Re: how to store 100billion short text messages with hbase

Posted by Otis Gospodnetic <ot...@gmail.com>.
Hello,

If you want to use Lucene... why not use Lucene directly, or one of the
fancy search servers built on top of it: Solr(Cloud), ElasticSearch, or
SenseiDB? You can easily shard the index by time, look up by key, and run
full-text searches with results sorted by some key value or by relevance to
the query.

Otis
--
Performance Monitoring - http://sematext.com/spm/index.html
Search Analytics - http://sematext.com/search-analytics/index.html
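The time-sharding idea above can be sketched as simple routing logic: each message goes to the index shard for its time bucket, and a time-range query fans out to the matching shards. A rough illustration (daily shards and the naming scheme are assumptions for the sketch, not any particular search server's API):

```python
from datetime import datetime, timedelta

def shard_for(ts: datetime) -> str:
    # One index shard per day, named by date.
    return "msgs-" + ts.strftime("%Y-%m-%d")

def shards_for_range(start: datetime, end: datetime) -> list[str]:
    """All daily shards a query over [start, end] must search."""
    shards = []
    day = start.replace(hour=0, minute=0, second=0, microsecond=0)
    while day <= end:
        shards.append(shard_for(day))
        day += timedelta(days=1)
    return shards

print(shards_for_range(datetime(2012, 12, 5), datetime(2012, 12, 7)))
# ['msgs-2012-12-05', 'msgs-2012-12-06', 'msgs-2012-12-07']
```

Old shards become read-only once their day has passed, which makes them cheap to optimize, replicate, or age out wholesale.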




On Wed, Dec 5, 2012 at 10:28 PM, tgh <gu...@ia.ac.cn> wrote:

> Thank you for your reply.
>
> I want to access the data with a Lucene search engine, that is, retrieve
> any message by key, and I also want to get one hour of data together, so I
> am thinking of splitting the data into one table per hour. Or, if I store
> it all in one big table, is that better than 365 tables or 365*24 tables?
> Which layout is best for my access pattern? I am also confused about how
> to build a secondary index in HBase if I use a keyword search engine such
> as Lucene.
>
>
> Could you help me
> Thank you
>
> -------------
> Tian Guanhua
>
>
>
> -----Original Message-----
> From: user-return-32247-guanhua.tian=ia.ac.cn@hbase.apache.org
> [mailto:user-return-32247-guanhua.tian=ia.ac.cn@hbase.apache.org] On Behalf Of Ian
> Varley
> Sent: December 6, 2012 11:01
> To: user@hbase.apache.org
> Subject: Re: how to store 100billion short text messages with hbase
>
> Tian,
>
> The best way to think about how to structure your data in HBase is to ask
> the question: "How will I access it?". Perhaps you could reply with the
> sorts of queries you expect to be able to do over this data? For example,
> retrieve any single conversation between two people in < 10 ms; or show all
> conversations that happened in a single hour, regardless of participants.
> HBase only gives you fast GET/SCAN access along a single "primary" key (the
> row key) so you must choose it carefully, or else duplicate & denormalize
> your data for fast access.
>
> Your data size seems reasonable (but not overwhelming) for HBase. 100B
> messages x 1K bytes per message on average comes out to 100TB. That, plus
> 3x replication in HDFS, means you need roughly 300TB of space. If you have
> 13 nodes (taking out 2 for redundant master services) that's a requirement
> for about 23T of space per server. That's a lot, even these days. Did I
> get all that math right?
>
> On your question about multiple tables: a table in HBase is only a
> namespace for rowkeys, and a container for a set of regions. If it's a
> homogeneous data set, there's no advantage to breaking the table into
> multiple tables; that's what regions within the table are for.
>
> Ian
>
> ps - Please don't cross post to both dev@ and user@.
>
> On Dec 5, 2012, at 8:51 PM, tgh wrote:
>
>
>
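Ian's capacity math above checks out; as a quick back-of-the-envelope in code (all figures taken from his message, using decimal TB):

```python
messages = 100_000_000_000   # 100 billion messages
bytes_per_msg = 1_000        # ~1 KB per message on average
replication = 3              # HDFS default replication factor
data_nodes = 13              # 15 nodes minus 2 for redundant masters

raw_tb = messages * bytes_per_msg / 1e12   # raw data volume
replicated_tb = raw_tb * replication       # on-disk volume after replication
per_node_tb = replicated_tb / data_nodes   # required space per server

print(raw_tb, replicated_tb, round(per_node_tb, 1))  # 100.0 300.0 23.1
```

So roughly 23 TB of disk per server, before accounting for compactions, WALs, and any index data, which is why Ian calls it "a lot".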

Re: how config multi regionserver, or what is wrong?

Posted by tgh <gu...@ia.ac.cn>.
Meanwhile, the log on the master (that is, blade1) looks like the following; there are some ERRORs like this:

2012-09-01 06:31:05,558 INFO org.apache.hadoop.hbase.master.ServerManager: Registering server=blade2,60020,1346452716636, regionCount=0, userLoad=false
2012-09-01 06:31:05,569 WARN org.apache.hadoop.hbase.master.ServerManager: Server blade4,60020,1346451768443 has been rejected; Reported time is too far out of sync with master.  Time difference of 496371ms > max allowed of 30000ms
2012-09-01 06:31:05,581 WARN org.apache.hadoop.hbase.master.ServerManager: Server blade5,60020,1346452001672 has been rejected; Reported time is too far out of sync with master.  Time difference of 263137ms > max allowed of 30000ms
2012-09-01 06:31:05,583 ERROR org.apache.hadoop.hbase.master.HMaster: Region server serverName=blade4,60020,1346451768443, load=(requests=0, regions=0, usedHeap=142, maxHeap=966) reported a fatal error:
ABORTING region server serverName=blade4,60020,1346451768443, load=(requests=0, regions=0, usedHeap=142, maxHeap=966): Unhandled exception: org.apache.hadoop.hbase.ClockOutOfSyncException: Server blade4,60020,1346451768443 has been rejected; Reported time is too far out of sync with master.  Time difference of 496371ms > max allowed of 30000ms
Cause:
org.apache.hadoop.hbase.ClockOutOfSyncException: org.apache.hadoop.hbase.ClockOutOfSyncException: Server blade4,60020,1346451768443 has been rejected; Reported time is too far out of sync with master.  Time difference of 496371ms > max allowed of 30000ms
        at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
        at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
        at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
        at java.lang.reflect.Constructor.newInstance(Constructor.java:525)
        at org.apache.hadoop.ipc.RemoteException.instantiateException(RemoteException.java:95)
        at org.apache.hadoop.ipc.RemoteException.unwrapRemoteException(RemoteException.java:79)
        at org.apache.hadoop.hbase.regionserver.HRegionServer.reportForDuty(HRegionServer.java:1574)
        at org.apache.hadoop.hbase.regionserver.HRegionServer.tryReportForDuty(HRegionServer.java:1531)
        at org.apache.hadoop.hbase.regionserver.HRegionServer.run(HRegionServer.java:572)
        at java.lang.Thread.run(Thread.java:722)
Caused by: org.apache.hadoop.ipc.RemoteException: org.apache.hadoop.hbase.ClockOutOfSyncException: Server blade4,60020,1346451768443 has been rejected; Reported time is too far out of sync with master.  Time difference of 496371ms > max allowed of 30000ms
        at org.apache.hadoop.hbase.master.ServerManager.checkClockSkew(ServerManager.java:193)
        at org.apache.hadoop.hbase.master.ServerManager.regionServerStartup(ServerManager.java:141)
        at org.apache.hadoop.hbase.master.HMaster.regionServerStartup(HMaster.java:675)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:601)
        at org.apache.hadoop.hbase.ipc.HBaseRPC$Server.call(HBaseRPC.java:570)
        at org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:1039)

        at org.apache.hadoop.hbase.ipc.HBaseClient.call(HBaseClient.java:771)
        at org.apache.hadoop.hbase.ipc.HBaseRPC$Invoker.invoke(HBaseRPC.java:257)
        at $Proxy5.regionServerStartup(Unknown Source)
        at org.apache.hadoop.hbase.regionserver.HRegionServer.reportForDuty(HRegionServer.java:1570)
        ... 3 more


-----Original Message-----
From: user-return-32385-guanhua.tian=ia.ac.cn@hbase.apache.org [mailto:user-return-32385-guanhua.tian=ia.ac.cn@hbase.apache.org] On Behalf Of tgh
Sent: December 11, 2012 9:59
To: user@hbase.apache.org
Subject: Re: how config multi regionserver, or what is wrong?

And our hosts file is as follows:

[root@blade1 ~]# cat /etc/hosts
127.0.0.1   localhost localhost.localdomain localhost4 localhost4.localdomain4
::1         localhost localhost.localdomain localhost6 localhost6.localdomain6

192.168.76.233 blade1
192.168.76.234 blade2
192.168.76.235 blade3
192.168.76.236 blade4
192.168.76.237 blade5
192.168.76.238 blade6
192.168.76.239 blade7
192.168.76.240 blade8

192.168.76.245 fnode1
192.168.76.246 fnode2
[root@blade1 ~]#

-----Original Message-----
From: user-return-32384-guanhua.tian=ia.ac.cn@hbase.apache.org [mailto:user-return-32384-guanhua.tian=ia.ac.cn@hbase.apache.org] On Behalf Of tgh
Sent: December 11, 2012 9:00
To: user@hbase.apache.org
Subject: Re: how config multi regionserver, or what is wrong?

Thank you for your reply.
The configuration files are as follows.
Could you help me,


Thank you
---------------------------
Tian Guanhua



[root@blade1 conf]# cat regionservers 
blade1
blade2
blade3
blade4
blade5
blade6
blade7
blade8
[root@blade1 conf]#
[root@blade1 conf]# vim hbase-site.xml
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>hbase.rootdir</name>
    <value>hdfs://blade1:9000/hbase</value>
    <description>The directory shared by RegionServers.</description>
  </property>
  <property>
    <name>hbase.cluster.distributed</name>
    <value>true</value>
  </property>
  <property>
    <name>hbase.zookeeper.quorum</name>
    <value>blade1,blade2,blade3</value>
  </property>
  <property>
    <name>hbase.zookeeper.property.dataDir</name>
    <value>/home/liuxin/zookeeper/data</value>
  </property>
  <property>
    <name>dfs.support.append</name>
    <value>true</value>
  </property>
  <property>
    <name>dfs.datanode.max.xcievers</name>
    <value>4096</value>
  </property>
  <property>
    <name>hbase.master</name>
    <value>blade1:60000</value>
  </property>
</configuration>






-----Original Message-----
From: user-return-32370-guanhua.tian=ia.ac.cn@hbase.apache.org [mailto:user-return-32370-guanhua.tian=ia.ac.cn@hbase.apache.org] On Behalf Of Jean-Marc Spaggiari
Sent: December 10, 2012 20:54
To: user@hbase.apache.org
Subject: Re: how config multi regionserver, or what is wrong?

Hi Tian,

Can you share your configuration files?

Do you have something like this in your hbase-site.xml file?

  <property>
    <name>hbase.cluster.distributed</name>
    <value>true</value>
    <description>The mode the cluster will be in. Possible values are
      false: standalone and pseudo-distributed setups with managed Zookeeper
      true: fully-distributed with unmanaged Zookeeper Quorum (see hbase-env.sh)
    </description>
  </property>


JM

2012/12/10, tgh <gu...@ia.ac.cn>:
> Hi
> 	I am trying to use HBase, and now I have a problem with the HBase
> configuration. I use 8 nodes for a trial, and it seems to work: hadoop,
> zookeeper, and hbase all boot up, and I can insert data with the Put API.
> But when I use masterIP:60010 to manage the cluster, I find there is only
> one regionserver listed, and it is localhost:60030. Why?
> 	I have set up the regionservers file; there are 8 nodes in it, and all
> can ssh to each other without a password.
>
> 	I also use hdfsNamenodeIP:50070 to look at HDFS, and it seems OK; the
> data has been balanced across the 8 nodes. But I wonder why HBase has only
> one regionserver working, although when I start/stop the regionservers,
> there are 8 region servers that start and stop.
>
> 	I tried to put data into HBase, and it seemed OK at first, but after
> 200 million rows it became really hard to insert more; it is very slow.
> Again, when I use masterIP:60010 to manage the cluster, I find there is
> only one regionserver listed, and it is localhost:60030. Why?
>
>
> 	Could you help me,
>
> 	
>
> Thank you
> -------------------
> Tian Guanhua
>
>
>
>
>
>
>
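The rejections in the log above come down to a simple skew test in the master: compare the regionserver's reported time with the master's clock and refuse registration past a threshold. A rough sketch of that logic (this is an illustration, not the actual HBase source; the 30000 ms limit is the default seen in the log):

```python
MAX_SKEW_MS = 30_000  # default maximum allowed clock skew

def check_clock_skew(server_time_ms: int, master_time_ms: int) -> None:
    """Raise if the regionserver's clock is too far from the master's."""
    skew = abs(server_time_ms - master_time_ms)
    if skew > MAX_SKEW_MS:
        raise RuntimeError(
            f"Reported time is too far out of sync with master. "
            f"Time difference of {skew}ms > max allowed of {MAX_SKEW_MS}ms")

# blade4's skew from the log, 496371 ms, is roughly 8 minutes -> rejected.
try:
    check_clock_skew(496_371, 0)
except RuntimeError as e:
    print("rejected:", e)
```

A skew of 496371 ms is about 8.3 minutes, far past any reasonable threshold, which is why the usual fix is simply running a time synchronization daemon such as ntpd on every node.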





Re: Re: how config multi regionserver, or what is wrong?

Posted by Leonid Fedotov <lf...@hortonworks.com>.
Your nodes' clocks are too far out of sync:
> Reported time is too far out of sync with master.  Time difference of 496371ms > max allowed of 30000ms
You need to set up a time synchronization service (such as NTP) for your cluster.

Thank you!

Sincerely,
Leonid Fedotov
Hortonworks support team


On Dec 10, 2012, at 6:17 PM, tgh wrote:



Re: how config multi regionserver, or what is wrong?

Posted by tgh <gu...@ia.ac.cn>.
And our hosts file is as follows:

[root@blade1 ~]# cat /etc/hosts
127.0.0.1   localhost localhost.localdomain localhost4 localhost4.localdomain4
::1         localhost localhost.localdomain localhost6 localhost6.localdomain6

192.168.76.233 blade1
192.168.76.234 blade2
192.168.76.235 blade3
192.168.76.236 blade4
192.168.76.237 blade5
192.168.76.238 blade6
192.168.76.239 blade7
192.168.76.240 blade8

192.168.76.245 fnode1
192.168.76.246 fnode2
[root@blade1 ~]#




Re: how config multi regionserver, or what is wrong?

Posted by tgh <gu...@ia.ac.cn>.
Thank you for your reply.
The configuration files are as follows.
Could you help me,


Thank you
---------------------------
Tian Guanhua



[root@blade1 conf]# cat regionservers 
blade1
blade2
blade3
blade4
blade5
blade6
blade7
blade8
[root@blade1 conf]#
[root@blade1 conf]# vim hbase-site.xml
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>hbase.rootdir</name>
    <value>hdfs://blade1:9000/hbase</value>
    <description>The directory shared by RegionServers.</description>
  </property>
  <property>
    <name>hbase.cluster.distributed</name>
    <value>true</value>
  </property>
  <property>
    <name>hbase.zookeeper.quorum</name>
    <value>blade1,blade2,blade3</value>
  </property>
  <property>
    <name>hbase.zookeeper.property.dataDir</name>
    <value>/home/liuxin/zookeeper/data</value>
  </property>
  <property>
    <name>dfs.support.append</name>
    <value>true</value>
  </property>
  <property>
    <name>dfs.datanode.max.xcievers</name>
    <value>4096</value>
  </property>
  <property>
    <name>hbase.master</name>
    <value>blade1:60000</value>
  </property>
</configuration>








Re: Re: how config multi regionserver, or what is wrong?

Posted by Nick Dimiduk <nd...@gmail.com>.
For starters, you'll need to sync up the clocks on all your machines.
Install ntp or similar and those ClockOutOfSync exceptions will clear up.
Specifically, blade4 looks off.

-n
On Dec 10, 2012 6:18 PM, "tgh" <gu...@ia.ac.cn> wrote:

> Meanwhile , log is master , that is, blade1, is like this,  there are some
> ERRor like this, for
>
> 2012-09-01 06:31:05,558 INFO org.apache.hadoop.hbase.master.ServerManager:
> Registering server=blade2,60020,1346452716636, regionCount=0, userLoad=false
> 2012-09-01 06:31:05,569 WARN org.apache.hadoop.hbase.master.ServerManager:
> Server blade4,60020,1346451768443 has been rejected; Reported time is too
> far out of sync with master.  Time difference of 496371ms > max allowed of
> 30000ms
> 2012-09-01 06:31:05,581 WARN org.apache.hadoop.hbase.master.ServerManager:
> Server blade5,60020,1346452001672 has been rejected; Reported time is too
> far out of sync with master.  Time difference of 263137ms > max allowed of
> 30000ms
> 2012-09-01 06:31:05,583 ERROR org.apache.hadoop.hbase.master.HMaster:
> Region server serverName=blade4,60020,1346451768443, load=(requests=0,
> regions=0, usedHeap=142, maxHeap=966) reported a fatal error:
> ABORTING region server serverName=blade4,60020,1346451768443,
> load=(requests=0, regions=0, usedHeap=142, maxHeap=966): Unhandled
> exception: org.apache.hadoop.hbase.ClockOutOfSyncException: Server
> blade4,60020,1346451768443 has been rejected; Reported time is too far out
> of sync with master.  Time difference of 496371ms > max allowed of 30000ms
> Cause:
> org.apache.hadoop.hbase.ClockOutOfSyncException:
> org.apache.hadoop.hbase.ClockOutOfSyncException: Server
> blade4,60020,1346451768443 has been rejected; Reported time is too far out
> of sync with master.  Time difference of 496371ms > max allowed of 30000ms
>         at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native
> Method)
>         at
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
>         at
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
>         at java.lang.reflect.Constructor.newInstance(Constructor.java:525)
>         at
> org.apache.hadoop.ipc.RemoteException.instantiateException(RemoteException.java:95)
>         at
> org.apache.hadoop.ipc.RemoteException.unwrapRemoteException(RemoteException.java:79)
>         at
> org.apache.hadoop.hbase.regionserver.HRegionServer.reportForDuty(HRegionServer.java:1574)
>         at
> org.apache.hadoop.hbase.regionserver.HRegionServer.tryReportForDuty(HRegionServer.java:1531)
>         at
> org.apache.hadoop.hbase.regionserver.HRegionServer.run(HRegionServer.java:572)
>         at java.lang.Thread.run(Thread.java:722)
> Caused by: org.apache.hadoop.ipc.RemoteException:
> org.apache.hadoop.hbase.ClockOutOfSyncException: Server
> blade4,60020,1346451768443 has been rejected; Reported time is too far out
> of sync with master.  Time difference of 496371ms > max allowed of 30000ms
>         at
> org.apache.hadoop.hbase.master.ServerManager.checkClockSkew(ServerManager.java:193)
>         at
> org.apache.hadoop.hbase.master.ServerManager.regionServerStartup(ServerManager.java:141)
>         at
> org.apache.hadoop.hbase.master.HMaster.regionServerStartup(HMaster.java:675)
>         at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>         at
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>         at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>         at java.lang.reflect.Method.invoke(Method.java:601)
>         at
> org.apache.hadoop.hbase.ipc.HBaseRPC$Server.call(HBaseRPC.java:570)
>         at
> org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:1039)
>
>         at
> org.apache.hadoop.hbase.ipc.HBaseClient.call(HBaseClient.java:771)
>         at
> org.apache.hadoop.hbase.ipc.HBaseRPC$Invoker.invoke(HBaseRPC.java:257)
>         at $Proxy5.regionServerStartup(Unknown Source)
>         at
> org.apache.hadoop.hbase.regionserver.HRegionServer.reportForDuty(HRegionServer.java:1570)
>         ... 3 more
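
The ClockOutOfSyncException above comes from the master's clock-skew check: a region server is rejected when its reported time differs from the master's by more than the configured maximum (30000 ms by default, via the hbase.master.maxclockskew setting in HBase versions of that era); keeping all nodes synced with NTP avoids it. A minimal sketch of that check, not the actual HBase source:

```python
MAX_CLOCK_SKEW_MS = 30_000  # default maximum allowed skew

def check_clock_skew(server_time_ms: int, master_time_ms: int,
                     max_skew_ms: int = MAX_CLOCK_SKEW_MS) -> None:
    # Roughly what ServerManager.checkClockSkew does: reject the region
    # server if its clock is too far from the master's, either direction.
    skew = abs(server_time_ms - master_time_ms)
    if skew > max_skew_ms:
        raise RuntimeError(
            f"Reported time is too far out of sync with master. "
            f"Time difference of {skew}ms > max allowed of {max_skew_ms}ms")
```

With 496371 ms of skew, as in the log above, this check rejects the server; under 30 s it passes.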
>
>
> -----Original Message-----
> From: user-return-32385-guanhua.tian=ia.ac.cn@hbase.apache.org [mailto:
> user-return-32385-guanhua.tian=ia.ac.cn@hbase.apache.org] On Behalf Of tgh
> Sent: December 11, 2012 9:59
> To: user@hbase.apache.org
> Subject: Re: how config multi regionserver, or what is wrong?
>
> And our hosts is follows
>
> [root@blade1 ~]# cat /etc/hosts
> 127.0.0.1   localhost localhost.localdomain localhost4
> localhost4.localdomain4
> ::1         localhost localhost.localdomain localhost6
> localhost6.localdomain6
>
> 192.168.76.233 blade1
> 192.168.76.234 blade2
> 192.168.76.235 blade3
> 192.168.76.236 blade4
> 192.168.76.237 blade5
> 192.168.76.238 blade6
> 192.168.76.239 blade7
> 192.168.76.240 blade8
>
> 192.168.76.245 fnode1
> 192.168.76.246 fnode2
> [root@blade1 ~]#
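
As an aside on the "only one regionserver, localhost:60030" symptom reported in this thread: a frequent cause is a region server whose own hostname resolves to a loopback address instead of the LAN address listed in /etc/hosts. A quick illustrative check (standard library only; the helper names are hypothetical):

```python
import socket

def is_loopback(ip: str) -> bool:
    # Loopback addresses are the 127.0.0.0/8 block.
    return ip.startswith("127.")

def resolves_to_loopback(hostname: str) -> bool:
    # If a region server's own hostname maps to 127.0.0.1 (for example a
    # '127.0.0.1 blade4' entry in /etc/hosts), the server registers itself
    # with the master as localhost, and the master's web UI then shows a
    # single regionserver: localhost:60030.
    try:
        return is_loopback(socket.gethostbyname(hostname))
    except socket.gaierror:
        return False
```

Running this for each blade hostname on each node helps confirm that every server will register under its LAN address.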
>
> -----Original Message-----
> From: user-return-32384-guanhua.tian=ia.ac.cn@hbase.apache.org [mailto:
> user-return-32384-guanhua.tian=ia.ac.cn@hbase.apache.org] On Behalf Of tgh
> Sent: December 11, 2012 9:00
> To: user@hbase.apache.org
> Subject: Re: how config multi regionserver, or what is wrong?
>
> Thank you for your reply,
> And the configuration file is here,
> Could you help me,
>
>
> Thank you
> ---------------------------
> Tian Guanhua
>
>
>
> [root@blade1 conf]# cat regionservers
> blade1
> blade2
> blade3
> blade4
> blade5
> blade6
> blade7
> blade8
> [root@blade1 conf]#
> [root@blade1 conf]# vim hbase-site.xml
> <?xml version="1.0"?>
> <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
> <!--
> -->
> <configuration>
> <property>
>     <name>hbase.rootdir</name>
>     <value>hdfs://blade1:9000/hbase</value>
>     <description>The directory shared by RegionServers.</description>
> </property>
> <property>
>     <name>hbase.cluster.distributed</name>
>     <value>true</value>
> </property>
> <property>
>     <name>hbase.zookeeper.quorum</name>
>     <value>blade1,blade2,blade3</value>
> </property>
>  <property>
>     <name>hbase.zookeeper.property.dataDir</name>
>    <value>/home/liuxin/zookeeper/data</value>
>  </property>
>  <property>
>   <name>dfs.support.append</name>
>    <value>true</value>
>  </property>
>  <property>
>   <name>dfs.datanode.max.xcievers</name>
>    <value>4096</value>
>    </property>
>  <property>
>  <name>hbase.master</name>
>  <value>blade1:60000</value>
>  </property>
> </configuration>
>
>
>
>
>
>
> -----Original Message-----
> From: user-return-32370-guanhua.tian=ia.ac.cn@hbase.apache.org [mailto:
> user-return-32370-guanhua.tian=ia.ac.cn@hbase.apache.org] On Behalf Of Jean-Marc
> Spaggiari
> Sent: December 10, 2012 20:54
> To: user@hbase.apache.org
> Subject: Re: how config multi regionserver, or what is wrong?
>
> Hi Tian,
>
> Can you share your configuration files?
>
> Do you have something like that on your hbase-site.xml file?
>
>   <property>
>     <name>hbase.cluster.distributed</name>
>     <value>true</value>
>     <description>The mode the cluster will be in. Possible values are
>       false: standalone and pseudo-distributed setups with managed
> Zookeeper
>       true: fully-distributed with unmanaged Zookeeper Quorum (see
> hbase-env.sh)
>     </description>
>   </property>
>
>
> JM
>
> 2012/12/10, tgh <gu...@ia.ac.cn>:
> > Hi
> >       I try to use hbase, and now ,I have a problem with hbase
> > configuration, I use 8node for try, and it seems to work, hadoop,
> > zookeeper, hbase all boot up, and it could insert into with putAPI ,
> > But when I try to use masterIP:60010 to manage it, I find there are
> > only one regionserver there, and it is localhost:60030, why?
> >       I have set regionserver file , there are 8 nodes there, and all
> could
> > ssh to each other without passwd,
> >
> >       And I also use hdfsNamenodeIP:50070 to see HDFS, and it seems OK,
> the
> > data have been balanced across the 8 nodes, but I wonder why hbase only
> > has one regionserver working, although when I start/stop the
> > regionservers, there are 8 region servers to start and stop,
> >
> >       And I try to put data into hbase, it seems ok at first, but after
> > 200million in hbase, it seems really hard to insert more into it, it
> > is very slow, and I use masterIP:60010 to manage it, I find there are
> > only one regionserver there, and it is localhost:60030, why?
> >
> >
> >       Could you help me,
> >
> >
> >
> > Thank you
> > -------------------
> > Tian Guanhua
> >
> >
> >
> >
> >
> >
> >
>
>
>
>
>

Re: how config multi regionserver, or what is wrong?

Posted by Jean-Marc Spaggiari <je...@spaggiari.org>.
Hi Tian,

Can you share your configuration files?

Do you have something like that on your hbase-site.xml file?

  <property>
    <name>hbase.cluster.distributed</name>
    <value>true</value>
    <description>The mode the cluster will be in. Possible values are
      false: standalone and pseudo-distributed setups with managed Zookeeper
      true: fully-distributed with unmanaged Zookeeper Quorum (see hbase-env.sh)
    </description>
  </property>


JM

2012/12/10, tgh <gu...@ia.ac.cn>:
> Hi
> 	I try to use hbase, and now ,I have a problem with hbase
> configuration, I use 8node for try, and it seems to work, hadoop,
> zookeeper,
> hbase all boot up, and it could insert into with putAPI ,
> But when I try to use masterIP:60010 to manage it, I find there are only
> one
> regionserver there, and it is localhost:60030, why?
> 	I have set regionserver file , there are 8 nodes there, and all
> could ssh to each other without passwd,
>
> 	And I also use hdfsNamenodeIP:50070 to see HDFS, and it seems OK,
> the data have been balanced across the 8 nodes, but I wonder why hbase only
> has one regionserver working, although when I start/stop the regionservers,
> there are 8 region servers to start and stop,
>
> 	And I try to put data into hbase, it seems ok at first, but after
> 200million in hbase, it seems really hard to insert more into it, it is
> very
> slow, and I use masterIP:60010 to manage it, I find there are only one
> regionserver there, and it is localhost:60030, why?
>
> 	
> 	Could you help me,
>
> 	
>
> Thank you
> -------------------
> Tian Guanhua
>
>
>
>
>
>
>

how config multi regionserver, or what is wrong?

Posted by tgh <gu...@ia.ac.cn>.
Hi
	I try to use hbase, and now I have a problem with hbase
configuration, I use 8 nodes for a trial, and it seems to work, hadoop,
zookeeper, hbase all boot up, and I could insert data with the Put API,
But when I try to use masterIP:60010 to manage it, I find there is only one
regionserver there, and it is localhost:60030, why?
	I have set the regionservers file, there are 8 nodes there, and all
can ssh to each other without a passwd,

	And I also use hdfsNamenodeIP:50070 to see HDFS, and it seems OK,
the data have been balanced across the 8 nodes, but I wonder why hbase only
has one regionserver working, although when I start/stop the regionservers,
there are 8 region servers to start and stop,

	And I try to put data into hbase, it seems ok at first, but after
200million rows in hbase, it seems really hard to insert more into it, it is
very slow, and I use masterIP:60010 to manage it, I find there is only one
regionserver there, and it is localhost:60030, why?

	
	Could you help me,
 
	

Thank you
-------------------
Tian Guanhua







Re: Re: Re: how to store 100billion short text messages with hbase

Posted by Ian Varley <iv...@salesforce.com>.
Tian,

By sharding the table manually along the time dimension (which is what you're talking about: 365 * 24 different tables, one per hour), you can reduce the amount of data any one query has to deal with, because you can instruct your query to only go to the right table. However, that's (roughly) the same effect you'd get by making the time dimension the first part of your row key in HBase, and allowing HBase to do that "sharding" work for you, into Regions. The whole point of HBase is that it's a horizontally scalable database, which handles sharding of arbitrarily large (Petabyte-size) data sets into smaller, more manageable chunks called regions, and then manages running those regions smoothly even when machines fail. If you're going to do all of that yourself, you'd be better off using something like MySQL.

(I say "roughly" above because by default, HBase won't choose nice even boundaries like a single hour for your region boundaries, so a query that wants to scan over an hour's worth of data might have to hit two regions instead of one; but that won't make much difference (in fact, it'll improve performance because the scan can be performed in parallel on the two region servers). You can also change that behavior by implementing a custom region split policy (see HBASE-5304<https://issues.apache.org/jira/browse/HBASE-5304>); but you shouldn't need to do that; functionally, it's the same thing.)
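
Ian's suggestion, one table with a time-leading row key instead of per-hour tables, can be sketched as follows. The 8-byte-timestamp plus 8-byte-id layout and the helper names here are illustrative, not part of the HBase API; since HBase compares row keys as byte strings, a big-endian timestamp prefix keeps each hour's messages contiguous:

```python
import struct

def row_key(ts_ms, unique_id):
    # Big-endian packing preserves numeric order under HBase's
    # lexicographic byte comparison, so rows sort by time first.
    return struct.pack(">QQ", ts_ms, unique_id)

def hour_scan_range(hour_start_ms):
    # Start (inclusive) and stop (exclusive) row keys for a Scan that
    # covers exactly one hour of messages, whatever regions it spans.
    one_hour_ms = 3600 * 1000
    return (struct.pack(">Q", hour_start_ms),
            struct.pack(">Q", hour_start_ms + one_hour_ms))
```

A Scan bounded by these start/stop rows replaces the per-hour table lookup, and touches at most a couple of regions rather than a separate table per hour.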

If you're still confused about why regions are better than performing the sharding yourself, I'd recommend reading the links I sent in the previous email.

Ian

On Dec 6, 2012, at 2:01 AM, tgh wrote:

Meanwhile, we need lucene to retrieve messages by keywords or content in the
message, after NLP parse processing, and to do it without a timestamp or
messageID; it is a time critical operation,
And we do read one hour of data, not with lucene, but by table name, if we
use an hourly timestamp as the tablename, such as 2012120612 for the table of
data for 12 o'clock on Dec 6 2012, and it is about 100million to 200million
messages in a table, it is not a very time critical operation,
And if we have 365*24 tables for one year, does it work, or if we put one
year of data in ONE table, will it work faster than multi tables, and why?
How does hbase manage ONE table and how does it handle many tables,
I am really confused,

Could you help me

Thank you
------------------------------------
Tian Guanhua



-----Original Message-----
From: user-return-32260-guanhua.tian=ia.ac.cn@hbase.apache.org<ma...@hbase.apache.org>
[mailto:user-return-32260-guanhua.tian=ia.ac.cn@hbase.apache.org] On Behalf Of tgh
Sent: December 6, 2012 15:27
To: user@hbase.apache.org<ma...@hbase.apache.org>
Subject: Re: Re: how to store 100billion short text messages with hbase

Thank you for your reply

And in my case, we need to use the lucene search engine to retrieve short
messages in hbase, and this operation is time critical,
and we also need to access the last hour's data in hbase, that is, read out one
hour of data from hbase, and this operation is not very time critical, and one
hour of data is about 100 million or 200 million messages,
Meanwhile, when lucene retrieves data from hbase, it may get 1k or 100k
messages for results, and we need to guarantee this is fast enough,
And for this case, if we use one table, when lucene uses any message, hbase
needs to handle and locate 100billion messages itself, if we use 365*24 tables
or 365 tables, hbase needs to handle and locate many fewer messages,

I am really confused ,why ONE table is more suitable than multi table,
Could you give me some help,

Thank you
-------------------------
Tian Guanhua



-----Original Message-----
From: user-return-32251-guanhua.tian=ia.ac.cn@hbase.apache.org<ma...@hbase.apache.org>
[mailto:user-return-32251-guanhua.tian=ia.ac.cn@hbase.apache.org] On Behalf Of Ian
Varley
Sent: December 6, 2012 11:44
To: user@hbase.apache.org<ma...@hbase.apache.org>
Subject: Re: Re: how to store 100billion short text messages with hbase

In this case, your best bet may be to come up with an ID structure for these
messages that incorporates (leads with) the timestamp; then have Lucene use
that as the key when retrieving any given message. For example, the ID could
consist of:

{timestamp} + {unique id}

(Beware: if you're going to load data with this schema in real time, you'll
hot spot one region server; see http://hbase.apache.org/book.html#timeseries
for considerations related to this.)

Then, you can either scan over all data from one time period, or GET a
particular message by this (combined) unique ID. There are also types of
UUIDs that work in this way. But, with that much data, you may want to tune
it to get the smallest possible row key; depending on the granularity of
your timestamp and how unique the "unique" part really needs to be, you
might be able to get this down to < 16 bytes. (Consider that the smallest
possible unique representation of 100B items is about 37 bits - that is, log
base 2 of 100 billion; but because you also want time to be a part of it, you
probably can't get anywhere near that small).

If you need to scan over LOTS of data (as opposed to just looking up single
messages, or small sequential chunks of messages), consider just writing the
data to a file in HDFS and using map/reduce to process it. Scanning all 100B
of your records won't be possible in any short time frame (by my estimate
that would take about 10 hours), but you could do that with map/reduce using
an asynchronous model.

One table is still best for this; read up on what Regions are and why they
mean you don't need multiple tables for the same data:
http://hbase.apache.org/book.html#regions.arch

There are no secondary indexes in HBase:
http://hbase.apache.org/book.html#secondary.indexes. If you use Lucene for
this, it'd need its own storage (though there are indeed projects that run
Lucene on top of HBase: http://www.infoq.com/articles/LuceneHbase).

Ian


On Dec 5, 2012, at 9:28 PM, tgh wrote:

Thank you for your reply

And I want to access the data with the lucene search engine, that is, with a
key to retrieve any message, and I also want to get one hour of data together,
so I think to split the data into one table per hour, or if I can store it in
one big table, is it better than storing it in 365 tables or in 365*24 tables,
which one is best for my data access schema, and I am also confused about how
to make a secondary index in hbase, if I use some keyword search engine,
lucene or other


Could you help me
Thank you

-------------
Tian Guanhua



-----Original Message-----
From:
user-return-32247-guanhua.tian=ia.ac.cn@hbase.apache.org<mailto:user-return-
32247-guanhua.tian=ia.ac.cn@hbase.apache.org>
[mailto:user-return-32247-guanhua.tian=ia.ac.cn@hbase.apache.org] On Behalf Of Ian
Varley
Sent: December 6, 2012 11:01
To: user@hbase.apache.org<ma...@hbase.apache.org>
Subject: Re: how to store 100billion short text messages with hbase

Tian,

The best way to think about how to structure your data in HBase is to ask
the question: "How will I access it?". Perhaps you could reply with the
sorts of queries you expect to be able to do over this data? For example,
retrieve any single conversation between two people in < 10 ms; or show all
conversations that happened in a single hour, regardless of participants.
HBase only gives you fast GET/SCAN access along a single "primary" key (the
row key) so you must choose it carefully, or else duplicate & denormalize
your data for fast access.

Your data size seems reasonable (but not overwhelming) for HBase. 100B
messages x 1K bytes per message on average comes out to 100TB. That, plus 3x
replication in HDFS, means you need roughly 300TB of space. If you have 13
nodes (taking out 2 for redundant master services) that's a requirement for
about 23T of space per server. That's a lot, even these days. Did I get all
that math right?
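
The capacity estimate in the previous paragraph does check out; as a quick sanity check (pure arithmetic, using the node counts assumed in the email):

```python
messages = 100e9          # 100 billion messages
bytes_per_message = 1000  # ~1 KB each, on average
replication = 3           # HDFS default replication factor
data_nodes = 13           # 15 nodes minus 2 reserved for master services

raw_tb = messages * bytes_per_message / 1e12   # 100 TB of raw data
replicated_tb = raw_tb * replication           # 300 TB on disk
per_node_tb = replicated_tb / data_nodes       # ~23 TB per server
```

That is before compression; short text usually compresses well, which can bring the per-node figure down considerably.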

On your question about multiple tables: a table in HBase is only a namespace
for rowkeys, and a container for a set of regions. If it's a homogenous data
set, there's no advantage to breaking the table into multiple tables; that's
what regions within the table are for.

Ian

ps - Please don't cross post to both dev@ and user@.

On Dec 5, 2012, at 8:51 PM, tgh wrote:

Hi
I try to use hbase to store 100billion short text messages, each message
has less than 1000 characters and some other items, that is, each message
has less than 10 items, The whole data is a stream for about one year, and I
want to create multi tables to store these data, I have two ideas, the one
is to store the data of one hour in one table, and for one year of data, there
are 365*24 tables, the other is to store the data of one day in one table,
and for one year, there are 365 tables,

And I have about 15 computer nodes to handle these data, and I want to know
how to deal with these data, the one for 365*24 tables, or the one for 365
tables, or other better ideas,

I am really confused about hbase, it is powerful yet a bit complex for me,
isn't it?
Could you give me some advice on hbase data schema and other things, Could you
help me,


Thank you
---------------------------------
Tian Guanhua














Re: Re: how to store 100billion short text messages with hbase

Posted by tgh <gu...@ia.ac.cn>.
Meanwhile, we need lucene to retrieve messages by keywords or content in the
message, after NLP parse processing, and to do it without a timestamp or
messageID; it is a time critical operation,
And we do read one hour of data, not with lucene, but by table name, if we
use an hourly timestamp as the tablename, such as 2012120612 for the table of
data for 12 o'clock on Dec 6 2012, and it is about 100million to 200million
messages in a table, it is not a very time critical operation,
And if we have 365*24 tables for one year, does it work, or if we put one
year of data in ONE table, will it work faster than multi tables, and why?
How does hbase manage ONE table and how does it handle many tables,
I am really confused, 

Could you help me

Thank you
------------------------------------
Tian Guanhua



-----Original Message-----
From: user-return-32260-guanhua.tian=ia.ac.cn@hbase.apache.org
[mailto:user-return-32260-guanhua.tian=ia.ac.cn@hbase.apache.org] On Behalf Of tgh
Sent: December 6, 2012 15:27
To: user@hbase.apache.org
Subject: Re: Re: how to store 100billion short text messages with hbase

Thank you for your reply

And in my case, we need to use the lucene search engine to retrieve short
messages in hbase, and this operation is time critical,
and we also need to access the last hour's data in hbase, that is, read out one
hour of data from hbase, and this operation is not very time critical, and one
hour of data is about 100 million or 200 million messages,
Meanwhile, when lucene retrieves data from hbase, it may get 1k or 100k
messages for results, and we need to guarantee this is fast enough,
And for this case, if we use one table, when lucene uses any message, hbase
needs to handle and locate 100billion messages itself, if we use 365*24 tables
or 365 tables, hbase needs to handle and locate many fewer messages,

I am really confused ,why ONE table is more suitable than multi table, 
Could you give me some help, 

Thank you 
-------------------------
Tian Guanhua



-----Original Message-----
From: user-return-32251-guanhua.tian=ia.ac.cn@hbase.apache.org
[mailto:user-return-32251-guanhua.tian=ia.ac.cn@hbase.apache.org] On Behalf Of Ian
Varley
Sent: December 6, 2012 11:44
To: user@hbase.apache.org
Subject: Re: Re: how to store 100billion short text messages with hbase

In this case, your best bet may be to come up with an ID structure for these
messages that incorporates (leads with) the timestamp; then have Lucene use
that as the key when retrieving any given message. For example, the ID could
consist of:

{timestamp} + {unique id}

(Beware: if you're going to load data with this schema in real time, you'll
hot spot one region server; see http://hbase.apache.org/book.html#timeseries
for considerations related to this.)

Then, you can either scan over all data from one time period, or GET a
particular message by this (combined) unique ID. There are also types of
UUIDs that work in this way. But, with that much data, you may want to tune
it to get the smallest possible row key; depending on the granularity of
your timestamp and how unique the "unique" part really needs to be, you
might be able to get this down to < 16 bytes. (Consider that the smallest
possible unique representation of 100B items is about 37 bits - that is, log
base 2 of 100 billion; but because you also want time to be a part of it, you
probably can't get anywhere near that small).

If you need to scan over LOTS of data (as opposed to just looking up single
messages, or small sequential chunks of messages), consider just writing the
data to a file in HDFS and using map/reduce to process it. Scanning all 100B
of your records won't be possible in any short time frame (by my estimate
that would take about 10 hours), but you could do that with map/reduce using
an asynchronous model.

One table is still best for this; read up on what Regions are and why they
mean you don't need multiple tables for the same data:
http://hbase.apache.org/book.html#regions.arch

There are no secondary indexes in HBase:
http://hbase.apache.org/book.html#secondary.indexes. If you use Lucene for
this, it'd need its own storage (though there are indeed projects that run
Lucene on top of HBase: http://www.infoq.com/articles/LuceneHbase).

Ian


On Dec 5, 2012, at 9:28 PM, tgh wrote:

Thank you for your reply

And I want to access the data with the lucene search engine, that is, with a
key to retrieve any message, and I also want to get one hour of data together,
so I think to split the data into one table per hour, or if I can store it in
one big table, is it better than storing it in 365 tables or in 365*24 tables,
which one is best for my data access schema, and I am also confused about how
to make a secondary index in hbase, if I use some keyword search engine,
lucene or other


Could you help me
Thank you

-------------
Tian Guanhua



-----Original Message-----
From:
user-return-32247-guanhua.tian=ia.ac.cn@hbase.apache.org<mailto:user-return-
32247-guanhua.tian=ia.ac.cn@hbase.apache.org>
[mailto:user-return-32247-guanhua.tian=ia.ac.cn@hbase.apache.org] On Behalf Of Ian
Varley
Sent: December 6, 2012 11:01
To: user@hbase.apache.org<ma...@hbase.apache.org>
Subject: Re: how to store 100billion short text messages with hbase

Tian,

The best way to think about how to structure your data in HBase is to ask
the question: "How will I access it?". Perhaps you could reply with the
sorts of queries you expect to be able to do over this data? For example,
retrieve any single conversation between two people in < 10 ms; or show all
conversations that happened in a single hour, regardless of participants.
HBase only gives you fast GET/SCAN access along a single "primary" key (the
row key) so you must choose it carefully, or else duplicate & denormalize
your data for fast access.

Your data size seems reasonable (but not overwhelming) for HBase. 100B
messages x 1K bytes per message on average comes out to 100TB. That, plus 3x
replication in HDFS, means you need roughly 300TB of space. If you have 13
nodes (taking out 2 for redundant master services) that's a requirement for
about 23T of space per server. That's a lot, even these days. Did I get all
that math right?

On your question about multiple tables: a table in HBase is only a namespace
for rowkeys, and a container for a set of regions. If it's a homogenous data
set, there's no advantage to breaking the table into multiple tables; that's
what regions within the table are for.

Ian

ps - Please don't cross post to both dev@ and user@.

On Dec 5, 2012, at 8:51 PM, tgh wrote:

Hi
I try to use hbase to store 100billion short text messages, each message
has less than 1000 characters and some other items, that is, each message
has less than 10 items, The whole data is a stream for about one year, and I
want to create multi tables to store these data, I have two ideas, the one
is to store the data of one hour in one table, and for one year of data, there
are 365*24 tables, the other is to store the data of one day in one table,
and for one year, there are 365 tables,

And I have about 15 computer nodes to handle these data, and I want to know
how to deal with these data, the one for 365*24 tables, or the one for 365
tables, or other better ideas,

I am really confused about hbase, it is powerful yet a bit complex for me,
isn't it?
Could you give me some advice on hbase data schema and other things, Could you
help me,


Thank you
---------------------------------
Tian Guanhua













Re: Re: how to store 100billion short text messages with hbase

Posted by tgh <gu...@ia.ac.cn>.
Thank you for your reply

And in my case, we need to use the lucene search engine to retrieve short
messages in hbase, and this operation is time critical,
and we also need to access the last hour's data in hbase, that is, read out one
hour of data from hbase, and this operation is not very time critical, and one
hour of data is about 100 million or 200 million messages,
Meanwhile, when lucene retrieves data from hbase, it may get 1k or 100k
messages for results, and we need to guarantee this is fast enough,
And for this case, if we use one table, when lucene uses any message, hbase
needs to handle and locate 100billion messages itself, if we use 365*24 tables
or 365 tables, hbase needs to handle and locate many fewer messages,

I am really confused ,why ONE table is more suitable than multi table, 
Could you give me some help, 

Thank you 
-------------------------
Tian Guanhua



-----Original Message-----
From: user-return-32251-guanhua.tian=ia.ac.cn@hbase.apache.org
[mailto:user-return-32251-guanhua.tian=ia.ac.cn@hbase.apache.org] On Behalf Of Ian
Varley
Sent: December 6, 2012 11:44
To: user@hbase.apache.org
Subject: Re: Re: how to store 100billion short text messages with hbase

In this case, your best bet may be to come up with an ID structure for these
messages that incorporates (leads with) the timestamp; then have Lucene use
that as the key when retrieving any given message. For example, the ID could
consist of:

{timestamp} + {unique id}

(Beware: if you're going to load data with this schema in real time, you'll
hot spot one region server; see http://hbase.apache.org/book.html#timeseries
for considerations related to this.)

Then, you can either scan over all data from one time period, or GET a
particular message by this (combined) unique ID. There are also types of
UUIDs that work in this way. But, with that much data, you may want to tune
it to get the smallest possible row key; depending on the granularity of
your timestamp and how unique the "unique" part really needs to be, you
might be able to get this down to < 16 bytes. (Consider that the smallest
possible unique representation of 100B items is about 37 bits - that is, log
base 2 of 100 billion; but because you also want time to be a part of it, you
probably can't get anywhere near that small).

If you need to scan over LOTS of data (as opposed to just looking up single
messages, or small sequential chunks of messages), consider just writing the
data to a file in HDFS and using map/reduce to process it. Scanning all 100B
of your records won't be possible in any short time frame (by my estimate
that would take about 10 hours), but you could do that with map/reduce using
an asynchronous model.

One table is still best for this; read up on what Regions are and why they
mean you don't need multiple tables for the same data:
http://hbase.apache.org/book.html#regions.arch

There are no secondary indexes in HBase:
http://hbase.apache.org/book.html#secondary.indexes. If you use Lucene for
this, it'd need its own storage (though there are indeed projects that run
Lucene on top of HBase: http://www.infoq.com/articles/LuceneHbase).

Ian


On Dec 5, 2012, at 9:28 PM, tgh wrote:

Thank you for your reply

And I want to access the data with the lucene search engine, that is, with a
key to retrieve any message, and I also want to get one hour of data together,
so I think to split the data into one table per hour, or if I can store it in
one big table, is it better than storing it in 365 tables or in 365*24 tables,
which one is best for my data access schema, and I am also confused about how
to make a secondary index in hbase, if I use some keyword search engine,
lucene or other


Could you help me
Thank you

-------------
Tian Guanhua



-----Original Message-----
From:
user-return-32247-guanhua.tian=ia.ac.cn@hbase.apache.org<mailto:user-return-
32247-guanhua.tian=ia.ac.cn@hbase.apache.org>
[mailto:user-return-32247-guanhua.tian=ia.ac.cn@hbase.apache.org] On Behalf Of Ian
Varley
Sent: December 6, 2012 11:01
To: user@hbase.apache.org<ma...@hbase.apache.org>
Subject: Re: how to store 100billion short text messages with hbase

Tian,

The best way to think about how to structure your data in HBase is to ask
the question: "How will I access it?". Perhaps you could reply with the
sorts of queries you expect to be able to do over this data? For example,
retrieve any single conversation between two people in < 10 ms; or show all
conversations that happened in a single hour, regardless of participants.
HBase only gives you fast GET/SCAN access along a single "primary" key (the
row key) so you must choose it carefully, or else duplicate & denormalize
your data for fast access.

Your data size seems reasonable (but not overwhelming) for HBase. 100B
messages x 1K bytes per message on average comes out to 100TB. That, plus 3x
replication in HDFS, means you need roughly 300TB of space. If you have 13
nodes (taking out 2 for redundant master services) that's a requirement for
about 23T of space per server. That's a lot, even these days. Did I get all
that math right?

On your question about multiple tables: a table in HBase is only a namespace
for rowkeys, and a container for a set of regions. If it's a homogenous data
set, there's no advantage to breaking the table into multiple tables; that's
what regions within the table are for.

Ian

ps - Please don't cross post to both dev@ and user@.

On Dec 5, 2012, at 8:51 PM, tgh wrote:

Hi
I try to use hbase to store 100billion short text messages, each message
has less than 1000 characters and some other items, that is, each message
has less than 10 items, The whole data is a stream for about one year, and I
want to create multi tables to store these data, I have two ideas, the one
is to store the data of one hour in one table, and for one year of data, there
are 365*24 tables, the other is to store the data of one day in one table,
and for one year, there are 365 tables,

And I have about 15 computer nodes to handle these data, and I want to know
how to deal with these data, the one for 365*24 tables, or the one for 365
tables, or other better ideas,

I am really confused about hbase, it is powerful yet a bit complex for me,
isn't it?
Could you give me some advice on hbase data schema and other things, Could you
help me,


Thank you
---------------------------------
Tian Guanhua












Re: Re: how to store 100billion short text messages with hbase

Posted by Ian Varley <iv...@salesforce.com>.
In this case, your best bet may be to come up with an ID structure for these messages that incorporates (leads with) the timestamp; then have Lucene use that as the key when retrieving any given message. For example, the ID could consist of:

{timestamp} + {unique id}

(Beware: if you're going to load data with this schema in real time, you'll hot spot one region server; see http://hbase.apache.org/book.html#timeseries for considerations related to this.)

Then, you can either scan over all data from one time period, or GET a particular message by this (combined) unique ID. There are also types of UUIDs that work in this way. But, with that much data, you may want to tune it to get the smallest possible row key; depending on the granularity of your timestamp and how unique the "unique" part really needs to be, you might be able to get this down to < 16 bytes. (Consider that the smallest possible unique representation of 100B items is about 37 bits - that is, log base 2 of 100 billion; but because you also want time to be a part of it, you probably can't get anywhere near that small).
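
The key-width arithmetic here can be checked directly; the 8-byte timestamp plus 8-byte id packing below is just one illustrative layout that lands on the 16-byte figure mentioned, not a minimal encoding:

```python
import math
import struct

# Bits needed to uniquely number 100 billion (1e11) messages:
id_bits = math.ceil(math.log2(100e9))  # about 37 bits

# A simple, non-minimal 16-byte layout: millisecond timestamp plus a
# sequence number, both packed as big-endian unsigned 64-bit integers.
key = struct.pack(">QQ", 1354759200000, 12345)
```

Squeezing below 16 bytes would mean a coarser timestamp or fewer id bits, packed into fewer than two full 64-bit words.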

If you need to scan over LOTS of data (as opposed to just looking up single messages, or small sequential chunks of messages), consider just writing the data to a file in HDFS and using map/reduce to process it. Scanning all 100B of your records won't be possible in any short time frame (by my estimate that would take about 10 hours), but you could do that with map/reduce using an asynchronous model.
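A back-of-envelope check on that estimate. The ~220 MB/s per-node sequential
read rate is an assumed figure for spinning disks of that era, not a number
from the post:

```python
total_bytes = 100e9 * 1000      # 100B messages x ~1 KB each = 100 TB raw
data_nodes = 13                 # 15 machines minus 2 for master services
scan_rate = 220e6               # assumed sequential read rate per node, bytes/sec
hours = total_bytes / (data_nodes * scan_rate) / 3600
# with these assumptions, a full scan of all records takes roughly 10 hours
```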

One table is still best for this; read up on what Regions are and why they mean you don't need multiple tables for the same data: http://hbase.apache.org/book.html#regions.arch

There are no secondary indexes in HBase: http://hbase.apache.org/book.html#secondary.indexes. If you use Lucene for this, it'd need its own storage (though there are indeed projects that run Lucene on top of HBase: http://www.infoq.com/articles/LuceneHbase).

Ian


On Dec 5, 2012, at 9:28 PM, tgh wrote:

Thank you for your reply

And I want to access the data with the Lucene search engine, that is,
retrieve any message by its key, and I also want to fetch one hour of data at
a time, which is why I thought of splitting the data into one table per hour.
Or, if I store it all in one big table, is that better than 365 tables or
365*24 tables? Which layout best fits my access pattern? I am also confused
about how to build a secondary index in HBase if I use a keyword search
engine such as Lucene.


Could you help me
Thank you

-------------
Tian Guanhua



-----Original Message-----
From: user-return-32247-guanhua.tian=ia.ac.cn@hbase.apache.org<ma...@hbase.apache.org>
[mailto:user-return-32247-guanhua.tian=ia.ac.cn@hbase.apache.org] on behalf of Ian
Varley
Sent: December 6, 2012 11:01
To: user@hbase.apache.org<ma...@hbase.apache.org>
Subject: Re: how to store 100billion short text messages with hbase
主题: Re: how to store 100billion short text messages with hbase

Tian,

The best way to think about how to structure your data in HBase is to ask
the question: "How will I access it?". Perhaps you could reply with the
sorts of queries you expect to be able to do over this data? For example,
retrieve any single conversation between two people in < 10 ms; or show all
conversations that happened in a single hour, regardless of participants.
HBase only gives you fast GET/SCAN access along a single "primary" key (the
row key) so you must choose it carefully, or else duplicate & denormalize
your data for fast access.

Your data size seems reasonable (but not overwhelming) for HBase. 100B
messages x 1K bytes per message on average comes out to 100TB. That, plus 3x
replication in HDFS, means you need roughly 300TB of space. If you have 13
nodes (taking out 2 for redundant master services) that's a requirement for
about 23T of space per server. That's a lot, even these days. Did I get all
that math right?

On your question about multiple tables: a table in HBase is only a namespace
for rowkeys, and a container for a set of regions. If it's a homogeneous data
set, there's no advantage to breaking the table into multiple tables; that's
what regions within the table are for.

Ian

ps - Please don't cross post to both dev@ and user@.

On Dec 5, 2012, at 8:51 PM, tgh wrote:

Hi
I try to use hbase to store 100billion short texts messages, each
message has less than 1000 character and some other items, that is,
each messages has less than 10 items,
The whole data is a stream for about one year, and I want to create
multi tables to store these data, I have two ideas, the one is to
store the data in one hour in one table, and for one year data, there
are 365*24 tables, the other is to store the date in one day in one
table, and for one year , there are 365 tables,

And I have about 15 computer nodes to handle these data, and I want
to know how to deal with these data, the one for 365*24 tables , or
the one for 365 tables, or other better ideas,

I am really confused about hbase, it is powerful yet a bit complex
for me , is it?
Could you give me some advice for hbase data schema and others,
Could you help me,


Thank you
---------------------------------
Tian Guanhua










答复: how to store 100billion short text messages with hbase

Posted by tgh <gu...@ia.ac.cn>.
Thank you for your reply

And I want to access the data with the Lucene search engine, that is,
retrieve any message by its key, and I also want to fetch one hour of data at
a time, which is why I thought of splitting the data into one table per hour.
Or, if I store it all in one big table, is that better than 365 tables or
365*24 tables? Which layout best fits my access pattern? I am also confused
about how to build a secondary index in HBase if I use a keyword search
engine such as Lucene.


Could you help me
Thank you

-------------
Tian Guanhua



-----Original Message-----
From: user-return-32247-guanhua.tian=ia.ac.cn@hbase.apache.org
[mailto:user-return-32247-guanhua.tian=ia.ac.cn@hbase.apache.org] on behalf of Ian
Varley
Sent: December 6, 2012 11:01
To: user@hbase.apache.org
Subject: Re: how to store 100billion short text messages with hbase

Tian,

The best way to think about how to structure your data in HBase is to ask
the question: "How will I access it?". Perhaps you could reply with the
sorts of queries you expect to be able to do over this data? For example,
retrieve any single conversation between two people in < 10 ms; or show all
conversations that happened in a single hour, regardless of participants.
HBase only gives you fast GET/SCAN access along a single "primary" key (the
row key) so you must choose it carefully, or else duplicate & denormalize
your data for fast access.

Your data size seems reasonable (but not overwhelming) for HBase. 100B
messages x 1K bytes per message on average comes out to 100TB. That, plus 3x
replication in HDFS, means you need roughly 300TB of space. If you have 13
nodes (taking out 2 for redundant master services) that's a requirement for
about 23T of space per server. That's a lot, even these days. Did I get all
that math right?

On your question about multiple tables: a table in HBase is only a namespace
for rowkeys, and a container for a set of regions. If it's a homogeneous data
set, there's no advantage to breaking the table into multiple tables; that's
what regions within the table are for.

Ian

ps - Please don't cross post to both dev@ and user@.

On Dec 5, 2012, at 8:51 PM, tgh wrote:

> Hi
> 	I try to use hbase to store 100billion short texts messages, each 
> message has less than 1000 character and some other items, that is, 
> each messages has less than 10 items,
> 	The whole data is a stream for about one year, and I want to create 
> multi tables to store these data, I have two ideas, the one is to 
> store the data in one hour in one table, and for one year data, there 
> are 365*24 tables, the other is to store the date in one day in one 
> table, and for one year , there are 365 tables,
> 
> 	And I have about 15 computer nodes to handle these data, and I want 
> to know how to deal with these data, the one for 365*24 tables , or 
> the one for 365 tables, or other better ideas,
> 
> 	I am really confused about hbase, it is powerful yet a bit complex 
> for me , is it?
> 	Could you give me some advice for hbase data schema and others,
> 	Could you help me,
> 
> 
> Thank you
> ---------------------------------
> Tian Guanhua
> 
> 
> 
> 
> 
> 



Re: how to store 100billion short text messages with hbase

Posted by Ian Varley <iv...@salesforce.com>.
Tian,

The best way to think about how to structure your data in HBase is to ask the question: "How will I access it?". Perhaps you could reply with the sorts of queries you expect to be able to do over this data? For example, retrieve any single conversation between two people in < 10 ms; or show all conversations that happened in a single hour, regardless of participants. HBase only gives you fast GET/SCAN access along a single "primary" key (the row key) so you must choose it carefully, or else duplicate & denormalize your data for fast access.

Your data size seems reasonable (but not overwhelming) for HBase. 100B messages x 1K bytes per message on average comes out to 100TB. That, plus 3x replication in HDFS, means you need roughly 300TB of space. If you have 13 nodes (taking out 2 for redundant master services) that's a requirement for about 23T of space per server. That's a lot, even these days. Did I get all that math right?
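The arithmetic above, spelled out (all figures come straight from the thread;
only the rounding is mine):

```python
messages = 100e9                 # 100 billion messages
bytes_per_message = 1000         # < 1000 characters each, so ~1 KB
raw = messages * bytes_per_message          # 1e14 bytes = 100 TB
with_replication = raw * 3                  # 3x HDFS replication -> 300 TB
data_nodes = 15 - 2                         # 2 nodes reserved for master services
per_node = with_replication / data_nodes    # ~23 TB of disk per server
```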

On your question about multiple tables: a table in HBase is only a namespace for rowkeys, and a container for a set of regions. If it's a homogeneous data set, there's no advantage to breaking the table into multiple tables; that's what regions within the table are for.

Ian

ps - Please don't cross post to both dev@ and user@.

On Dec 5, 2012, at 8:51 PM, tgh wrote:

> Hi
> 	I try to use hbase to store 100billion short texts messages, each
> message has less than 1000 character and some other items, that is, each
> messages has less than 10 items,
> 	The whole data is a stream for about one year, and I want to create
> multi tables to store these data, I have two ideas, the one is to store the
> data in one hour in one table, and for one year data, there are 365*24
> tables, the other is to store the date in one day in one table, and for one
> year , there are 365 tables,
> 
> 	And I have about 15 computer nodes to handle these data, and I want
> to know how to deal with these data, the one for 365*24 tables , or the one
> for 365 tables, or other better ideas, 
> 
> 	I am really confused about hbase, it is powerful yet a bit complex
> for me , is it?
> 	Could you give me some advice for hbase data schema and others,
> 	Could you help me,
> 
> 
> Thank you
> ---------------------------------
> Tian Guanhua
> 
> 
> 
> 
> 
>