Posted to dev@hbase.apache.org by Andrew Purtell <ap...@apache.org> on 2009/12/08 16:48:07 UTC

transfer into AWS will be free of charge through June 2010

> Data Transfer into AWS will be free of
> charge from now through June 30, 2010, making it even easier for
> customers to get their data into AWS. This applies to data transfer
> into Amazon EC2, Amazon S3, Amazon SimpleDB, Amazon Relational Database
> Service, Amazon Simple Queue Service, and Amazon Virtual Private Cloud.
> Other applicable charges for use of these services continue to apply.

So it looks like a true, real-world crawling test will be fine until at least June 30.

I have it on my to-do list to get a one-command, test-and-collect-results version of my Heritrix + MozillaHtmlParser test into the tree as test/contrib/ec2/crawlertest or something like that. This test does the following:
1) Starts up multiple Heritrix2 instances running long-lived crawls -- tries to push the cluster to its maximum write throughput.
2) Runs a CPU-intensive MapReduce job that reads crawled content out of HBase, builds an org.w3c.dom Document object tree using MozillaHtmlParser, and stores the bzip-compressed serialization of the object tree back into HBase.
Object sizes stored to and read out of HBase follow a real-world size distribution by definition.
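The compressed-serialization step in 2) can be sketched in isolation. This is only a minimal, self-contained stand-in, not the test itself: the real job parses crawled HTML into a DOM tree with MozillaHtmlParser, compresses with bzip, and reads/writes HBase; here a plain String stands in for the parsed tree and JDK GZIP stands in for bzip so the sketch runs without a cluster or extra jars.

```java
import java.io.*;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Sketch: serialize an object tree, compress the bytes, round-trip them.
// In the real test the byte[] would be Put into / Get from an HBase column.
public class CompressedSerialization {

    static byte[] serializeAndCompress(Serializable obj) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream oos =
                new ObjectOutputStream(new GZIPOutputStream(bytes))) {
            oos.writeObject(obj);
        }
        // Stream is closed here, so the GZIP trailer has been flushed.
        return bytes.toByteArray();
    }

    static Object decompressAndDeserialize(byte[] data)
            throws IOException, ClassNotFoundException {
        try (ObjectInputStream ois = new ObjectInputStream(
                new GZIPInputStream(new ByteArrayInputStream(data)))) {
            return ois.readObject();
        }
    }

    public static void main(String[] args) throws Exception {
        String page = "<html><body>crawled content</body></html>";
        byte[] stored = serializeAndCompress(page);          // would be a Put
        Object restored = decompressAndDeserialize(stored);  // would be a Get
        System.out.println(page.equals(restored));           // prints "true"
    }
}
```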

In the past this has revealed many bugs -- the most serious being dead space on the heap held by I/O buffers that grew but never released their allocations -- and operational considerations. I hit them all: file descriptors, xcievers, I/O saturation leading to ZK aborts, compaction storms, memstore flush gating, etc.
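For reference on two of the limits above, clusters of this vintage typically raise the DataNode transceiver cap and the per-user open-file limit before running such a test; the value below is illustrative, not a recommendation:

```xml
<!-- hdfs-site.xml: raise the DataNode transceiver ceiling. The property
     name really is spelled "xcievers" in Hadoop of this era; 4096 is
     just an example value. -->
<property>
  <name>dfs.datanode.max.xcievers</name>
  <value>4096</value>
</property>
```

The file-descriptor limit is raised outside Hadoop, e.g. with `nofile` entries in /etc/security/limits.conf for the user running the HDFS and HBase daemons.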

See the subtasks on HBASE-1961 for the issues that must be addressed first to make easy, fully scripted full-system testing up on EC2 possible.

    - Andy