Posted to user@hbase.apache.org by Josh Elser <el...@apache.org> on 2019/06/02 01:43:39 UTC

Re: Scan vs TableInputFormat to process data

Hi Guillermo,

Yes, you are missing something.

TableInputFormat uses the Scan API just like Spark would.
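As a sketch of what that means in practice (table, family, and filter names below are illustrative, not from this thread): the Scan you build is serialized into the job configuration, and the RegionServers apply the family restriction and filter server-side, exactly as they would for a direct client Scan.

```java
// Sketch: reading an HBase table from Spark through TableInputFormat.
// The configured Scan (family, filter, caching) is honored by the
// RegionServers just like a direct client Scan would be.
// "my_table", "cf1", and the row prefix are illustrative names.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.filter.PrefixFilter;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableInputFormat;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class ScanWithTableInputFormat {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    conf.set(TableInputFormat.INPUT_TABLE, "my_table");

    // Build the Scan and serialize it into the job configuration.
    Scan scan = new Scan();
    scan.addFamily(Bytes.toBytes("cf1"));                    // only one family
    scan.setFilter(new PrefixFilter(Bytes.toBytes("row-"))); // server-side filter
    scan.setCaching(500);
    conf.set(TableInputFormat.SCAN, TableMapReduceUtil.convertScanToString(scan));

    // One Spark partition per table region; each reads through a RegionServer.
    JavaSparkContext sc = new JavaSparkContext("local[*]", "hbase-scan");
    JavaPairRDD<ImmutableBytesWritable, Result> rdd =
        sc.newAPIHadoopRDD(conf, TableInputFormat.class,
            ImmutableBytesWritable.class, Result.class);
    System.out.println("rows: " + rdd.count());
    sc.close();
  }
}
```

So column-family selection and pushed-down filters are not lost with TableInputFormat; the work is still done by the RegionServers.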

Bypassing the RegionServer and reading from HFiles directly is 
accomplished by using TableSnapshotInputFormat. You can only read 
HFiles directly when you are working from a snapshot, as there are 
concurrency issues with respect to the lifecycle of HFiles managed by HBase. It is 
not safe to try to read HFiles underneath HBase on your own unless you are 
confident you understand all the edge cases in how HBase manages its files.
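A minimal configuration sketch of the snapshot path (snapshot, table, and directory names are illustrative): the snapshot pins the HFiles so compactions or splits on the live table cannot delete them out from under the job, and the input format then reads region data from HDFS without touching the RegionServers.

```java
// Sketch: reading HFiles directly via a snapshot (names are illustrative).
// Requires a running HBase cluster; this is a configuration sketch,
// not a standalone runnable program.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableSnapshotInputFormat;
import org.apache.hadoop.mapreduce.Job;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class SnapshotScanJob {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();

    // 1) Take a snapshot of the live table (or reuse an existing one).
    try (Connection conn = ConnectionFactory.createConnection(conf);
         Admin admin = conn.getAdmin()) {
      admin.snapshot("my_table_snap", TableName.valueOf("my_table"));
    }

    // 2) Point the job at the snapshot. restoreDir is a scratch directory
    //    on HDFS where snapshot references are materialized.
    Job job = Job.getInstance(conf, "snapshot-scan");
    Path restoreDir = new Path("/tmp/snap-restore");
    TableSnapshotInputFormat.setInput(job, "my_table_snap", restoreDir);

    // 3) Read the snapshot from Spark; HFiles are read from HDFS directly,
    //    bypassing the RegionServers entirely.
    JavaSparkContext sc = new JavaSparkContext("local[*]", "snapshot-scan");
    JavaPairRDD<ImmutableBytesWritable, Result> rdd =
        sc.newAPIHadoopRDD(job.getConfiguration(), TableSnapshotInputFormat.class,
            ImmutableBytesWritable.class, Result.class);
    System.out.println("rows: " + rdd.count());
    sc.close();
  }
}
```

Note the trade-off: the snapshot is a point-in-time view, so writes to the live table after the snapshot are not visible to the job.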

On 5/29/19 2:54 AM, Guillermo Ortiz Fernández wrote:
> Just to be sure: if I execute a Scan inside Spark, the execution is going
> through the RegionServers and I get all the features of HBase/Scan (filters and
> so on), and all the parallelization is handled by the RegionServers (even
> though I'm running the program with Spark).
> If I use TableInputFormat, I read all the column families (even if I don't
> want to), with no prior filtering either; it just opens the files of an HBase
> table and processes them completely. All the parallelization is in Spark and
> doesn't use HBase at all; it just reads from HDFS the files that HBase stored
> for a specific table.
> 
> Am I missing something?
> 

Re: Scan vs TableInputFormat to process data

Posted by Jean-Marc Spaggiari <je...@spaggiari.org>.
Also, keep in mind that by bypassing the RegionServer you also bypass the
security rules...

JMS
