Posted to user@hbase.apache.org by Josh Elser <el...@apache.org> on 2019/06/02 01:43:39 UTC
Re: Scan vs TableInputFormat to process data
Hi Guillermo,
Yes, you are missing something.
TableInputFormat uses the Scan API just like Spark would.
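To illustrate the point: a Scan with column-family restrictions and server-side filters can be serialized into the job configuration that TableInputFormat reads. A minimal sketch, assuming HBase's `org.apache.hadoop.hbase.mapreduce` API; the table name `my_table` and the column family/qualifier/value names are placeholders:

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.filter.CompareFilter.CompareOp;
import org.apache.hadoop.hbase.filter.SingleColumnValueFilter;
import org.apache.hadoop.hbase.mapreduce.TableInputFormat;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.util.Bytes;

public class ScanThroughInputFormat {
  public static Configuration buildConf() throws IOException {
    // Restrict the scan to one column family and add a server-side filter.
    // Because TableInputFormat issues this Scan through the RegionServers,
    // the filter is evaluated inside HBase, not in Spark.
    Scan scan = new Scan();
    scan.addFamily(Bytes.toBytes("cf1"));
    scan.setFilter(new SingleColumnValueFilter(
        Bytes.toBytes("cf1"), Bytes.toBytes("qual"),
        CompareOp.EQUAL, Bytes.toBytes("some-value")));

    Configuration conf = HBaseConfiguration.create();
    conf.set(TableInputFormat.INPUT_TABLE, "my_table");
    // TableInputFormat deserializes the Scan from this config key,
    // so all Scan features (filters, family selection) apply.
    conf.set(TableInputFormat.SCAN,
        TableMapReduceUtil.convertScanToString(scan));
    return conf;
  }
}
```

From Spark, this configuration would typically be passed to `sc.newAPIHadoopRDD(conf, TableInputFormat.class, ImmutableBytesWritable.class, Result.class)`, so the reads still go through the RegionServers.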
Bypassing the RegionServer and reading from HFiles directly is
accomplished by using the TableSnapshotInputFormat. You can only read
from HFiles directly when you are using a Snapshot, as there are
concurrency issues WRT the lifecycle of HFiles managed by HBase. It is
not safe to try to read HFiles underneath HBase on your own unless you are
confident you understand all the edge cases in how HBase manages files.
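For reference, a minimal sketch of the snapshot-based path, assuming HBase's `TableSnapshotInputFormat`; the snapshot name and restore directory are placeholders, and the snapshot must already exist (e.g. created in the HBase shell with `snapshot 'my_table', 'my_snapshot'`):

```java
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.mapreduce.TableSnapshotInputFormat;
import org.apache.hadoop.mapreduce.Job;

public class SnapshotScanJob {
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(HBaseConfiguration.create());
    // restoreDir is a scratch directory on HDFS (outside hbase.rootdir)
    // where references to the snapshot's HFiles are materialized.
    TableSnapshotInputFormat.setInput(job, "my_snapshot",
        new Path("/tmp/snapshot_restore"));
    job.setInputFormatClass(TableSnapshotInputFormat.class);
    // Input splits now read HFiles directly from HDFS, bypassing the
    // RegionServers. The snapshot pins the HFiles, which is what makes
    // this safe despite HBase compacting and deleting files underneath.
  }
}
```

Because the snapshot holds the HFiles in place, this avoids the lifecycle races described above; but, as noted later in the thread, it also bypasses the RegionServer's security checks.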
On 5/29/19 2:54 AM, Guillermo Ortiz Fernández wrote:
> Just to be sure: if I execute a Scan inside Spark, the execution goes
> through the RegionServers and I get all the features of HBase/Scan (filters
> and so on), and all the parallelization is handled by the RegionServers
> (even though I'm running the program with Spark).
> If I use TableInputFormat, I read all the column families (even if I don't
> want to), with no prior filtering either; it just opens the files of an
> HBase table and processes them completely. All the parallelization is in
> Spark, and it doesn't use HBase at all; it just reads from HDFS the files
> that HBase stored for a specific table.
>
> Am I missing something?
>
Re: Scan vs TableInputFormat to process data
Posted by Jean-Marc Spaggiari <je...@spaggiari.org>.
Also, keep in mind that by bypassing the RegionServer you also bypass the
security rules...
JMS