Posted to dev@gora.apache.org by al...@aim.com on 2013/02/27 01:44:44 UTC

gora-hbase query

Hello,

Can someone point me to the code in gora-hbase that queries HBase and populates the Nutch map key/value pairs for the various Nutch jobs?
I plan to use SingleColumnValueFilter to see whether it selects only a subset of records.

Thanks.
Alex.

 

Re: gora-hbase query

Posted by Renato Marroquín Mogrovejo <re...@gmail.com>.
Hi Alex,

I think I understand what you are asking about.
As you know, the query object is sent to the mappers through the conf
object and then used in the GoraRecordReader [2]. I think this is the
most important class for this. The class where we actually query HBase
is HBaseStore [1].
There is a really simple example in [3] of using Gora's MapReduce
support, but I don't think it is what you are looking for. You
have just pointed out a really important issue here: we should
probably create some simple examples of how to use it. If you feel
like tackling this, that would be awesome, man!
Let us know if we can help you work this out.
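
As a rough illustration of the flow described above (a minimal sketch, not Gora's actual code): the reader created in each map task holds the query and only runs it when the framework asks for the first record. MiniQuery, MiniRecordReader and drainKeys are hypothetical stand-ins invented for this sketch, and an in-memory TreeMap plays the part of the HBase table. The real logic lives in GoraRecordReader [2] and HBaseStore [1].

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class LazyQueryReaderSketch {

    /** Hypothetical stand-in for a query object: it can run itself against its store. */
    interface MiniQuery<K, V> {
        Iterator<Map.Entry<K, V>> execute();
    }

    /** Hypothetical stand-in for the record reader that a map task drives. */
    static final class MiniRecordReader<K, V> {
        private final MiniQuery<K, V> query;
        private Iterator<Map.Entry<K, V>> result; // stays null until the first read
        private Map.Entry<K, V> current;

        MiniRecordReader(MiniQuery<K, V> query) {
            this.query = query;
        }

        /** Runs the query on the first call, then advances through its results. */
        boolean nextKeyValue() {
            if (result == null) {
                result = query.execute(); // the store is contacted here, not at job setup
            }
            if (!result.hasNext()) {
                return false;
            }
            current = result.next();
            return true;
        }

        K getCurrentKey() {
            return current.getKey();
        }

        V getCurrentValue() {
            return current.getValue();
        }
    }

    /** Plays the role of the framework loop that feeds key/value pairs to map(). */
    static <K, V> List<K> drainKeys(MiniRecordReader<K, V> reader) {
        List<K> keys = new ArrayList<>();
        while (reader.nextKeyValue()) {
            keys.add(reader.getCurrentKey());
        }
        return keys;
    }

    public static void main(String[] args) {
        // An in-memory sorted map stands in for the HBase table.
        Map<String, String> table = new TreeMap<>();
        table.put("com.example:http/", "fetched");
        table.put("org.example:http/", "unfetched");

        MiniQuery<String, String> query = () -> table.entrySet().iterator();
        MiniRecordReader<String, String> reader = new MiniRecordReader<>(query);
        for (String key : drainKeys(reader)) {
            System.out.println(key);
        }
    }
}
```

The design point is that building the reader is cheap and touches no data. The query runs once, on the first nextKeyValue call, which is why tracing the job setup code alone does not lead to a store call.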


Renato M.

[1] https://github.com/renato2099/gora/blob/trunk/gora-hbase/src/main/java/org/apache/gora/hbase/store/HBaseStore.java
[2] https://github.com/renato2099/gora/blob/trunk/gora-core/src/main/java/org/apache/gora/mapreduce/GoraRecordReader.java
[3] https://github.com/renato2099/gora/blob/trunk/gora-core/src/examples/java/org/apache/gora/examples/mapreduce/QueryCounter.java

2013/3/1  <al...@aim.com>:
> Hi Renato,
>
>> So once in the Mappers, we will use the query object to perform the
>> data retrieval operation from the specific data store.
>> Hope this helps man.
>
> I need to see the code that does what you describe above.
>
> The setQuery function calls IOUtils.storeToConf(query, job.getConfiguration(), QUERY_KEY);
>
> and the function
>
>   public static <T> void storeToConf(T obj, Configuration conf, String dataKey)
>       throws IOException {
>     String classKey = dataKey + "._class";
>     conf.set(classKey, obj.getClass().getCanonicalName());
>     DefaultStringifier.store(conf, obj, dataKey);
>   }
>
> simply stores the query in the job configuration.
>
> Where is the call to HBase that retrieves the data, then?
>
>
> Thanks.
> Alex.
>
>
>
>
>
> -----Original Message-----
> From: Renato Marroquín Mogrovejo <re...@gmail.com>
> To: Gora Dev <de...@gora.apache.org>
> Sent: Thu, Feb 28, 2013 9:47 pm
> Subject: Re: gora-hbase query
>
>
> Hi Alex,
>
> My answers are inline.
>
>
> 2013/2/27  <al...@aim.com>:
>> Hi,
>>
>> I am mostly interested in the fetcher job. In this job I see this code:
>>
>> StorageUtils.initMapperJob(currentJob, fields, IntWritable.class, FetchEntry.class, FetcherMapper.class, FetchEntryPartitioner.class, false);
>>
>> In StorageUtils, this function contains:
>>
>> DataStore<String, WebPage> store = createWebStore(job.getConfiguration(), String.class, WebPage.class);
>> if (store == null) throw new RuntimeException("Could not create datastore");
>> Query<String, WebPage> query = store.newQuery();
>> query.setFields(toStringArray(fields));
>> GoraMapper.initMapperJob(job, query, store, outKeyClass, outValueClass, mapperClass, partitionerClass, reuseObjects);
>
> So what you are doing is starting a MapReduce job that uses the
> Query object to get the data out of a specific data store [1].
> Therefore, all the magic happens within the GoraMapper code. Look
> at the initMapperJob method:
>
> {code}
>   @SuppressWarnings("rawtypes")
>   public static <K1, V1 extends Persistent, K2, V2> void initMapperJob(
>       Job job,
>       Query<K1,V1> query,
>       DataStore<K1,V1> dataStore,
>       Class<K2> outKeyClass,
>       Class<V2> outValueClass,
>       Class<? extends GoraMapper> mapperClass,
>       Class<? extends Partitioner> partitionerClass,
>       boolean reuseObjects) throws IOException {
>     //set the input via GoraInputFormat
>     GoraInputFormat.setInput(job, query, dataStore, reuseObjects);
>
>     job.setMapperClass(mapperClass);
>     job.setMapOutputKeyClass(outKeyClass);
>     job.setMapOutputValueClass(outValueClass);
>
>     if (partitionerClass != null) {
>       job.setPartitionerClass(partitionerClass);
>     }
>   }
> {\code}
>
> Then the method that continues the work is GoraInputFormat.setInput [2],
> which uses the setQuery method to pass this object through the job
> configuration to all mappers. The mappers then perform the query (yes,
> the regular query we define to get data from data stores).
>
> {code}
>
>  public static<K, T extends Persistent> void setQuery(Job job
>       , Query<K, T> query) throws IOException {
>     IOUtils.storeToConf(query, job.getConfiguration(), QUERY_KEY);
>   }
>
> {\code}
>
>> I followed all these functions but did not find the actual code that sends the query to the HBase table.
>> I believe it is somewhere in gora-hbase.
>
> So once in the Mappers, we will use the query object to perform the
> data retrieval operation from the specific data store.
> Hope this helps man.
>
>
> Renato M.
>
>> Thanks.
>> Alex.
>>
>
> [1] http://gora.apache.org/docs/current/apidocs-0.2.1/org/apache/gora/mapreduce/GoraMapper.html#initMapperJob(org.apache.hadoop.mapreduce.Job,
> org.apache.gora.query.Query, org.apache.gora.store.DataStore,
> java.lang.Class, java.lang.Class, java.lang.Class, boolean)
>
> [2] https://github.com/renato2099/gora/blob/trunk/gora-core/src/main/java/org/apache/gora/mapreduce/GoraInputFormat.java
>>
>>
>>
>>
>> -----Original Message-----
>> From: Renato Marroquín Mogrovejo <re...@gmail.com>
>> To: Gora Dev <de...@gora.apache.org>
>> Sent: Tue, Feb 26, 2013 8:01 pm
>> Subject: Re: gora-hbase query
>>
>>
>> Hi Alex,
>>
>> The Gora-HBase module is only in charge of querying and persisting
>> data from anywhere, not only Nutch. That being said, do you want the
>> part where Nutch populates a map used in different Nutch jobs? Which
>> jobs are you talking about? Generator? Fetcher? You can probably get
>> some more light on this over in NutchLand.
>> I am happy to go over the code with you anyway; just please be a
>> little bit more specific.
>>
>>
>> Renato M.
>>
>> 2013/2/26  <al...@aim.com>:
>>>
>>> Hello,
>>>
>>> Can someone point me to the code in gora-hbase that queries HBase and populates the Nutch map key/value pairs for the various Nutch jobs?
>>> I plan to use SingleColumnValueFilter to see whether it selects only a subset of records.
>>>
>>> Thanks.
>>> Alex.
>>>
>>>
>>
>>
>
>

Re: gora-hbase query

Posted by al...@aim.com.
Hi Renato,

> So once in the Mappers, we will use the query object to perform the
> data retrieval operation from the specific data store.
> Hope this helps man.

I need to see the code that does what you describe above.

The setQuery function calls IOUtils.storeToConf(query, job.getConfiguration(), QUERY_KEY);

and the function

  public static <T> void storeToConf(T obj, Configuration conf, String dataKey)
      throws IOException {
    String classKey = dataKey + "._class";
    conf.set(classKey, obj.getClass().getCanonicalName());
    DefaultStringifier.store(conf, obj, dataKey);
  }

simply stores the query in the job configuration.

Where is the call to HBase that retrieves the data, then?


Thanks.
Alex.

 

 

 

Re: gora-hbase query

Posted by Renato Marroquín Mogrovejo <re...@gmail.com>.
Hi Alex,

My answers are inline.


2013/2/27  <al...@aim.com>:
> Hi,
>
> I am mostly interested in the fetcher job. In this job I see this code:
>
> StorageUtils.initMapperJob(currentJob, fields, IntWritable.class, FetchEntry.class, FetcherMapper.class, FetchEntryPartitioner.class, false);
>
> In StorageUtils, this function contains:
>
> DataStore<String, WebPage> store = createWebStore(job.getConfiguration(), String.class, WebPage.class);
> if (store == null) throw new RuntimeException("Could not create datastore");
> Query<String, WebPage> query = store.newQuery();
> query.setFields(toStringArray(fields));
> GoraMapper.initMapperJob(job, query, store, outKeyClass, outValueClass, mapperClass, partitionerClass, reuseObjects);

So what you are doing is starting a MapReduce job that uses the
Query object to get the data out of a specific data store [1].
Therefore, all the magic happens within the GoraMapper code. Look
at the initMapperJob method:

{code}
  @SuppressWarnings("rawtypes")
  public static <K1, V1 extends Persistent, K2, V2> void initMapperJob(
      Job job,
      Query<K1,V1> query,
      DataStore<K1,V1> dataStore,
      Class<K2> outKeyClass,
      Class<V2> outValueClass,
      Class<? extends GoraMapper> mapperClass,
      Class<? extends Partitioner> partitionerClass,
      boolean reuseObjects) throws IOException {
    //set the input via GoraInputFormat
    GoraInputFormat.setInput(job, query, dataStore, reuseObjects);

    job.setMapperClass(mapperClass);
    job.setMapOutputKeyClass(outKeyClass);
    job.setMapOutputValueClass(outValueClass);

    if (partitionerClass != null) {
      job.setPartitionerClass(partitionerClass);
    }
  }
{\code}

Then the method that continues the work is GoraInputFormat.setInput [2],
which uses the setQuery method to pass this object through the job
configuration to all mappers. The mappers then perform the query (yes,
the regular query we define to get data from data stores).

{code}

 public static<K, T extends Persistent> void setQuery(Job job
      , Query<K, T> query) throws IOException {
    IOUtils.storeToConf(query, job.getConfiguration(), QUERY_KEY);
  }

{\code}
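
To illustrate the hand-off (a minimal sketch under assumptions, not the Gora implementation): the client stores the object's class name under dataKey + "._class" plus a string form of the object under dataKey, and the task side rebuilds the object from those two entries. A plain Map stands in for the Hadoop Configuration, Java serialization with Base64 stands in for DefaultStringifier, and MiniQuery, loadFromConf, the field names and the key "gora.sketch.query" are hypothetical names made up for this example.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Base64;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class ConfRoundTripSketch {

    /** Hypothetical, serializable stand-in for a query: just the fields to fetch. */
    static final class MiniQuery implements Serializable {
        private static final long serialVersionUID = 1L;
        final List<String> fields;

        MiniQuery(String... fields) {
            this.fields = new ArrayList<>(Arrays.asList(fields));
        }
    }

    /** Client side: write the class name and a string form of the object into the conf. */
    static void storeToConf(Serializable obj, Map<String, String> conf, String dataKey) {
        // getName() rather than getCanonicalName(), so Class.forName also works for nested classes.
        conf.put(dataKey + "._class", obj.getClass().getName());
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(obj);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        conf.put(dataKey, Base64.getEncoder().encodeToString(bytes.toByteArray()));
    }

    /** Task side: rebuild the object from the two conf entries. */
    static <T> T loadFromConf(Map<String, String> conf, String dataKey, Class<T> expected) {
        byte[] raw = Base64.getDecoder().decode(conf.get(dataKey));
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(raw))) {
            Class<?> stored = Class.forName(conf.get(dataKey + "._class"));
            return expected.cast(stored.cast(in.readObject()));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        Map<String, String> conf = new HashMap<>();
        storeToConf(new MiniQuery("status", "fetchTime"), conf, "gora.sketch.query");

        // Only strings travel with the job; no data store has been contacted so far.
        MiniQuery restored = loadFromConf(conf, "gora.sketch.query", MiniQuery.class);
        System.out.println(restored.fields);
    }
}
```

Nothing in this hand-off talks to a data store. As far as this thread describes it, the store is only queried later, once the task side has rebuilt the query and executes it.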

> I followed all these functions but did not find the actual code that sends the query to the HBase table.
> I believe it is somewhere in gora-hbase.

So once in the Mappers, we will use the query object to perform the
data retrieval operation from the specific data store.
Hope this helps man.


Renato M.

> Thanks.
> Alex.
>

[1] http://gora.apache.org/docs/current/apidocs-0.2.1/org/apache/gora/mapreduce/GoraMapper.html#initMapperJob(org.apache.hadoop.mapreduce.Job,
org.apache.gora.query.Query, org.apache.gora.store.DataStore,
java.lang.Class, java.lang.Class, java.lang.Class, boolean)

[2] https://github.com/renato2099/gora/blob/trunk/gora-core/src/main/java/org/apache/gora/mapreduce/GoraInputFormat.java

Re: gora-hbase query

Posted by al...@aim.com.
Hi,

I am mostly interested in the fetcher job. In this job I see this code:

StorageUtils.initMapperJob(currentJob, fields, IntWritable.class, FetchEntry.class, FetcherMapper.class, FetchEntryPartitioner.class, false);

In StorageUtils, this function contains:

DataStore<String, WebPage> store = createWebStore(job.getConfiguration(), String.class, WebPage.class);
if (store == null) throw new RuntimeException("Could not create datastore");
Query<String, WebPage> query = store.newQuery();
query.setFields(toStringArray(fields));
GoraMapper.initMapperJob(job, query, store, outKeyClass, outValueClass, mapperClass, partitionerClass, reuseObjects);

I followed all these functions but did not find the actual code that sends the query to the HBase table.
I believe it is somewhere in gora-hbase.

Thanks.
Alex.


 

 

 

Re: gora-hbase query

Posted by Renato Marroquín Mogrovejo <re...@gmail.com>.
Hi Alex,

The Gora-HBase module is only in charge of querying and persisting
data from anywhere, not only Nutch. That being said, do you want the
part where Nutch populates a map used in different Nutch jobs? Which
jobs are you talking about? Generator? Fetcher? You can probably get
some more light on this over in NutchLand.
I am happy to go over the code with you anyway; just please be a
little bit more specific.


Renato M.
