
Posted to user@gora.apache.org by Lewis John Mcgibbney <le...@gmail.com> on 2013/02/21 20:28:59 UTC

Definition of reuseObjects parameters

Hi,

I am really unclear about the above parameter.
For instance, when using GoraMapper.initMapperJob() as follows:

  /**
   * Initializes the Mapper, and sets input parameters for the job
   * @param job the job to set the properties for
   * @param query the query to get the inputs from
   * @param dataStore the datastore as the input
   * @param outKeyClass Map output key class
   * @param outValueClass Map output value class
   * @param mapperClass the mapper class extending GoraMapper
   * @param partitionerClass optional partitioner class
   * @param reuseObjects whether to reuse objects in serialization
   */
  @SuppressWarnings("rawtypes")
  public static <K1, V1 extends Persistent, K2, V2> void initMapperJob(
      Job job,
      Query<K1,V1> query,
      DataStore<K1,V1> dataStore,
      Class<K2> outKeyClass,
      Class<V2> outValueClass,
      Class<? extends GoraMapper> mapperClass,
      Class<? extends Partitioner> partitionerClass,
      boolean reuseObjects) throws IOException {
    //set the input via GoraInputFormat
    GoraInputFormat.setInput(job, query, dataStore, reuseObjects);

    job.setMapperClass(mapperClass);
    job.setMapOutputKeyClass(outKeyClass);
    job.setMapOutputValueClass(outValueClass);

    if (partitionerClass != null) {
      job.setPartitionerClass(partitionerClass);
    }
  }

What benefit does setting the boolean value to true or false provide for
us? I am not clear about this.
In Nutch 2.x, the GeneratorJob sets this switch to true whereas the
FetcherJob sets this to false!

Can someone explain this so that we can document it more thoroughly?

Thanks
Lewis

-- 
*Lewis*

Re: Definition of reuseObjects parameters

Posted by Renato Marroquín Mogrovejo <re...@gmail.com>.

Hi Lewis,

I think this has to do with [1]: it decides whether or not we create a
new object every time we emit data from the mapper to the reducer, or
from the reducer to the output. For example, creating 10 million objects
is far more expensive than setting different values 10 million times on
a single object. I bet [1] explains this better than I am doing here.
So the GeneratorJob generates the URLs to be fetched, and the
FetcherJob actually gets all this data, is that right? If so, then the
GeneratorJob decision makes sense, and maybe in the Fetcher we need to
keep references to the objects, which is why we don't want to use a
single one.
Anyway, I am just guessing on this last part and am not sure that is
actually how it happens. I will look at the code tomorrow just to be
sure. Hope it helps.
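
To make the trade-off concrete, here is a minimal, self-contained sketch in
plain Java. It does not use Gora or Hadoop, and it is not how Gora implements
reuseObjects; the names Page and bufferedUrls are hypothetical and exist only
for illustration. It assumes the flag behaves like the object reuse described
above: one instance is refilled for every record instead of allocating a new
one. Under that assumption, reuse is safe when each record is consumed
immediately, but any code that keeps references to earlier records ends up
seeing only the last value.

```java
import java.util.ArrayList;
import java.util.List;

final class ObjectReuseSketch {

    /** Hypothetical mutable record standing in for a persistent bean. */
    static final class Page {
        String url;
    }

    /**
     * Fills a Page for each URL and buffers the reference, the way a job
     * that queues records for later processing might. With reuseObjects
     * set, the same Page instance is refilled on every iteration.
     */
    static List<String> bufferedUrls(String[] urls, boolean reuseObjects) {
        List<Page> buffer = new ArrayList<>();
        Page shared = new Page();
        for (String url : urls) {
            // Reuse saves one allocation per record, but every buffered
            // reference then points at the same underlying object.
            Page page = reuseObjects ? shared : new Page();
            page.url = url;
            buffer.add(page);
        }
        // Read the buffer back only after all records have been seen.
        List<String> seen = new ArrayList<>();
        for (Page p : buffer) {
            seen.add(p.url);
        }
        return seen;
    }

    public static void main(String[] args) {
        String[] urls = {"a", "b", "c"};
        System.out.println(bufferedUrls(urls, false)); // prints [a, b, c]
        System.out.println(bufferedUrls(urls, true));  // prints [c, c, c]
    }
}
```

If a job does need to hold on to records while reuse is enabled, copying each
record before buffering it avoids the aliasing, at the cost of the allocations
that reuse was meant to save.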


Renato M.

[1] http://wikidoop.com/wiki/Hadoop/ObjectReuse
