Posted to user@hive.apache.org by Yin Huai <hu...@gmail.com> on 2012/01/10 04:16:22 UTC

Re: RCFile in java MapReduce

I have some experiences using RCFile with new MapReduce API from the
project HCatalog ( http://incubator.apache.org/hcatalog/ ).

For the output part,
In your main, you need ...

    job.setOutputFormatClass(RCFileMapReduceOutputFormat.class);

    // numCols is the total number of columns of your output table
    RCFileMapReduceOutputFormat.setColumnNumber(job.getConfiguration(), numCols);

    RCFileMapReduceOutputFormat.setOutputPath(job, new Path(outputPath));

    RCFileMapReduceOutputFormat.setCompressOutput(job, true);
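For context, a complete driver around those calls might look like the sketch
below. This is an illustration, not code from this thread: the class name
TextToRCFile, the input/output paths in args, the 4-column table, and the
package org.apache.hcatalog.rcfile for RCFileMapReduceOutputFormat are all
assumptions to check against your HCatalog jar.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.hive.serde2.columnar.BytesRefArrayWritable;
import org.apache.hcatalog.rcfile.RCFileMapReduceOutputFormat;

public class TextToRCFile {
  public static void main(String[] args) throws Exception {
    int numCols = 4; // placeholder: total number of columns in the output table
    Configuration conf = new Configuration();
    Job job = new Job(conf, "text-to-rcfile");
    job.setJarByClass(TextToRCFile.class);
    job.setMapperClass(Map.class);    // the Map class shown in this message
    job.setNumReduceTasks(0);         // map-only conversion, no reduce needed
    job.setOutputKeyClass(NullWritable.class);
    job.setOutputValueClass(BytesRefArrayWritable.class);
    job.setOutputFormatClass(RCFileMapReduceOutputFormat.class);
    RCFileMapReduceOutputFormat.setColumnNumber(job.getConfiguration(), numCols);
    RCFileMapReduceOutputFormat.setOutputPath(job, new Path(args[1]));
    RCFileMapReduceOutputFormat.setCompressOutput(job, true);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```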
The Map class would look like ...

    public static class Map
        extends Mapper<Object, Text, NullWritable, BytesRefArrayWritable> {

      private byte[] fieldData;
      private int numCols;
      private BytesRefArrayWritable bytes;

      @Override
      protected void setup(Context context)
          throws IOException, InterruptedException {
        numCols = context.getConfiguration()
            .getInt("hive.io.rcfile.column.number.conf", 0);
        bytes = new BytesRefArrayWritable(numCols);
      }

      public void map(Object key, Text line, Context context)
          throws IOException, InterruptedException {
        bytes.clear();
        String[] cols = line.toString().split("\\|");
        for (int i = 0; i < numCols; i++) {
          fieldData = cols[i].getBytes("UTF-8");
          BytesRefWritable cu = new BytesRefWritable(fieldData, 0, fieldData.length);
          bytes.set(i, cu);
        }
        context.write(NullWritable.get(), bytes);
      }
    }
>
Basically, you need to convert each row into a BytesRefArrayWritable object
(bytes in the example above).
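The per-row conversion is plain Java: split the delimited row and encode each
field to UTF-8 bytes. The helper below is a small sketch of just that logic
with the Hadoop writables left out, so the byte handling can be sanity-checked
without a cluster; the class name RowSplit and the empty-field padding for
short rows are my additions, not part of the original mapper.

```java
import java.io.UnsupportedEncodingException;

public class RowSplit {
    // Split a '|'-delimited row into per-column UTF-8 byte arrays,
    // mirroring what the mapper stores into BytesRefArrayWritable.
    static byte[][] toColumns(String line, int numCols)
            throws UnsupportedEncodingException {
        String[] cols = line.split("\\|");
        byte[][] out = new byte[numCols][];
        for (int i = 0; i < numCols; i++) {
            // Guard against short rows: missing trailing columns
            // become empty fields instead of throwing.
            String field = i < cols.length ? cols[i] : "";
            out[i] = field.getBytes("UTF-8");
        }
        return out;
    }

    public static void main(String[] args) throws Exception {
        byte[][] cols = toColumns("1|alice|3.5", 3);
        System.out.println(cols.length);                  // 3
        System.out.println(new String(cols[1], "UTF-8")); // alice
    }
}
```

Note that the original mapper indexes cols[i] directly, so a row with fewer
than numCols fields would throw ArrayIndexOutOfBoundsException; padding with
empty fields, as above, is one way to handle ragged input.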

For the input part, I do not know how to use RCFileMapReduceInputFormat to
write a MapReduce job for a join operation, so I wrote a custom InputFormat
and RecordReader.
You can find these two classes (MultiRCFileMapReduceInputFormat and
MultiRCFileMapReduceRecordReader) at
http://www.cse.ohio-state.edu/~huai/RCFile/ .
At that link, TestPrintTables.java is an example program that you can use to
convert tables in RCFile format to text. I hope this example is
self-explanatory.
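For a simple scan (no join), HCatalog's RCFileMapReduceInputFormat can be
wired up symmetrically to the output side. The mapper below is a hedged
sketch in the spirit of TestPrintTables, printing each row as pipe-delimited
text; the LongWritable key type and the BytesRefWritable accessors are taken
from my reading of the HCatalog/Hive sources, not from this thread, so verify
them against your version.

```java
import java.io.IOException;

import org.apache.hadoop.hive.serde2.columnar.BytesRefArrayWritable;
import org.apache.hadoop.hive.serde2.columnar.BytesRefWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Belongs inside a job class; pair it with
// job.setInputFormatClass(RCFileMapReduceInputFormat.class).
public static class PrintMap
    extends Mapper<LongWritable, BytesRefArrayWritable, NullWritable, Text> {

  public void map(LongWritable key, BytesRefArrayWritable row, Context context)
      throws IOException, InterruptedException {
    StringBuilder sb = new StringBuilder();
    for (int i = 0; i < row.size(); i++) {
      BytesRefWritable field = row.get(i);
      if (i > 0) sb.append('|');
      // Each field is a byte range [start, start+length) in a shared buffer.
      sb.append(new String(field.getData(), field.getStart(),
                           field.getLength(), "UTF-8"));
    }
    context.write(NullWritable.get(), new Text(sb.toString()));
  }
}
```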

Hope this helps.

Thanks,

Yin

On Wed, Dec 14, 2011 at 8:54 AM, Dominik Wiernicki <dw...@touk.pl> wrote:

> Hi,
>
> Can someone show me how to use RCfile in plain MapReduce job (as Input and
> Output Format)?
> Please.
>
>
>

Re: RCFile in java MapReduce

Posted by Aniket Mokashi <an...@gmail.com>.
A better way would be to mount a table on top of the RCFiles and use
http://incubator.apache.org/hcatalog/docs/r0.2.0/inputoutput.html#HCatInputFormat
But you will have to install and run an HCatalog server for it.

(Note: by default, HCatalog assumes the underlying storage is RCFile, so you
do not need to patch any metadata, i.e., you do not need to create the table
through hcat.)
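As a rough sketch of that route: the setInput signature and the
InputJobInfo.create factory below come from later HCatalog releases and may
differ in 0.2, and the "default"/"mytable" names and output path are
placeholders, so treat this purely as an assumption to check against the
linked docs.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hcatalog.data.HCatRecord;
import org.apache.hcatalog.mapreduce.HCatInputFormat;
import org.apache.hcatalog.mapreduce.InputJobInfo;

public class ReadViaHCat {
  // The mapper receives one HCatRecord per row; columns are accessed
  // by position, already deserialized into plain Java objects.
  public static class ReadMap
      extends Mapper<WritableComparable, HCatRecord, Text, Text> {
    @Override
    public void map(WritableComparable key, HCatRecord value, Context context)
        throws java.io.IOException, InterruptedException {
      context.write(new Text(String.valueOf(value.get(0))), new Text(""));
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = new Job(new Configuration(), "read-via-hcat");
    // Reads table "mytable" in database "default"; null means no filter.
    HCatInputFormat.setInput(job, InputJobInfo.create("default", "mytable", null));
    job.setInputFormatClass(HCatInputFormat.class);
    job.setJarByClass(ReadViaHCat.class);
    job.setMapperClass(ReadMap.class);
    job.setNumReduceTasks(0);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(Text.class);
    FileOutputFormat.setOutputPath(job, new Path(args[0]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```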

Thanks,
Aniket

On Mon, Jan 9, 2012 at 7:16 PM, Yin Huai <hu...@gmail.com> wrote:

> [quoted message above]


-- 
"...:::Aniket:::... Quetzalco@tl"