Posted to user@crunch.apache.org by unmesha sreeveni <un...@gmail.com> on 2015/02/19 11:22:52 UTC

part files in wordcount example

Hi

I am trying to understand the WordCount example:

import org.apache.crunch.DoFn;
import org.apache.crunch.Emitter;
import org.apache.crunch.PCollection;
import org.apache.crunch.PTable;
import org.apache.crunch.Pipeline;
import org.apache.crunch.impl.mr.MRPipeline;
import org.apache.crunch.lib.Aggregate;
import org.apache.crunch.types.writable.Writables;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class WordCount {
  public static void main(String[] args) throws Exception {
    String source = args[0];
    String dest = args[1];

    // Delete the output directory if it already exists.
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    if (fs.exists(new Path(dest))) {
      fs.delete(new Path(dest), true);
    }

    Pipeline pipeline = new MRPipeline(WordCount.class);
    PCollection<String> lines = pipeline.readTextFile(source);

    // Split each line into individual words.
    PCollection<String> words = lines.parallelDo("my splitter",
        new DoFn<String, String>() {
          @Override
          public void process(String line, Emitter<String> emitter) {
            for (String word : line.split("\\s+")) {
              emitter.emit(word);
            }
          }
        }, Writables.strings());

    // Count the occurrences of each word and write the result out as text.
    PTable<String, Long> counts = Aggregate.count(words);
    pipeline.writeTextFile(counts, dest);
    pipeline.run();
  }
}

1. When I ran this on a 1.8 GB text file, I got 2 part files as output, so
it seems the program ran with 2 reducers. Where is that specified? Or is it
decided automatically?
2. DoFn() seems similar to the mapper, reducer, and combiner. In the mapper
here we only emit the word, but in MapReduce we would emit (word, 1). How is
the aggregation done?
3. Where can I find good tutorials?

-- 
Thanks & Regards

Unmesha Sreeveni U.B
Hadoop, Bigdata Developer
Centre for Cyber Security | Amrita Vishwa Vidyapeetham
http://www.unmeshasreeveni.blogspot.in/

Re: part files in wordcount example

Posted by unmesha sreeveni <un...@gmail.com>.
Thanks Gabriel.

Re: part files in wordcount example

Posted by Gabriel Reid <ga...@gmail.com>.
Hi Unmesha,

Answers inlined below:

On Thu, Feb 19, 2015 at 11:22 AM, unmesha sreeveni
<un...@gmail.com> wrote:
>
> 1. When I ran this on a 1.8 GB text file, I got 2 part files as output, so
> it seems the program ran with 2 reducers. Where is that specified? Or is it
> decided automatically?

In this case, the number of reducers is decided by the size of your
input file and the crunch.bytes.per.reduce.task configuration value,
which is 1 GB by default. With an input file of 1.8 GB and 1 GB of
input per reducer, two reducers are used.
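
You can tune that value on the Configuration you pass to the pipeline. For
example, something like this (untested sketch; the 512 MB figure is just an
arbitrary choice) should plan more reducers for the same input:

    Configuration conf = new Configuration();
    // Ask Crunch to plan roughly one reduce task per 512 MB of input data.
    conf.setLong("crunch.bytes.per.reduce.task", 512L * 1024 * 1024);
    Pipeline pipeline = new MRPipeline(WordCount.class, conf);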

The Aggregate.count method is also overloaded to allow specifying the
number of partitions (or reducers) to be used without relying on data
size calculations.
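
For example (sketch; the partition count of 4 is arbitrary):

    // Force the count to use 4 reduce partitions regardless of input size.
    PTable<String, Long> counts = Aggregate.count(words, 4);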

> 2. DoFn() seems similar to the mapper, reducer, and combiner. In the mapper
> here we only emit the word, but in MapReduce we would emit (word, 1). How is
> the aggregation done?

The underlying function of this aggregation (implemented via
Aggregate.count) is the same -- it outputs (word, 1) pairs and then uses
a combiner and a reducer to arrive at a single count per word.
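
Roughly speaking, it's the equivalent of doing the following by hand (just a
sketch to illustrate the idea using MapFn, Pair and Aggregators from the
Crunch API, not the exact internals of Aggregate.count):

    // Map each word to a (word, 1) pair, then sum the counts per key.
    PTable<String, Long> ones = words.parallelDo(
        new MapFn<String, Pair<String, Long>>() {
          @Override
          public Pair<String, Long> map(String word) {
            return Pair.of(word, 1L);
          }
        }, Writables.tableOf(Writables.strings(), Writables.longs()));
    PTable<String, Long> counts =
        ones.groupByKey().combineValues(Aggregators.SUM_LONGS());

The combiner comes for free here: combineValues applies the aggregator both
map-side and reduce-side, which is why a single long count per word comes
out of the reducer.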

> 3. Where can I find good tutorials?

There is an initial "Getting started" page at
https://crunch.apache.org/getting-started.html -- that's probably the
best place to start. There is also an in-depth user guide at
https://crunch.apache.org/user-guide.html.

- Gabriel