Posted to user@whirr.apache.org by Olivier Grisel <ol...@ensta.org> on 2011/01/03 17:26:53 UTC

Some feedback and some questions

Hi all,

First, I would like to thank you for developing Whirr; I find it really
useful, with a very simple getting started guide.

Here is some feedback, along with a few questions, based on my first experience.

Context: I am using Whirr 0.2.0-incubating to set up a Hadoop cluster with
1 nn+jt node and 20 dn+tt nodes, in order to run Pig scripts that preprocess
Wikipedia / DBpedia dumps into Natural Language Processing training corpora.
I am using Whirr instead of ElasticMapReduce because I need some features of
Pig 0.8.0 that are not yet available on ElasticMapReduce.

The code is hosted on github ( https://github.com/ogrisel/pignlproc )
and I documented my use of Whirr on this wiki page:

  https://github.com/ogrisel/pignlproc/wiki/Running-pignlproc-scripts-on-a-EC2-Hadoop-cluster

Here are my comments and questions:

1- I need to pass my Amazon credentials in order to use hadoop distcp
between my S3 buckets and the cluster HDFS. I figured out that I could
edit the hadoop-site.xml file generated by Whirr to add them there, and
it works. However, I have to do this manually every time I create a new
cluster. Is there a way to avoid losing this configuration between
two runs?
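
As a stopgap I can also pass the credentials per invocation instead of
re-editing hadoop-site.xml; something like the following should work
(untested sketch, the bucket and path names are placeholders):

  # Pass the standard Hadoop S3 credential properties on the command line
  # instead of baking them into hadoop-site.xml (paths are placeholders)
  hadoop distcp \
    -D fs.s3n.awsAccessKeyId="$AWS_ACCESS_KEY_ID" \
    -D fs.s3n.awsSecretAccessKey="$AWS_SECRET_ACCESS_KEY" \
    s3n://my-bucket/wikipedia-dumps \
    hdfs:///user/root/wikipedia-dumps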

2- Some of my scripts have input on the order of 20 GB, and about the same
for output. The default HDFS replication factor triples that space
requirement. Furthermore, the NameNode web interface shows that the total
DFS capacity for the 20 nodes is only 130 GB. With the intermediate results
of the Pig scripts I can sometimes reach that limit. Since m1.small
instances come with 160 GB of instance storage each, would it be possible
to change the HDFS configuration to allocate more space to the DFS? Or is
there a Whirr configuration parameter somewhere to do that?
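
In the meantime the only partial workarounds I can think of are checking
what the datanodes actually report and lowering the replication factor on
bulky intermediate data; untested sketch (the output path is a placeholder):

  # Show the DFS capacity actually reported by the datanodes
  hadoop dfsadmin -report | head -n 20

  # Lower the replication factor of bulky intermediate output from the
  # default of 3 to 2 to stretch the available 130 GB
  hadoop fs -setrep -R 2 /user/root/intermediate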

3- The default Pig setup needs write permission on hdfs:///tmp for the
default user (root), and this is not granted by the default Whirr
configuration. I figured out that Pig 0.8.0 allows you to move that tmp
folder somewhere else (e.g. "SET pig.temp.dir /user/root/tmp"), but it
would be friendlier for Pig users if the default Whirr Hadoop cluster
configuration allowed Pig to work without such custom parameters. Should I
open a JIRA issue for this?

4- I noticed that despite telling Pig to "SET default_parallel 10", I get
at most 1 reducer for each of my jobs. I therefore suspect that Whirr's
Hadoop configuration is limiting this. How can I configure Whirr to start
the cluster with a higher number of reducers (e.g. the number of tt
nodes)?
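
As a sanity check I can also try requesting more reducers explicitly for a
single run from the command line; untested sketch (the script name is a
placeholder):

  # Ask for 10 reduce tasks for this run via a Hadoop property passed on
  # the Pig command line (assumes the -D value is picked up by the job
  # configuration, in addition to the SET default_parallel in the script)
  pig -Dmapred.reduce.tasks=10 -f preprocess.pig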

In summary, it seems that most of these questions could be addressed if
Whirr allowed passing configuration file snippets or individual Hadoop
property values. Is there a way to do this without having to maintain a
custom version of the Hadoop installation scripts?

Regards,

-- 
Olivier
http://twitter.com/ogrisel - http://github.com/ogrisel

Re: Some feedback and some questions

Posted by Olivier Grisel <ol...@ensta.org>.
2011/1/4 Tom White <to...@gmail.com>:
> Hi Olivier,
>
> Thanks for the feedback! I've responded inline below.

Thank you very much for your reply. Looking forward to the next release now :)

-- 
Olivier
http://twitter.com/ogrisel - http://github.com/ogrisel

Re: Some feedback and some questions

Posted by Tom White <to...@gmail.com>.
Hi Olivier,

Thanks for the feedback! I've responded inline below.

Cheers,
Tom

On Mon, Jan 3, 2011 at 8:26 AM, Olivier Grisel <ol...@ensta.org> wrote:
> Hi all,
>
> First, I would like to thank you for developing Whirr; I find it really
> useful, with a very simple getting started guide.
>
> Here is some feedback, along with a few questions, based on my first experience.
>
> Context: I am using Whirr 0.2.0-incubating to set up a Hadoop cluster with
> 1 nn+jt node and 20 dn+tt nodes, in order to run Pig scripts that preprocess
> Wikipedia / DBpedia dumps into Natural Language Processing training corpora.
> I am using Whirr instead of ElasticMapReduce because I need some features of
> Pig 0.8.0 that are not yet available on ElasticMapReduce.
>
> The code is hosted on github ( https://github.com/ogrisel/pignlproc )
> and I documented my use of Whirr on this wiki page:
>
>  https://github.com/ogrisel/pignlproc/wiki/Running-pignlproc-scripts-on-a-EC2-Hadoop-cluster
>
> Here are my comments and questions:
>
> 1- I need to pass my Amazon credentials in order to use hadoop distcp
> between my S3 buckets and the cluster HDFS. I figured out that I could
> edit the hadoop-site.xml file generated by Whirr to add them there, and
> it works. However, I have to do this manually every time I create a new
> cluster. Is there a way to avoid losing this configuration between
> two runs?

Not in the version you are using, but
https://issues.apache.org/jira/browse/WHIRR-176, which fixes this problem,
was committed recently. It will be available in the next release.

>
> 2- Some of my scripts have input on the order of 20 GB, and about the same
> for output. The default HDFS replication factor triples that space
> requirement. Furthermore, the NameNode web interface shows that the total
> DFS capacity for the 20 nodes is only 130 GB. With the intermediate results
> of the Pig scripts I can sometimes reach that limit. Since m1.small
> instances come with 160 GB of instance storage each, would it be possible
> to change the HDFS configuration to allocate more space to the DFS? Or is
> there a Whirr configuration parameter somewhere to do that?

It looks like Whirr isn't taking advantage of all the local storage.
I've filed https://issues.apache.org/jira/browse/WHIRR-189 for this.

>
> 3- The default Pig setup needs write permission on hdfs:///tmp for the
> default user (root), and this is not granted by the default Whirr
> configuration. I figured out that Pig 0.8.0 allows you to move that tmp
> folder somewhere else (e.g. "SET pig.temp.dir /user/root/tmp"), but it
> would be friendlier for Pig users if the default Whirr Hadoop cluster
> configuration allowed Pig to work without such custom parameters. Should I
> open a JIRA issue for this?

I've opened https://issues.apache.org/jira/browse/WHIRR-190 for this.
For the time being you could grant the permission manually after the
cluster starts.
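
Something along these lines, run as the HDFS superuser on a machine that
has the cluster's Hadoop configuration, should do it (untested sketch):

  # Make Pig's default temporary directory on HDFS writable by any user
  # (skip the mkdir if /tmp already exists)
  hadoop fs -mkdir /tmp
  hadoop fs -chmod 777 /tmp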

>
> 4- I noticed that despite telling Pig to "SET default_parallel 10", I get
> at most 1 reducer for each of my jobs. I therefore suspect that Whirr's
> Hadoop configuration is limiting this. How can I configure Whirr to start
> the cluster with a higher number of reducers (e.g. the number of tt
> nodes)?
>
> In summary, it seems that most of these questions could be addressed if
> Whirr allowed passing configuration file snippets or individual Hadoop
> property values. Is there a way to do this without having to maintain a
> custom version of the Hadoop installation scripts?

Yes, this is covered by
https://issues.apache.org/jira/browse/WHIRR-55, which I hope to get in
the next release.
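
Once that goes in, the idea is that you would put Hadoop property overrides
straight into the Whirr properties file used to launch the cluster. Purely
as a hypothetical illustration (the actual syntax, including the property
prefixes shown below, is whatever WHIRR-55 settles on):

  # Hypothetical sketch only: the "hadoop-common." / "hadoop-hdfs." prefixes
  # are assumptions here; WHIRR-55 defines the real mechanism. The lines are
  # appended to the recipe used with "whirr launch-cluster --config
  # hadoop.properties".
  {
    echo 'hadoop-hdfs.dfs.replication=2'
    echo "hadoop-common.fs.s3n.awsAccessKeyId=$AWS_ACCESS_KEY_ID"
    echo "hadoop-common.fs.s3n.awsSecretAccessKey=$AWS_SECRET_ACCESS_KEY"
  } >> hadoop.properties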

Thanks,
Tom

>
> Regards,
>
> --
> Olivier
> http://twitter.com/ogrisel - http://github.com/ogrisel
>

Re: Some feedback and some questions

Posted by Olivier Grisel <ol...@ensta.org>.
2011/1/3 Olivier Grisel <ol...@ensta.org>:
>
> 4- I noticed that despite telling Pig to "SET default_parallel 10", I get
> at most 1 reducer for each of my jobs. I therefore suspect that Whirr's
> Hadoop configuration is limiting this. How can I configure Whirr to start
> the cluster with a higher number of reducers (e.g. the number of tt
> nodes)?

Actually, this one was my mistake: I currently have a job running with 7
concurrent reducers, so everything is fine (I guess).

-- 
Olivier
http://twitter.com/ogrisel - http://github.com/ogrisel