Posted to user@ignite.apache.org by Haithem Turki <tu...@gmail.com> on 2016/05/26 21:56:14 UTC

IGFS YARN setup

Hello,

I'm interested in using IGFS as a Hadoop caching layer. The use case
revolves largely around Spark jobs running on a YARN cluster that persist
data to S3 (although I have some non-Spark jobs running too, so I would
ideally integrate at the Hadoop filesystem layer). I'm excited about the
potential speedups this could bring :)

I took a stab at deploying this for the first time, and had some questions:

- Ideally I was envisioning deploying nodes via YARN to take advantage of
dynamic scaling and use any available memory on the cluster. I wanted to
make sure that this is indeed a supported workflow / on the roadmap, as I
hit a few bumps along the way:
* I ended up needing to dump pretty much all of my Hadoop-related jars to
HDFS for my nodes to start up correctly (or else I was getting
ClassNotFoundExceptions for everything from Guava to Hadoop to ASM to
Ignite classes not being there). Am I doing something horribly wrong / have
you guys considered packaging a fat jar for the non-Hadoop dependencies at
least? (I've sketched the deployment properties I'm passing below this list.)
* Couldn't specify the YARN queue despite attempting to
set -Dmapreduce.job.queuename via the IGNITE_JVM_OPTS variable (
https://issues.apache.org/jira/browse/IGNITE-2738?)
* Seems like dynamic allocation isn't supported? Wanted to get a sense of
whether this is on the roadmap
* Since YARN allocates containers at random, it's pretty onerous to figure
out which hostnames have Ignite nodes running on them and to specify those
in the URL. For now I have TCP enabled (Ignite doesn't seem to die on port
conflicts if multiple nodes are running on the same machine) and I guess I
can set up a reverse proxy so that I can point towards a stable URL, but
it's not great / doesn't scale well, so I was wondering if there were other
suggestions on how to configure discovery (maybe spin up a local node
outside of YARN that leverages the cluster discovery?)
* I also wasn't clear on how cluster routing/balancing works. If I point my
Hadoop jobs at host1:10500 via TCP (rough core-site.xml sketch below), will
all reads/writes route through that node or do they somehow get balanced?
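
(For reference, here's roughly how I was planning to wire the Hadoop side
up; this is just a sketch - the igfs:// URI authority format is my best
reading of the docs, and host1:10500 is the endpoint I mentioned above:)

<!-- core-site.xml (sketch) -->
<configuration>
    <!-- IGFS implementation for the Hadoop FileSystem API (v1). -->
    <property>
        <name>fs.igfs.impl</name>
        <value>org.apache.ignite.hadoop.fs.v1.IgniteHadoopFileSystem</value>
    </property>
    <!-- IGFS implementation for the Hadoop AbstractFileSystem API (v2). -->
    <property>
        <name>fs.AbstractFileSystem.igfs.impl</name>
        <value>org.apache.ignite.hadoop.fs.v2.IgniteHadoopFileSystem</value>
    </property>
    <!-- Default FS pointing at a single node's IGFS TCP endpoint
         ("igfs" is the file system name, 10500 is the default IPC port). -->
    <property>
        <name>fs.defaultFS</name>
        <value>igfs://igfs@host1:10500/</value>
    </property>
</configuration>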

Or is this completely crazy / should I just deploy IGFS outside of YARN?
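
(For context, here's roughly what I'm feeding the YARN deployer right now.
Aside from IGNITE_JVM_OPTS and IGNITE_XML_CONFIG, the property names are my
reading of the YARN deployment docs, and the paths/sizes/queue name are
placeholders:)

# cluster.properties (sketch) handed to the ignite-yarn deployer
IGNITE_NODE_COUNT=4
IGNITE_RUN_CPU_PER_NODE=2
IGNITE_MEMORY_PER_NODE=4096
# Spring XML config uploaded to HDFS (my default-config.xml lives here).
IGNITE_XML_CONFIG=/ignite/config/default-config.xml
# HDFS directory with the extra jars the containers need on their classpath
# (this is where I ended up dumping all the Hadoop-related jars).
IGNITE_USERS_LIBS=/ignite/libs/
# JVM options for the Ignite nodes; the queue setting below had no effect for me.
IGNITE_JVM_OPTS=-Dmapreduce.job.queuename=my_queue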

- Is there a way of configuring the local filesystem as a tiered storage
layer (or is it on the roadmap)? The use case is that even reading from a
local SSD is much faster than reading from S3.

Thanks in advance!
- Haithem

Re: IGFS YARN setup

Posted by Haithem Turki <tu...@gmail.com>.
Thanks Nikolai! Re: dynamic allocation - I was imagining that we would be
able to dynamically scale the number of Ignite nodes up and down depending
on the free resources available on the YARN cluster (similar to what Spark
does). Bonus points if it leverages the YARN auxiliary service framework to
persist/recover from local disk in case of preemption (Spark also does this
with the external shuffle service).

Re: IGFS YARN setup

Posted by Nikolai Tikhonov <nt...@apache.org>.
Hi, Haithem Turki!

> * Seems like dynamic allocation isn't supported? Wanted to get a sense of
> whether this is on the roadmap

Could you please describe in more detail what you want from dynamic
allocation?


> * Since YARN allocates containers at random, it's pretty onerous to figure
> out which hostnames have Ignite nodes running on them and to specify those
> in the URL. For now I have TCP enabled (Ignite doesn't seem to die on port
> conflicts if multiple nodes are running on the same machine) and I guess I
> can set up a reverse proxy so that I can point towards a stable URL, but
> it's not great / doesn't scale well, so I was wondering if there were other
> suggestions on how to configure discovery (maybe spin up a local node
> outside of YARN that leverages the cluster discovery?)

I've created a ticket, and you can track its status there [1]. Right now I
don't see a solution that looks more elegant than the one you describe.
Yes, you can start an Ignite node outside of the YARN cluster and use it as
a stable URL.

[1]  https://issues.apache.org/jira/browse/IGNITE-3214
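
For example, the discovery configuration on the nodes inside YARN could
point at that stable node through the static IP finder. A rough sketch (the
hostname is a placeholder; 47500..47509 is the default discovery port
range):

<property name="discoverySpi">
    <bean class="org.apache.ignite.spi.discovery.tcp.TcpDiscoverySpi">
        <property name="ipFinder">
            <bean class="org.apache.ignite.spi.discovery.tcp.ipfinder.vm.TcpDiscoveryVmIpFinder">
                <property name="addresses">
                    <list>
                        <!-- Node started outside of YARN at a stable address. -->
                        <value>stable-host:47500..47509</value>
                    </list>
                </property>
            </bean>
        </property>
    </bean>
</property>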

Re: IGFS YARN setup

Posted by Haithem Turki <tu...@gmail.com>.
I also had to create a "default-config.xml" file, point to it in HDFS via
"IGNITE_XML_CONFIG", and then add the following property to the "igfs-data"
bean; not sure if that's expected...

<property name="affinityMapper">
    <bean class="org.apache.ignite.igfs.IgfsGroupDataBlocksKeyMapper">
        <!-- How many sequential blocks will be stored on the same node. -->
        <constructor-arg value="512"/>
    </bean>
</property>
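
For completeness, the surrounding bean in my default-config.xml ends up
looking roughly like this (trimmed; apart from the affinityMapper, the
cache settings are just my reading of the shipped Hadoop accelerator
config, so treat them as a sketch):

<bean class="org.apache.ignite.configuration.CacheConfiguration">
    <!-- Referenced from FileSystemConfiguration via dataCacheName. -->
    <property name="name" value="igfs-data"/>
    <property name="cacheMode" value="PARTITIONED"/>
    <property name="backups" value="0"/>
    <property name="affinityMapper">
        <bean class="org.apache.ignite.igfs.IgfsGroupDataBlocksKeyMapper">
            <!-- How many sequential blocks will be stored on the same node. -->
            <constructor-arg value="512"/>
        </bean>
    </property>
</bean>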

