Posted to dev@systemml.apache.org by Ethan Xu <et...@us.ibm.com> on 2016/02/04 18:45:03 UTC

Fixed hadoop configuration to run dml on large dataset

Thanks to help from the team, we fixed a Hadoop classpath configuration 
issue so that DML scripts successfully invoke MapReduce jobs.

I'm carrying the discussion over here in case other people run into the 
same problem.

----Problem description----
I was running a simple DML script to carry out a data transformation on a 
Hadoop cluster (Hadoop 2.0.0, CDH 4.2.1). The script ran successfully on 
1GB of data, but threw an error on ~30GB of data.

It looks like SystemML did not need to invoke MapReduce jobs on the small 
data set (the console output showed 'Number of executed MR Jobs: 0'). On 
the larger data it attempted to run MR jobs and threw the following error:

...
Caused by: java.lang.ClassNotFoundException: Class 
com.hadoop.compression.lzo.LzoCodec not found
        at 
org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:1493)
        at 
org.apache.hadoop.io.compress.CompressionCodecFactory.getCodecClasses(CompressionCodecFactory.java:127)
        ... 38 more


----Solution----
The missing class com.hadoop.compression.lzo.LzoCodec is contained in the 
lzo-hadoop jar file:
http://search.maven.org/#search%7Cga%7C1%7Cfc%3A%22com.hadoop.compression.lzo.LzoCodec%22

Installation and configuration information for the LZO parcel can be 
found here:
http://www.cloudera.com/documentation/archive/manager/4-x/4-7-3/Cloudera-Manager-Installation-Guide/cmig_install_LZO_Compression.html
and in this Stack Overflow answer:
http://stackoverflow.com/questions/23441142/class-com-hadoop-compression-lzo-lzocodec-not-found-for-spark-on-cdh-5

In my case, it turned out we had the LZO jar, but it was not included in 
the classpath. Explicitly pointing to the jar at DML job submission via 
-libjars (https://hadoop.apache.org/docs/r1.2.1/commands_manual.html#jar) 
did the trick:

hadoop jar ./SystemML.jar -libjars <path to lzo jar>/hadoop-lzo-0.4.15.jar 
-f ./transform.dml -nvargs X=<path on HDFS>/file-to-transform
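A related option, as an aside (this is an assumption on my part, not 
something we tested in the fix above): HADOOP_CLASSPATH can put the jar on 
the client JVM's classpath, which helps when the class is needed during job 
submission itself. Note that -libjars is still what ships the jar to the 
map/reduce tasks. The jar path below is hypothetical; substitute your 
actual location:

```shell
# Hypothetical jar location; adjust for your cluster.
LZO_JAR="/opt/hadoop/lib/hadoop-lzo-0.4.15.jar"

# Prepend the jar to HADOOP_CLASSPATH, preserving any existing value.
# This affects only the submitting (client) JVM, not the MR tasks.
export HADOOP_CLASSPATH="$LZO_JAR${HADOOP_CLASSPATH:+:$HADOOP_CLASSPATH}"
echo "$HADOOP_CLASSPATH"
```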

Ethan


Re: Fixed hadoop configuration to run dml on large dataset

Posted by Deron Eriksson <de...@gmail.com>.
Ethan, thank you for posting the fix to the LZO configuration issue.

Deron


On Thu, Feb 4, 2016 at 9:45 AM, Ethan Xu <et...@us.ibm.com> wrote:

> Thanks to help from the team, we fixed a hadoop classpath configuration so
> dml successfully invokes MapReduce jobs.