Posted to hdfs-user@hadoop.apache.org by Jan Warchoł <ja...@codilime.com> on 2014/07/14 16:03:05 UTC

changing split size in Hadoop configuration

Hello,

I recently got a "Split metadata size exceeded 10000000" error when running
Cascading jobs with very big joins.  I found that I should change the
mapreduce.jobtracker.split.metainfo.maxsize property in the Hadoop
configuration by adding this to the mapred-site.xml file:

  <property>
    <!-- allow more space for split metadata (default is 10000000) -->
    <name>mapreduce.jobtracker.split.metainfo.maxsize</name>
    <value>1000000000</value>
  </property>

but it didn't seem to have any effect - I'm probably doing something wrong.

Where should I add this change so that it has the desired effect?  Do I
understand correctly that a JobTracker restart is required after making the
change? The cluster I'm working on runs Hadoop 1.0.4.

thanks for any help,
-- 
*Jan Warchoł*
*Software Engineer*

-----------------------------------------
M: +48 509 078 203
 E: jan.warchol@codilime.com
-----------------------------------------
CodiLime Sp. z o.o. - Ltd. company with its registered office in Poland,
01-167 Warsaw, ul. Zawiszy 14/97. Registered by The District Court for the
Capital City of Warsaw, XII Commercial Department of the National Court
Register. Entered into National Court Register under No. KRS 0000388871.
Tax identification number (NIP) 5272657478. Statistical number
(REGON) 142974628.

Re: changing split size in Hadoop configuration

Posted by Jan Warchoł <ja...@codilime.com>.
Hi,

On Mon, Jul 14, 2014 at 7:50 PM, Adam Kawa <ka...@gmail.com> wrote:

> It sounds like a JobTracker setting, so a restart looks to be required.
>

OK.


> You can verify it in pseudo-distributed mode by setting it to a very low
> value, restarting the JT, and seeing if you get the exception that prints
> this new value.
>

Well, the funny thing is that it did work when I made the change in a
pseudo-distributed "cluster" on my laptop, but it didn't have any effect
when I tried it on the real cluster.  I probably changed the wrong
configuration file.  How do I check where the configuration that is
actually used when (re)starting the JobTracker comes from?
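
(A sketch of one way to check, assuming a standard Hadoop 1.x setup: the
JobTracker serves the configuration it actually loaded at its /conf URL, and
the daemon's command line shows which conf directory it was started with.

  # <jobtracker-host> is a placeholder; 50030 is the default JT web UI port.
  # The servlet returns the live configuration as XML (possibly unformatted).
  curl http://<jobtracker-host>:50030/conf | grep split.metainfo

  # the JobTracker's java command line carries the conf directory on its classpath
  ps aux | grep JobTracker

If the /conf output still shows the old value, or no entry at all, the running
JobTracker never loaded the edited file.)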


On Mon, Jul 14, 2014 at 8:54 PM, Bertrand Dechoux <de...@gmail.com>
wrote:

> For what it's worth, mapreduce.jobtracker.split.metainfo.maxsize is
> related to the size of the file containing the information describing the
> input splits. It is not directly related to the volume of data but to the
> number of splits, which can explode when using too many (small) files.
> It's basically a safeguard. Alternatively, you might want to reduce the
> number of splits; raising the block size is one way to do it.
>

OK, I'll keep this in mind and try changing the block size if necessary.

thanks,
-- 
*Jan Warchoł*
*Software Engineer*

Re: changing split size in Hadoop configuration

Posted by Bertrand Dechoux <de...@gmail.com>.
For what it's worth, mapreduce.jobtracker.split.metainfo.maxsize is related
to the size of the file containing the information describing the input
splits. It is not directly related to the volume of data but to the number
of splits, which can explode when using too many (small) files. It's
basically a safeguard. Alternatively, you might want to reduce the number
of splits; raising the block size is one way to do it.
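
For example (a sketch, assuming the Hadoop 1.x property names), raising the
minimum split size in mapred-site.xml makes each split cover more data, so
fewer splits, and therefore less split metadata, are generated:

  <property>
    <!-- never create splits smaller than 256 MB (268435456 bytes) -->
    <name>mapred.min.split.size</name>
    <value>268435456</value>
  </property>

Raising the HDFS block size has a similar effect, but it only applies to
files written after the change.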

Bertrand Dechoux


On Mon, Jul 14, 2014 at 7:50 PM, Adam Kawa <ka...@gmail.com> wrote:

> It sounds like a JobTracker setting, so a restart looks to be required.
>
> You can verify it in pseudo-distributed mode by setting it to a very low
> value, restarting the JT, and seeing if you get the exception that prints
> this new value.
>
> Sent from my iPhone
>
> On 14 jul 2014, at 16:03, Jan Warchoł <ja...@codilime.com> wrote:
>
> Hello,
>
> I recently got a "Split metadata size exceeded 10000000" error when running
> Cascading jobs with very big joins.  I found that I should change the
> mapreduce.jobtracker.split.metainfo.maxsize property in the Hadoop
> configuration by adding this to the mapred-site.xml file:
>
>   <property>
>     <!-- allow more space for split metadata (default is 10000000) -->
>     <name>mapreduce.jobtracker.split.metainfo.maxsize</name>
>     <value>1000000000</value>
>   </property>
>
> but it didn't seem to have any effect - I'm probably doing something wrong.
>
> Where should I add this change so that it has the desired effect?  Do I
> understand correctly that a JobTracker restart is required after making the
> change? The cluster I'm working on runs Hadoop 1.0.4.
>
> thanks for any help,
> --
> *Jan Warchoł*
> *Software Engineer*

Re: changing split size in Hadoop configuration

Posted by Adam Kawa <ka...@gmail.com>.
It sounds like a JobTracker setting, so a restart looks to be required.

You can verify it in pseudo-distributed mode by setting it to a very low value, restarting the JT, and seeing if you get the exception that prints this new value.
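
For example (a sketch, assuming a standard Hadoop 1.x install with the stock
daemon scripts), on the JobTracker node you could set:

  <property>
    <!-- deliberately tiny, so any real job should trip the check -->
    <name>mapreduce.jobtracker.split.metainfo.maxsize</name>
    <value>1000</value>
  </property>

  # restart the JobTracker so it rereads mapred-site.xml
  $HADOOP_HOME/bin/hadoop-daemon.sh stop jobtracker
  $HADOOP_HOME/bin/hadoop-daemon.sh start jobtracker

If the job then fails with "Split metadata size exceeded 1000", the JobTracker
is reading the file you edited.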

Sent from my iPhone

> On 14 jul 2014, at 16:03, Jan Warchoł <ja...@codilime.com> wrote:
> 
> Hello,
> 
> I recently got a "Split metadata size exceeded 10000000" error when running Cascading jobs with very big joins.  I found that I should change the mapreduce.jobtracker.split.metainfo.maxsize property in the Hadoop configuration by adding this to the mapred-site.xml file:
> 
>   <property>
>     <!-- allow more space for split metadata (default is 10000000) -->
>     <name>mapreduce.jobtracker.split.metainfo.maxsize</name>
>     <value>1000000000</value>
>   </property>
> 
> but it didn't seem to have any effect - I'm probably doing something wrong.
> 
> Where should I add this change so that it has the desired effect?  Do I understand correctly that a JobTracker restart is required after making the change? The cluster I'm working on runs Hadoop 1.0.4.
> 
> thanks for any help,
> -- 
> Jan Warchoł
> Software Engineer
