Posted to common-user@hadoop.apache.org by Philippe Gassmann <ph...@anyware-tech.com> on 2006/06/20 14:03:51 UTC

Configuration policy

Hi everybody,

I'm a new Hadoop user, and I would like to know if it's possible to use
a file in the filesystem as a site config file.
E.g., I have my configuration file in /etc/hadoop/hadoop-site.xml, and I
want to tell Hadoop to use it, but I cannot find how to do this.
From reading the Hadoop code (Configuration.java), I guess that it is
impossible.

So here is my question: why not add Configuration.addFinalResource(File)
and Configuration.addDefaultResource(File) methods to Configuration.java?
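
For illustration only (this is just my reading of Configuration.java, so
treat it as an assumption): the existing methods take a resource name that
is looked up on the Java classpath, so a file sitting in /etc/hadoop cannot
be pointed at directly. A tiny sketch of what I mean:

import org.apache.hadoop.conf.Configuration;

public class ClasspathOnly {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        // The String argument is a classpath resource name, not a filesystem
        // path, so "/etc/hadoop/hadoop-site.xml" cannot be passed here.
        // (Re-adding hadoop-site.xml is harmless; it only illustrates the call.)
        conf.addFinalResource("hadoop-site.xml");
        System.out.println(conf.get("fs.default.name"));
    }
}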

-- 
Philippe GASSMANN
http://www.anyware-tech.com/


Re: Configuration policy

Posted by Dennis Kubes <nu...@dragonflymc.com>.
If you put the hadoop-site.xml file in the conf directory, it will
automatically be picked up and used by both the Nutch and Hadoop
servers.
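
For example (just a sketch, and it assumes the conf directory is on the
classpath the way the standard bin/hadoop script puts it there), a value set
in conf/hadoop-site.xml is then visible through a plain Configuration:

import org.apache.hadoop.conf.Configuration;

public class ShowSiteConf {
    public static void main(String[] args) {
        // hadoop-site.xml is found on the classpath (the conf directory),
        // so its values show up here without any extra wiring.
        Configuration conf = new Configuration();
        System.out.println("fs.default.name = " + conf.get("fs.default.name"));
    }
}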

Dennis

Philippe Gassmann wrote:
> Hi everybody,
>
> I'm a new Hadoop user, and I would like to know if it's possible to use
> a file in the filesystem as a site config file.
> E.g., I have my configuration file in /etc/hadoop/hadoop-site.xml, and I
> want to tell Hadoop to use it, but I cannot find how to do this.
> From reading the Hadoop code (Configuration.java), I guess that it is
> impossible.
>
> So here is my question: why not add Configuration.addFinalResource(File)
> and Configuration.addDefaultResource(File) methods to Configuration.java?
>
>   

Re: Configuration policy

Posted by Owen O'Malley <ow...@yahoo-inc.com>.
On Jun 20, 2006, at 5:10 PM, Paul Sutter wrote:

> Thanks very much for the explanation, and to confirm I will repeat it:
>
> The first occurrence of a parameter is used, and the search order is:
>
> hadoop-site.xml, then
> job.xml, then
> mapred-default.xml, then
> hadoop-default.xml
>
> That's great, and it explains behavior that had been confusing before.

Exactly correct. One other piece that can cause confusion is that all 
of the files are found via the Java classpath, and they are present 
both in the conf directory of the distribution AND in the hadoop-*.jar 
file.

One side effect of this is that I recommend never having a copy of 
hadoop-default.xml in your config directory. That is the one 
configuration file that you always want updated automatically when you 
update your distribution.
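
One quick way to sanity-check which copy wins (a sketch using plain JDK
classpath lookup, nothing Hadoop-specific) is to print where each file is
resolved from. With the setup above (no hadoop-default.xml in conf),
hadoop-default.xml should come from the hadoop-*.jar and hadoop-site.xml
from your conf directory:

public class WhichConf {
    public static void main(String[] args) {
        // Prints which copy of each config file the classpath resolves first.
        ClassLoader cl = Thread.currentThread().getContextClassLoader();
        System.out.println("hadoop-default.xml -> " + cl.getResource("hadoop-default.xml"));
        System.out.println("hadoop-site.xml    -> " + cl.getResource("hadoop-site.xml"));
    }
}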

For the record, I like setting up my hadoop directories like:

$hadoop_prefix/hadoop-0.4-dev       # distribution directory
$hadoop_prefix/conf                 # local configuration
$hadoop_prefix/current              # sym link over to the distribution directory
$hadoop_prefix/run/log              # log directory
$hadoop_prefix/run/pid              # pid directory
$hadoop_prefix/run/mapred           # map-reduce server directory
$hadoop_prefix/run/dfs/{data,name}  # dfs server directories

-- Owen


Re: Configuration policy

Posted by Paul Sutter <su...@gmail.com>.
Thanks very much for the explanation, and to confirm I will repeat it:

The first occurrence of a parameter is used, and the search order is:

hadoop-site.xml, then
job.xml, then
mapred-default.xml, then
hadoop-default.xml

That's great, and it explains behavior that had been confusing before.

It would indeed be good to rename mapred-default.xml to something that makes
more sense. (I would suggest changing both the "mapred" part and the "default"
part: "mapred" says little to set the file apart from "hadoop", and
"default" doesn't do a good job of describing something that is site-specific
rather than a factory default.)

On 6/20/06, Owen O'Malley <ow...@yahoo-inc.com> wrote:
>
>
> On Jun 20, 2006, at 9:29 AM, Paul Sutter wrote:
>
> > Speaking of configuration, is there any clear definition for the
> > purpose of
> > mapred-default.xml? My understanding is that it's an alternate,
> > misnamed,
> > site-local configuration, but we're not sure what to do with it.
> >
> > Right now, we make all of our changes to hadoop-site.xml, then copy
> > that
> > file to mapred-default.xml because we've heard that sometimes, that
> > file
> > gets checked instead of hadoop-site.xml.
> >
> > Any help appreciated
>
> My general approach is that only things that the user/application
> should never change are in hadoop-site. Largely, this is limited to the
> namenode/jobtracker addresses, port, and directories. Everything else
> goes into mapred-default.xml. This includes things like:
>
> dfs.block.size
> io.sort.factor
> io.sort.mb
> etc....
>
> This happens because of the load order of the config files:
>
> hadoop-default.xml, mapred-default.xml, job.xml, hadoop-site.xml.
>
> so job.xml will override the default files, but NOT the hadoop-site. I
> think that mapred-default would be better named site-default or
> something.
>
> -- Owen
>
>

Re: Configuration policy

Posted by Philippe Gassmann <ph...@anyware-tech.com>.
Owen O'Malley wrote:
>
> On Jun 21, 2006, at 1:00 AM, Philippe Gassmann wrote:
>
>> In the hadoop-site.xml file you have dfs.data.dir.
>> This property is used by the servers to indicate where to store data.
>> This property is also used by the client: when creating a
>> DFSClient.DFSOutputStream, a temporary file is created in
>> ${dfs.data.dir}/tmp. I think this is not the right thing to do, because
>> you need to maintain, on the client machine, the same directory structure
>> (including the rights on it) as on the servers. And when, for instance,
>> you try to install a Hadoop server and a client on the same machine for
>> testing purposes, you have to cope with terrible things...
>
> I agree that it should be a different variable. Your first reason is
> the compelling one, in my opinion. Clients and servers don't
> necessarily have the same directory structure. Is there already a jira
> issue open on this? If not, there probably should be.
>
> -- Owen
>
A JIRA issue already exists: http://issues.apache.org/jira/browse/HADOOP-88

-- 
Philippe GASSMANN
Solutions & Technologies
ANYWARE TECHNOLOGIES
Tel : +33 (0)5 61 00 52 90
Fax : +33 (0)5 61 00 51 46
http://www.anyware-tech.com/


Re: Configuration policy

Posted by Owen O'Malley <ow...@yahoo-inc.com>.
On Jun 21, 2006, at 1:00 AM, Philippe Gassmann wrote:

> In the hadoop-site.xml file you have dfs.data.dir.
> This property is used by the servers to indicate where to store data.
> This property is also used by the client: when creating a
> DFSClient.DFSOutputStream, a temporary file is created in
> ${dfs.data.dir}/tmp. I think this is not the right thing to do, because
> you need to maintain, on the client machine, the same directory
> structure (including the rights on it) as on the servers. And when,
> for instance, you try to install a Hadoop server and a client on the
> same machine for testing purposes, you have to cope with terrible
> things...

I agree that it should be a different variable. Your first reason is 
the compelling one, in my opinion. Clients and servers don't 
necessarily have the same directory structure. Is there already a jira 
issue open on this? If not, there probably should be.

-- Owen


Re: cannot open *.crc file

Posted by Doug Cutting <cu...@apache.org>.
anton@orbita1.ru wrote:
> My DFS doesn't create a crc file...

What version of Hadoop are you running?  There was a bug, fixed about 
two months ago, that would have caused this.

http://svn.apache.org/viewvc?view=rev&revision=398010

Doug

RE: cannot open *.crc file

Posted by an...@orbita1.ru.
My DFS doesn't create a crc file...
Here is the source code of my test class:
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.conf.Configuration;
import org.apache.nutch.util.NutchJob;
import org.apache.nutch.util.NutchConfiguration;

import java.io.IOException;

/**
 * @author Anton Potekhin
 * @date: 26.06.2006 11:48:24
 */

public class CreateFile {

    public static void main(String argv[]) {
        if ("create".equals(argv[0])) {
            Configuration conf = NutchConfiguration.create();
            JobConf job = new NutchJob(conf);
            try {
                FileSystem fs = FileSystem.get(job);
                // create an empty file test/done on the DFS
                fs.createNewFile(new Path("test", "done"));
            } catch (IOException e) {
                System.err.println(e.toString());
                return;
            }
        }
        if ("get".equals(argv[0])) {
            Configuration conf = NutchConfiguration.create();
            JobConf job = new NutchJob(conf);
            try {
                FileSystem fs = FileSystem.get(job);
                // copy test/done from the DFS back to the local file "done"
                fs.copyToLocalFile(new Path("test", "done"), new Path("done"));
            } catch (IOException e) {
                System.err.println(e.toString());
                return;
            }
        }

    }
}


1) I start this class with the parameter "create".
2) Then:
	# bin/hadoop dfs -ls test
	Found 1 items
	/user/root/test/done    <r 2>   0

	And I don't see a crc file...
3) I start the test class with the parameter "get". As a result I get the file done
and .done.crc in the local filesystem, plus this error:

060626 010145 Client connection to 127.0.0.1:9000: starting
060626 010145 Problem opening checksum file: test/done.  Ignoring with
exception org.apache.hadoop.ipc.RemoteException: java.io.IOException:
Cannot open filename /user/root/test/.done.crc
        at org.apache.hadoop.dfs.NameNode.open(NameNode.java:130)
        at sun.reflect.GeneratedMethodAccessor93.invoke(Unknown Source)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:585)
        at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:240)
        at org.apache.hadoop.ipc.Server$Handler.run(Server.java:218)

Why do I see this error?


-----Original Message-----
From: Konstantin Shvachko [mailto:shv@yahoo-inc.com] 
Sent: Saturday, June 24, 2006 2:41 AM
To: hadoop-user@lucene.apache.org
Subject: Re: cannot open *.crc file
Importance: High

The crc files are generated automatically; you do not need to create them.
It looks like you do everything right, and it should work.
Are you trying to read the crc files?
The crc file name pattern is ".fname.crc";
in your case it should be ".done.crc" rather than "done.crc".
--Konstantin

anton@orbita1.ru wrote:

>I create a file on DFS (for example, a file named "done"). Then I try to copy
>this file from DFS to the local filesystem. As a result I get the file in the
>local filesystem, plus this error:
>
>Problem opening checksum file: /user/root/crawl/done.  Ignoring with
>exception org.apache.hadoop.ipc.RemoteException: java.io.IOException:
>Cannot open filename /user/root/crawl/done.crc
>        at org.apache.hadoop.dfs.NameNode.open(NameNode.java:130)
>        at sun.reflect.GeneratedMethodAccessor93.invoke(Unknown Source)
>        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>        at java.lang.reflect.Method.invoke(Method.java:585)
>        at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:240)
>        at org.apache.hadoop.ipc.Server$Handler.run(Server.java:218)
>
>
>To create the file I use this code:
>	FileSystem fs = ...
>	fs.createNewFile(new Path(segments[i], "already_indexed"));
>
>To copy the file to the local filesystem I use this code:
>	fs.copyToLocalFile(..., ...);
>
>How do I create the crc file?
>Why is the crc file not created automatically when a file is made on DFS?
>How do I correctly create a file on DFS?
>




Re: cannot open *.crc file

Posted by Konstantin Shvachko <sh...@yahoo-inc.com>.
The crc files are generated automatically; you do not need to create them.
It looks like you do everything right, and it should work.
Are you trying to read the crc files?
The crc file name pattern is ".fname.crc";
in your case it should be ".done.crc" rather than "done.crc".
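
In other words (just a small sketch of the naming rule, nothing more), for a
DFS file the checksum file sits next to it with a leading dot:

import org.apache.hadoop.fs.Path;

public class CrcName {
    public static void main(String[] args) {
        Path file = new Path("/user/root/test/done");
        // checksum files follow the ".fname.crc" pattern, in the same directory
        Path crc = new Path(file.getParent(), "." + file.getName() + ".crc");
        System.out.println(crc);   // prints /user/root/test/.done.crc
    }
}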
--Konstantin

anton@orbita1.ru wrote:

>I create a file on DFS (for example, a file named "done"). Then I try to copy
>this file from DFS to the local filesystem. As a result I get the file in the
>local filesystem, plus this error:
>
>Problem opening checksum file: /user/root/crawl/done.  Ignoring with
>exception org.apache.hadoop.ipc.RemoteException: java.io.IOException:
>Cannot open filename /user/root/crawl/done.crc
>        at org.apache.hadoop.dfs.NameNode.open(NameNode.java:130)
>        at sun.reflect.GeneratedMethodAccessor93.invoke(Unknown Source)
>        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>        at java.lang.reflect.Method.invoke(Method.java:585)
>        at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:240)
>        at org.apache.hadoop.ipc.Server$Handler.run(Server.java:218)
>
>
>To create the file I use this code:
>	FileSystem fs = ...
>	fs.createNewFile(new Path(segments[i], "already_indexed"));
>
>To copy the file to the local filesystem I use this code:
>	fs.copyToLocalFile(..., ...);
>
>How do I create the crc file?
>Why is the crc file not created automatically when a file is made on DFS?
>How do I correctly create a file on DFS?
>


cannot open *.crc file

Posted by an...@orbita1.ru.
I create a file on DFS (for example, a file named "done"). Then I try to copy
this file from DFS to the local filesystem. As a result I get the file in the
local filesystem, plus this error:

Problem opening checksum file: /user/root/crawl/done.  Ignoring with
exception org.apache.hadoop.ipc.RemoteException: java.io.IOException:
Cannot open filename /user/root/crawl/done.crc
        at org.apache.hadoop.dfs.NameNode.open(NameNode.java:130)
        at sun.reflect.GeneratedMethodAccessor93.invoke(Unknown Source)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:585)
        at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:240)
        at org.apache.hadoop.ipc.Server$Handler.run(Server.java:218) 


To create the file I use this code:
	FileSystem fs = ...
	fs.createNewFile(new Path(segments[i], "already_indexed"));

To copy the file to the local filesystem I use this code:
	fs.copyToLocalFile(..., ...);

How do I create the crc file?
Why is the crc file not created automatically when a file is made on DFS?
How do I correctly create a file on DFS?
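
For reference, here is the minimal pattern I think I should be using (just my
sketch; the path names are examples, and I am assuming the checksum file is
supposed to be written by the FileSystem layer rather than by my code):

import java.io.OutputStream;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class DfsRoundTrip {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // write a small file on DFS; the checksum file should be created for us
        Path remote = new Path("test", "done");
        OutputStream out = fs.create(remote);
        out.write("done\n".getBytes());
        out.close();

        // copy it back to the local filesystem
        fs.copyToLocalFile(remote, new Path("done"));
    }
}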



Re: Configuration policy

Posted by Philippe Gassmann <ph...@anyware-tech.com>.
Owen O'Malley wrote:
>
> My general approach is that only things that the user/application
> should never change are in hadoop-site. Largely, this is limited to
> the namenode/jobtracker addresses, port, and directories.
In the hadoop-site.xml file you have dfs.data.dir.
This property is used by the servers to indicate where to store data.
This property is also used by the client: when creating a
DFSClient.DFSOutputStream, a temporary file is created in
${dfs.data.dir}/tmp. I think this is not the right thing to do, because
you need to maintain, on the client machine, the same directory structure
(including the rights on it) as on the servers. And when, for instance,
you try to install a Hadoop server and a client on the same machine for
testing purposes, you have to cope with terrible things...
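
One possible workaround in the meantime (only a sketch, and whether it takes
effect depends on how the client builds its Configuration, so treat it as an
experiment) is to give the client a config where dfs.data.dir points at a
directory it can actually write:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;

public class ClientSideConf {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // hypothetical client-side override: let the DFS client keep its
        // temporary files under a directory that exists on this machine
        conf.set("dfs.data.dir", "/tmp/hadoop-client-data");
        FileSystem fs = FileSystem.get(conf);
        System.out.println("dfs.data.dir = " + conf.get("dfs.data.dir"));
    }
}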

Cheers,
Philippe.


Re: Configuration policy

Posted by Owen O'Malley <ow...@yahoo-inc.com>.
On Jun 20, 2006, at 9:29 AM, Paul Sutter wrote:

> Speaking of configuration, is there any clear definition for the 
> purpose of
> mapred-default.xml? My understanding is that it's an alternate, 
> misnamed,
> site-local configuration, but we're not sure what to do with it.
>
> Right now, we make all of our changes to hadoop-site.xml, then copy 
> that
> file to mapred-default.xml because we've heard that sometimes, that 
> file
> gets checked instead of hadoop-site.xml.
>
> Any help appreciated

My general approach is that only things that the user/application 
should never change are in hadoop-site. Largely, this is limited to the 
namenode/jobtracker addresses, port, and directories. Everything else 
goes into mapred-default.xml. This includes things like:

dfs.block.size
io.sort.factor
io.sort.mb
etc....

This happens because of the load order of the config files:

hadoop-default.xml, mapred-default.xml, job.xml, hadoop-site.xml.

So job.xml will override the default files, but NOT hadoop-site.xml. I 
think that mapred-default would be better named site-default or 
something.
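
To make that concrete (a rough sketch; the numbers are made up), a per-job
override goes through the JobConf and therefore ends up in job.xml:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapred.JobConf;

public class TuneJob {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        JobConf job = new JobConf(conf);
        // These land in job.xml, so they override hadoop-default.xml and
        // mapred-default.xml, but anything set in hadoop-site.xml still wins
        // once the load order above is applied on the cluster side.
        job.setInt("io.sort.mb", 200);
        job.setInt("io.sort.factor", 50);
        // locally this prints 200; on the cluster the load order decides
        System.out.println("io.sort.mb = " + job.getInt("io.sort.mb", -1));
    }
}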

-- Owen


Re: Configuration policy

Posted by Dennis Kubes <nu...@dragonflymc.com>.
From what I have looked at, there isn't a time when mapred-default.xml 
would get read where hadoop-site.xml would not.  I am not 100% sure on 
this, but I know that we don't use the mapred-default.xml file and 
haven't had any problems.

Dennis

Paul Sutter wrote:
> Speaking of configuration, is there any clear definition for the 
> purpose of
> mapred-default.xml? My understanding is that it's an alternate, misnamed,
> site-local configuration, but we're not sure what to do with it.
>
> Right now, we make all of our changes to hadoop-site.xml, then copy that
> file to mapred-default.xml because we've heard that sometimes, that file
> gets checked instead of hadoop-site.xml.
>
> Any help appreciated
>
> Paul
>
>
> On 6/20/06, Benjamin Reed <br...@yahoo-inc.com> wrote:
>>
>> I ran into the same problem. Part of the patch for Issue 303 addresses
>> this.
>>
>> http://issues.apache.org/jira/browse/HADOOP-303
>>
>> ben
>>
>> On Tuesday 20 June 2006 05:03, Philippe Gassmann wrote:
>> > Hi everybody,
>> >
>> > I'm a new Hadoop user, and I would like to know if it's possible to
>> > use a file in the filesystem as a site config file.
>> > E.g., I have my configuration file in /etc/hadoop/hadoop-site.xml,
>> > and I want to tell Hadoop to use it, but I cannot find how to do this.
>> > From reading the Hadoop code (Configuration.java), I guess that it is
>> > impossible.
>> >
>> > So here is my question: why not add Configuration.addFinalResource(File)
>> > and Configuration.addDefaultResource(File) methods to Configuration.java?
>>
>

Re: Configuration policy

Posted by Paul Sutter <su...@gmail.com>.
Speaking of configuration, is there any clear definition for the purpose of
mapred-default.xml? My understanding is that it's an alternate, misnamed,
site-local configuration, but we're not sure what to do with it.

Right now, we make all of our changes to hadoop-site.xml, then copy that
file to mapred-default.xml because we've heard that sometimes, that file
gets checked instead of hadoop-site.xml.

Any help appreciated

Paul


On 6/20/06, Benjamin Reed <br...@yahoo-inc.com> wrote:
>
> I ran into the same problem. Part of the patch for Issue 303 addresses
> this.
>
> http://issues.apache.org/jira/browse/HADOOP-303
>
> ben
>
> On Tuesday 20 June 2006 05:03, Philippe Gassmann wrote:
> > Hi everybody,
> >
> > I'm a new Hadoop user, and I would like to know if it's possible to use
> > a file in the filesystem as a site config file.
> > E.g., I have my configuration file in /etc/hadoop/hadoop-site.xml, and I
> > want to tell Hadoop to use it, but I cannot find how to do this.
> > From reading the Hadoop code (Configuration.java), I guess that it is
> > impossible.
> >
> > So here is my question: why not add Configuration.addFinalResource(File)
> > and Configuration.addDefaultResource(File) methods to Configuration.java?
>

Re: Configuration policy

Posted by Benjamin Reed <br...@yahoo-inc.com>.
I ran into the same problem. Part of the patch for Issue 303 addresses this.

http://issues.apache.org/jira/browse/HADOOP-303

ben

On Tuesday 20 June 2006 05:03, Philippe Gassmann wrote:
> Hi everybody,
>
> I'm a new Hadoop user, and I would like to know if it's possible to use
> a file in the filesystem as a site config file.
> E.g., I have my configuration file in /etc/hadoop/hadoop-site.xml, and I
> want to tell Hadoop to use it, but I cannot find how to do this.
> From reading the Hadoop code (Configuration.java), I guess that it is
> impossible.
>
> So here is my question: why not add Configuration.addFinalResource(File)
> and Configuration.addDefaultResource(File) methods to Configuration.java?