Posted to hdfs-user@hadoop.apache.org by rab ra <ra...@gmail.com> on 2013/08/21 14:09:10 UTC

running map tasks in remote node

Hello,

Here is the newbie question of the day.

For one of my use cases, I want to use Hadoop MapReduce without HDFS.
Here, I will have a text file containing a list of file names to process.
Assume that I have 10 lines (10 files to process) in the input text file
and I wish to generate 10 map tasks and execute them in parallel on 10
nodes. I started with the basic Hadoop tutorial, set up a single-node
Hadoop cluster, and successfully tested the wordcount code.
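
A minimal driver sketch for this kind of job (old mapred API, matching the
configuration below; FileNameMapper stands for a user-supplied mapper that
treats each input line, i.e. one file name, as one unit of work, and a sketch
of such a mapper appears later in the thread):

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.lib.NLineInputFormat;

public class FileListDriver {
  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(FileListDriver.class);
    conf.setJobName("process-file-list");

    // One map task per line of the listing file.
    conf.setInputFormat(NLineInputFormat.class);
    conf.setInt("mapred.line.input.format.linespermap", 1);

    // User-supplied mapper; each value it receives is one file name.
    conf.setMapperClass(FileNameMapper.class);
    conf.setNumReduceTasks(0);

    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(Text.class);

    FileInputFormat.setInputPaths(conf, new Path(args[0]));  // the listing file
    FileOutputFormat.setOutputPath(conf, new Path(args[1])); // must not already exist

    JobClient.runJob(conf);
  }
}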

Now, I took two machines, A (master) and B (slave). I did the configuration
below on these machines to set up a two-node cluster.

hdfs-site.xml

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
<property>
          <name>dfs.replication</name>
          <value>1</value>
</property>
<property>
  <name>dfs.name.dir</name>
  <value>/tmp/hadoop-bala/dfs/name</value>
</property>
<property>
  <name>dfs.data.dir</name>
  <value>/tmp/hadoop-bala/dfs/data</value>
</property>
<property>
     <name>mapred.job.tracker</name>
    <value>A:9001</value>
</property>

</configuration>

mapred-site.xml

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<!-- Put site-specific property overrides in this file. -->

<configuration>
<property>
            <name>mapred.job.tracker</name>
            <value>A:9001</value>
</property>
<property>
          <name>mapreduce.tasktracker.map.tasks.maximum</name>
           <value>1</value>
</property>
</configuration>

core-site.xml

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
         <property>
                <name>fs.default.name</name>
                <value>hdfs://A:9000</value>
        </property>
</configuration>


On both A and B, I have a file named ‘slaves’ with an entry ‘B’ in it and
another file called ‘masters’ containing an entry ‘A’.

I have kept my input file on A. I see the map method process the input file
line by line, but all the lines are processed on A. Ideally, I would expect
that processing to take place on B.

Can anyone highlight where I am going wrong?

 regards
rab

Re: running map tasks in remote node

Posted by Harsh J <ha...@cloudera.com>.
In multi-node mode, MR requires a distributed filesystem (such as
HDFS) to be able to run.




-- 
Harsh J

Re: running map tasks in remote node

Posted by rab ra <ra...@gmail.com>.
Dear Yong,

Thanks for your elaborate answer. Your answer really makes sense, and I am
ending up with something close to it, except for the shared storage.

In my use case, I am not allowed to use any shared storage system. The
reason is that the slave nodes may not be safe for hosting sensitive data
(they could belong to a different enterprise, perhaps in a cloud). I do
agree that we still need this data on the slave nodes while processing, and
hence need to transfer the data from the enterprise node to the processing
nodes. But that is acceptable, as it is better than using the slave nodes
for storage. If I could use shared storage, then I could use HDFS itself. I
wrote simple example code with a 2-node cluster setup and was testing
various input formats such as WholeFileInputFormat, NLineInputFormat, and
TextInputFormat. I faced issues when I did not want to use shared storage,
as I explained in my last email. I was thinking that having the input file
on the master node (job tracker) would be sufficient and that it would send
a portion of the input file to the map process on the second node (slave).
But this was not the case, as the method setInputPath() (and the MapReduce
system) expects this path to be a shared one. All these observations lead
to a straightforward question: "Does the MapReduce system expect a shared
storage system? Do the input directories need to be present on that shared
system? Is there a workaround for this issue?" In fact, I am prepared to
use HDFS just to satisfy the MapReduce system and feed input to it. For the
actual processing, I shall end up transferring the required data files to
the slave nodes.
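
A rough sketch of such a mapper (old mapred API): each call to map() receives
one line of the listing file, i.e. one file name. The transfer step is only a
stub here, since the actual copy mechanism (scp, HTTP, or whatever the policy
allows) is outside the scope of this sketch.

import java.io.File;
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class FileNameMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, Text> {

  public void map(LongWritable offset, Text fileName,
                  OutputCollector<Text, Text> output, Reporter reporter)
      throws IOException {
    String name = fileName.toString().trim();
    if (name.isEmpty()) {
      return;
    }
    // Stubbed transfer: copy the file from the enterprise node to this
    // task node before processing it locally.
    File local = fetchFromEnterpriseNode(name);

    // Placeholder "processing": emit the file name and its local size.
    output.collect(new Text(name), new Text(Long.toString(local.length())));
  }

  private File fetchFromEnterpriseNode(String name) throws IOException {
    // The real transfer mechanism would go here.
    throw new IOException("transfer not implemented for " + name);
  }
}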

I do note that I cannot enjoy the advantages that come with HDFS, such as
data replication and data-locality awareness.


with thanks and regards
rabmdu







On Fri, Aug 23, 2013 at 7:41 PM, java8964 java8964 <ja...@hotmail.com> wrote:

> It is possible to do what you are trying to do, but it only makes sense if
> your MR job is very CPU intensive and you want to use the CPU resources in
> your cluster rather than the IO.
>
> You may want to do some research about HDFS's role in Hadoop. First, it
> provides central storage for all the files that will be processed by MR
> jobs. If you don't want to use HDFS, you need to identify a shared storage
> system available to all the nodes in your cluster. HDFS is NOT required,
> but a shared storage is required in the cluster.
>
> To simplify your question, let's just use NFS to replace HDFS. It is
> feasible as a POC to help you understand how to set it up.
>
> Assume you have a cluster with 3 nodes (one NN, two DNs; the JT running on
> the NN, and TTs running on the DNs). Instead of using HDFS, you can try to
> use NFS this way:
>
> 1) Mount /share_data on both of your data nodes. They need to have the
> same mount, so that /share_data on each data node points to the same NFS
> location. It doesn't matter where you host this NFS share; just make sure
> each data node mounts it at the same /share_data path.
> 2) Create a folder under /share_data and put all your data into that folder.
> 3) When you kick off your MR jobs, give a full URL for the input path, like
> 'file:///share_data/myfolder', and a full URL for the output path, like
> 'file:///share_data/output' (see the driver sketch after this message). In
> this way, each mapper understands that it will in fact read the data from
> the local file system instead of HDFS. That is why you want each task node
> to have the same mount path: 'file:///share_data/myfolder' should resolve
> the same way on every task node. Check this and make sure that
> /share_data/myfolder points to the same path on each of your task nodes.
> 4) You want each mapper to process one file, so instead of the default
> 'TextInputFormat', use a 'WholeFileInputFormat'. This makes sure that every
> file under '/share_data/myfolder' is not split and is handed whole to a
> single mapper.
> 5) With the above setup, I don't think you need to start the NameNode or
> DataNode processes any more; you only use the JobTracker and TaskTrackers.
> 6) Obviously, when your data is big, the NFS share will be your bottleneck.
> Maybe you can replace it with shared network storage, but the above setup
> gives you a starting point.
> 7) Keep in mind that with a setup like the above, you lose data replication,
> data locality, etc. That's why I said it ONLY makes sense if your MR job is
> CPU intensive: you simply want to use the mapper/reducer tasks to process
> your data, without any IO scalability.
>
> Make sense?
>
> Yong
>
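
A minimal driver sketch for step 3 above (old mapred API; the paths are the
ones from this example, an identity map stands in for real processing, and a
WholeFileInputFormat as in step 4 would be plugged in instead of
TextInputFormat once written):

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.TextInputFormat;
import org.apache.hadoop.mapred.lib.IdentityMapper;

public class NfsShareDriver {
  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(NfsShareDriver.class);
    conf.setJobName("process-nfs-share");

    // Full file:// URLs, so every task node resolves the same NFS mount
    // instead of going to HDFS.
    FileInputFormat.setInputPaths(conf, new Path("file:///share_data/myfolder"));
    FileOutputFormat.setOutputPath(conf, new Path("file:///share_data/output"));

    // Plain text input and an identity map, just to exercise the paths.
    conf.setInputFormat(TextInputFormat.class);
    conf.setMapperClass(IdentityMapper.class);
    conf.setNumReduceTasks(0);

    conf.setOutputKeyClass(LongWritable.class);
    conf.setOutputValueClass(Text.class);

    JobClient.runJob(conf);
  }
}
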
> ------------------------------
> Date: Fri, 23 Aug 2013 15:43:38 +0530
> Subject: Re: running map tasks in remote node
>
> From: rabmdu@gmail.com
> To: user@hadoop.apache.org
>
> Thanks for the reply.
>
> I am basically exploring possible ways to work with the Hadoop framework
> for one of my use cases. I have limitations on using HDFS, but I agree
> that using MapReduce in conjunction with HDFS makes sense.
>
> I successfully tested a WholeFileInputFormat after some googling.
>
> Now, coming to my use case: I would like to keep some files on my master
> node and do some processing on the cloud nodes. Policy does not allow us to
> configure and use the cloud nodes as HDFS. However, I would like to spawn a
> map process on those nodes. Hence, I set the input path to the local file
> system, for example $HOME/inputs. I have a file listing filenames (10
> lines) in this input directory. I use NLineInputFormat and spawn 10 map
> processes. Each map process gets one line. The map process will then do a
> file transfer and process it. However, in the map I get a
> FileNotFoundException for $HOME/inputs. I am sure this directory is present
> on my master but not on the slave nodes. When I copy this input directory
> to the slave nodes, it works fine. I am not able to figure out how to fix
> this or the reason for the error. I do not understand why it complains that
> the input directory is not present. As far as I know, a slave node just
> gets a map task, and the map method receives the contents of the input
> file. That should be enough for the map logic to work.
>
>
> with regards
> rabmdu
>
>
>
>
> On Thu, Aug 22, 2013 at 4:40 PM, java8964 java8964 <ja...@hotmail.com> wrote:
>
> If you don't plan to use HDFS, what kind of shared file system are you
> going to use across the cluster? NFS?
> For what you want to do, even though it doesn't make too much sense, the
> first problem you need to solve is the shared file system.
>
> Second, if you want to process the files file by file, instead of block by
> block as in HDFS, then you need to use a WholeFileInputFormat (google how to
> write one; a sketch follows at the end of this message). Then you don't need
> a file listing all the files to be processed; just put them into one folder
> in the shared file system and send that folder to your MR job. In this way,
> as long as each node can access it through some file system URL, each file
> will be processed by its own mapper.
>
> Yong
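
A minimal sketch of such a WholeFileInputFormat (old mapred API, reading each
file into a single BytesWritable record; error handling and very large files
are not dealt with here):

import java.io.IOException;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileSplit;
import org.apache.hadoop.mapred.InputSplit;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.RecordReader;
import org.apache.hadoop.mapred.Reporter;

public class WholeFileInputFormat
    extends FileInputFormat<NullWritable, BytesWritable> {

  @Override
  protected boolean isSplitable(FileSystem fs, Path file) {
    return false;  // never split: one file goes to exactly one mapper
  }

  @Override
  public RecordReader<NullWritable, BytesWritable> getRecordReader(
      InputSplit split, JobConf job, Reporter reporter) throws IOException {
    return new WholeFileRecordReader((FileSplit) split, job);
  }

  static class WholeFileRecordReader
      implements RecordReader<NullWritable, BytesWritable> {
    private final FileSplit split;
    private final JobConf job;
    private boolean processed = false;

    WholeFileRecordReader(FileSplit split, JobConf job) {
      this.split = split;
      this.job = job;
    }

    public boolean next(NullWritable key, BytesWritable value) throws IOException {
      if (processed) {
        return false;
      }
      // Read the entire file into the value in one go.
      byte[] contents = new byte[(int) split.getLength()];
      Path file = split.getPath();
      FileSystem fs = file.getFileSystem(job);
      FSDataInputStream in = null;
      try {
        in = fs.open(file);
        IOUtils.readFully(in, contents, 0, contents.length);
        value.set(contents, 0, contents.length);
      } finally {
        IOUtils.closeStream(in);
      }
      processed = true;
      return true;
    }

    public NullWritable createKey() { return NullWritable.get(); }
    public BytesWritable createValue() { return new BytesWritable(); }
    public long getPos() { return processed ? split.getLength() : 0; }
    public float getProgress() { return processed ? 1.0f : 0.0f; }
    public void close() throws IOException { }
  }
}

Returning false from isSplitable() is what guarantees one mapper per file; the
record reader then hands the mapper the whole file as a single key/value pair.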

> </property>
> <property>
>   <name>dfs.name.dir</name>
>   <value>/tmp/hadoop-bala/dfs/name</value>
> </property>
> <property>
>   <name>dfs.data.dir</name>
>   <value>/tmp/hadoop-bala/dfs/data</value>
> </property>
> <property>
>      <name>mapred.job.tracker</name>
>     <value>A:9001</value>
> </property>
>
> </configuration>
>
> mapred-site.xml
>
> <?xml version="1.0"?>
> <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
>
> <!-- Put site-specific property overrides in this file. -->
>
> <configuration>
> <property>
>             <name>mapred.job.tracker</name>
>             <value>A:9001</value>
> </property>
> <property>
>           <name>mapreduce.tasktracker.map.tasks.maximum</name>
>            <value>1</value>
> </property>
> </configuration>
>
> core-site.xml
>
> <?xml version="1.0"?>
> <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
> <!-- Put site-specific property overrides in this file. -->
> <configuration>
>          <property>
>                 <name>fs.default.name</name>
>                 <value>hdfs://A:9000</value>
>         </property>
> </configuration>
>
>
> In A and B, I do have a file named ‘slaves’ with an entry ‘B’ in it and
> another file called ‘masters’ wherein an entry ‘A’ is there.
>
> I have kept my input file at A. I see the map method process the input
> file line by line but they are all processed in A. Ideally, I would expect
> those processing to take place in B.
>
> Can anyone highlight where I am going wrong?
>
>  regards
> rab
>
>
>


RE: running map tasks in remote node

Posted by java8964 java8964 <ja...@hotmail.com>.
It is possible to do what you are trying to do, but it only makes sense if your MR job is very CPU intensive and you want to use the CPU resources of your cluster rather than its IO.
You may want to do some research on the role HDFS plays in Hadoop. First and foremost, it provides central storage for all the files that will be processed by MR jobs. If you don't want to use HDFS, you need to identify a shared storage that all the nodes in your cluster can reach. HDFS is NOT required, but shared storage of some kind is required in the cluster.
To simplify your question, let's just use NFS to replace HDFS. It is possible, as a POC, to help you understand how to set it up.
Assume you have a cluster with 3 nodes (one NN, two DN; the JT runs on the NN and a TT on each DN). Instead of using HDFS, you can try to use NFS this way:
1) Mount /share_data on both of your data nodes. They need to have the same mount, so /share_data on each data node points to the same NFS location. It doesn't matter where you host this NFS share, just make sure each data node mounts it as the same /share_data.
2) Create a folder under /share_data and put all your data into that folder.
3) When you kick off your MR job, give a full URL for the input path, like 'file:///share_data/myfolder', and a full URL for the output path, like 'file:///share_data/output'. Each mapper will then read the data from the local file system instead of HDFS. That's the reason you want to make sure each task node has the same mount path, so that 'file:///share_data/myfolder' works fine on every task node. Check this and make sure that /share_data/myfolder points to the same path on each of your task nodes. (A short sketch of this step in code follows right after this list.)
4) You want each mapper to process one file, so instead of the default 'TextInputFormat', use a 'WholeFileInputFormat'. This makes sure that every file under '/share_data/myfolder' won't be split and will be sent whole to a single mapper. (A sketch of one way to write such a class appears further down, after the quoted earlier reply.)
5) With the above setup, I don't think you need to start the NameNode or DataNode processes any more; you only use the JobTracker and TaskTracker.
6) Obviously, when your data is big, the NFS share will be your bottleneck. Maybe you can replace it with shared network storage, but the above setup gives you a starting point.
7) Keep in mind that with a setup like the above you lose Data Replication, Data Locality etc. That's why I said it ONLY makes sense if your MR job is CPU intensive: you simply want to use the Mapper/Reducer tasks to process your data, rather than any IO scalability.
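
A rough sketch of the driver side of step 3, with made-up paths; 'MyMapper' and 'WholeFileInputFormat' here stand for the user-written pieces:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class SharedFolderDriver {
  public static void main(String[] args) throws Exception {
    Job job = new Job(new Configuration(), "process shared folder");
    job.setJarByClass(SharedFolderDriver.class);
    job.setMapperClass(MyMapper.class);           // placeholder for the real map logic
    job.setNumReduceTasks(0);

    // Step 4: one whole file per mapper (user-written input format).
    job.setInputFormatClass(WholeFileInputFormat.class);

    // Step 3: full file:// URLs, so every task reads the shared mount, not HDFS.
    FileInputFormat.addInputPath(job, new Path("file:///share_data/myfolder"));
    FileOutputFormat.setOutputPath(job, new Path("file:///share_data/output"));

    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
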
Make sense?
Yong

Date: Fri, 23 Aug 2013 15:43:38 +0530
Subject: Re: running map tasks in remote node
From: rabmdu@gmail.com
To: user@hadoop.apache.org

Thanks for the reply. 
I am basically exploring possible ways to work with hadoop framework for one of my use case. I have my limitations in using hdfs but agree with the fact that using map reduce in conjunction with hdfs makes sense.  

I successfully tested wholeFileInputFormat by some googling. 
Now, coming to my use case. I would like to keep some files in my master node and want to do some processing in the cloud nodes. The policy does not allow us to configure and use cloud nodes as HDFS.  However, I would like to span a map process in those nodes. Hence, I set input path as local file system, for example, $HOME/inputs. I have a file listing filenames (10 lines) in this input directory.  I use NLineInputFormat and span 10 map process. Each map process gets a line. The map process will then do a file transfer and process it.  However, I get an error in the map saying that the FileNotFoundException $HOME/inputs. I am sure this directory is present in my master but not in the slave nodes. When I copy this input directory to slave nodes, it works fine. I am not able to figure out how to fix this and the reason for the error. I am not understand why it complains about the input directory is not present. As far as I know, slave nodes get a map and map method contains contents of the input file. This should be fine for the map logic to work.


with regards
rabmdu



On Thu, Aug 22, 2013 at 4:40 PM, java8964 java8964 <ja...@hotmail.com> wrote:




If you don't plan to use HDFS, what kind of shared file system are you going to use between the cluster nodes? NFS?
For what you want to do, even though it doesn't make too much sense, you first need to solve the problem of the shared file system.

Second, if you want to process the files file by file, instead of block by block in HDFS, then you need to use the WholeFileInputFormat (google this how to write one). So you don't need a file to list all the files to be processed, just put them into one folder in the sharing file system, then send this folder to your MR job. In this way, as long as each node can access it through some file system URL, each file will be processed in each mapper.
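
One common way to write such a WholeFileInputFormat with the org.apache.hadoop.mapreduce API is sketched below; treat it as a starting point rather than a tested class:

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

// Each input file becomes exactly one record: the file is never split, and
// its full contents are handed to a single mapper as a BytesWritable value.
public class WholeFileInputFormat extends FileInputFormat<NullWritable, BytesWritable> {

  @Override
  protected boolean isSplitable(JobContext context, Path file) {
    return false;                       // never split a file across mappers
  }

  @Override
  public RecordReader<NullWritable, BytesWritable> createRecordReader(
      InputSplit split, TaskAttemptContext context) throws IOException, InterruptedException {
    return new WholeFileRecordReader();
  }

  static class WholeFileRecordReader extends RecordReader<NullWritable, BytesWritable> {
    private FileSplit fileSplit;
    private Configuration conf;
    private final BytesWritable value = new BytesWritable();
    private boolean processed = false;

    @Override
    public void initialize(InputSplit split, TaskAttemptContext context) {
      this.fileSplit = (FileSplit) split;
      this.conf = context.getConfiguration();
    }

    @Override
    public boolean nextKeyValue() throws IOException {
      if (processed) {
        return false;
      }
      // Read the entire file into the value in one go.
      byte[] contents = new byte[(int) fileSplit.getLength()];
      Path file = fileSplit.getPath();
      FileSystem fs = file.getFileSystem(conf);
      FSDataInputStream in = null;
      try {
        in = fs.open(file);
        IOUtils.readFully(in, contents, 0, contents.length);
        value.set(contents, 0, contents.length);
      } finally {
        IOUtils.closeStream(in);
      }
      processed = true;
      return true;
    }

    @Override public NullWritable getCurrentKey() { return NullWritable.get(); }
    @Override public BytesWritable getCurrentValue() { return value; }
    @Override public float getProgress() { return processed ? 1.0f : 0.0f; }
    @Override public void close() { }
  }
}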

Yong

Date: Wed, 21 Aug 2013 17:39:10 +0530
Subject: running map tasks in remote node
From: rabmdu@gmail.com
To: user@hadoop.apache.org


Hello, 

Here is the new bie question of the day.

For one of my use cases, I want to use hadoop map reduce without HDFS. Here, I will have a text file containing a list of file names to process. Assume that I have 10 lines (10 files to process) in the input text file and I wish to generate 10 map tasks and execute them in parallel in 10 nodes. I started with basic tutorial on hadoop and could setup single node hadoop cluster and successfully tested wordcount code.

Now, I took two machines A (master) and B (slave). I did the below configuration in these machines to setup a two node cluster.

hdfs-site.xml

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
<property>
          <name>dfs.replication</name>
          <value>1</value>
</property>
<property>
  <name>dfs.name.dir</name>
  <value>/tmp/hadoop-bala/dfs/name</value>
</property>
<property>
  <name>dfs.data.dir</name>
  <value>/tmp/hadoop-bala/dfs/data</value>
</property>
<property>
     <name>mapred.job.tracker</name>
    <value>A:9001</value>
</property>

</configuration>

mapred-site.xml

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<!-- Put site-specific property overrides in this file. -->

<configuration>
<property>
            <name>mapred.job.tracker</name>
            <value>A:9001</value>
</property>
<property>
          <name>mapreduce.tasktracker.map.tasks.maximum</name>
           <value>1</value>
</property>
</configuration>

core-site.xml

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
         <property>
                <name>fs.default.name</name>
                <value>hdfs://A:9000</value>
        </property>
</configuration>


In A and B, I do have a file named ‘slaves’ with an entry ‘B’ in it and another file called ‘masters’ wherein an entry ‘A’ is there.

I have kept my input file at A. I see the map method process the input file line by line but they are all processed in A. Ideally, I would expect those processing to take place in B.

Can anyone highlight where I am going wrong?

regards
rab

Re: running map tasks in remote node

Posted by Shahab Yunus <sh...@gmail.com>.
You say:
"Each map process gets a line. The map process will then do a file transfer
and process it.  "

What file, from where to where is being transferred in the map? Are you
sure that the mappers are not complaining about 'this' file access? Because
this seem to be separate from the initial data input that each mapper gets
(basically your understanding "map method contains contents of the input
file")

Regards,
Shahab


On Fri, Aug 23, 2013 at 6:13 AM, rab ra <ra...@gmail.com> wrote:

> Thanks for the reply.
>
> I am basically exploring possible ways to work with hadoop framework for
> one of my use case. I have my limitations in using hdfs but agree with the
> fact that using map reduce in conjunction with hdfs makes sense.
>
> I successfully tested wholeFileInputFormat by some googling.
>
> Now, coming to my use case. I would like to keep some files in my master
> node and want to do some processing in the cloud nodes. The policy does not
> allow us to configure and use cloud nodes as HDFS.  However, I would like
> to span a map process in those nodes. Hence, I set input path as local file
> system, for example, $HOME/inputs. I have a file listing filenames (10
> lines) in this input directory.  I use NLineInputFormat and span 10 map
> process. Each map process gets a line. The map process will then do a file
> transfer and process it.  However, I get an error in the map saying that
> the FileNotFoundException $HOME/inputs. I am sure this directory is present
> in my master but not in the slave nodes. When I copy this input directory
> to slave nodes, it works fine. I am not able to figure out how to fix this
> and the reason for the error. I am not understand why it complains about
> the input directory is not present. As far as I know, slave nodes get a map
> and map method contains contents of the input file. This should be fine for
> the map logic to work.
>
>
> with regards
> rabmdu
>
>
>
>
> On Thu, Aug 22, 2013 at 4:40 PM, java8964 java8964 <ja...@hotmail.com>wrote:
>
>> If you don't plan to use HDFS, what kind of sharing file system you are
>> going to use between cluster? NFS?
>> For what you want to do, even though it doesn't make too much sense, but
>> you need to the first problem as the shared file system.
>>
>> Second, if you want to process the files file by file, instead of block
>> by block in HDFS, then you need to use the WholeFileInputFormat (google
>> this how to write one). So you don't need a file to list all the files to
>> be processed, just put them into one folder in the sharing file system,
>> then send this folder to your MR job. In this way, as long as each node can
>> access it through some file system URL, each file will be processed in each
>> mapper.
>>
>> Yong
>>
>> ------------------------------
>> Date: Wed, 21 Aug 2013 17:39:10 +0530
>> Subject: running map tasks in remote node
>> From: rabmdu@gmail.com
>> To: user@hadoop.apache.org
>>
>>
>> Hello,
>>
>>  Here is the new bie question of the day.
>>
>> For one of my use cases, I want to use hadoop map reduce without HDFS.
>> Here, I will have a text file containing a list of file names to process.
>> Assume that I have 10 lines (10 files to process) in the input text file
>> and I wish to generate 10 map tasks and execute them in parallel in 10
>> nodes. I started with basic tutorial on hadoop and could setup single node
>> hadoop cluster and successfully tested wordcount code.
>>
>> Now, I took two machines A (master) and B (slave). I did the below
>> configuration in these machines to setup a two node cluster.
>>
>> hdfs-site.xml
>>
>> <?xml version="1.0"?>
>> <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
>> <!-- Put site-specific property overrides in this file. -->
>> <configuration>
>> <property>
>>           <name>dfs.replication</name>
>>           <value>1</value>
>> </property>
>> <property>
>>   <name>dfs.name.dir</name>
>>   <value>/tmp/hadoop-bala/dfs/name</value>
>> </property>
>> <property>
>>   <name>dfs.data.dir</name>
>>   <value>/tmp/hadoop-bala/dfs/data</value>
>> </property>
>> <property>
>>      <name>mapred.job.tracker</name>
>>     <value>A:9001</value>
>> </property>
>>
>> </configuration>
>>
>> mapred-site.xml
>>
>> <?xml version="1.0"?>
>> <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
>>
>> <!-- Put site-specific property overrides in this file. -->
>>
>> <configuration>
>> <property>
>>             <name>mapred.job.tracker</name>
>>             <value>A:9001</value>
>> </property>
>> <property>
>>           <name>mapreduce.tasktracker.map.tasks.maximum</name>
>>            <value>1</value>
>> </property>
>> </configuration>
>>
>> core-site.xml
>>
>> <?xml version="1.0"?>
>> <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
>> <!-- Put site-specific property overrides in this file. -->
>> <configuration>
>>          <property>
>>                 <name>fs.default.name</name>
>>                 <value>hdfs://A:9000</value>
>>         </property>
>> </configuration>
>>
>>
>> In A and B, I do have a file named ‘slaves’ with an entry ‘B’ in it and
>> another file called ‘masters’ wherein an entry ‘A’ is there.
>>
>> I have kept my input file at A. I see the map method process the input
>> file line by line but they are all processed in A. Ideally, I would expect
>> those processing to take place in B.
>>
>> Can anyone highlight where I am going wrong?
>>
>>  regards
>> rab
>>
>
>


RE: running map tasks in remote node

Posted by java8964 java8964 <ja...@hotmail.com>.
It is possible to do what you are trying to do, but only make sense if your MR job is very CPU intensive, and you want to use the CPU resource in your cluster, instead of the IO.
You may want to do some research about what is the HDFS's role in Hadoop. First but not least, it provides a central storage for all the files will be processed by MR jobs. If you don't want to use HDFS, so you need to  identify a share storage to be shared among all the nodes in your cluster. HDFS is NOT required, but a shared storage is required in the cluster.
For simply your question, let's just use NFS to replace HDFS. It is possible for a POC to help you understand how to set it up.
Assume your have a cluster with 3 nodes (one NN, two DN. The JT running on NN, and TT running on DN). So instead of using HDFS, you can try to use NFS by this way:
1) Mount /share_data in all of your 2 data nodes. They need to have the same mount. So /share_data in each data node point to the same NFS location. It doesn't matter where you host this NFS share, but just make sure each data node mount it as the same /share_data2) Create a folder under /share_data, put all your data into that folder.3) When kick off your MR jobs, you need to give a full URL of the input path, like 'file:///shared_data/myfolder', also a full URL of the output path, like 'file:///shared_data/output'. In this way, each mapper will understand that in fact they will run the data from local file system, instead of HDFS. That's the reason you want to make sure each task node has the same mount path, as 'file:///shared_data/myfolder' should work fine for each  task node. Check this and make sure that /share_data/myfolder all point to the same path in each of your task node.4) You want each mapper to process one file, so instead of using the default 'TextInputFormat', use a 'WholeFileInputFormat', this will make sure that every file under '/share_data/myfolder' won't be split and sent to the same mapper processor. 5) In the above set up, I don't think you need to start NameNode or DataNode process any more, anyway you just use JobTracker and TaskTracker.6) Obviously when your data is big, the NFS share will be your bottleneck. So maybe you can replace it with Share Network Storage, but above set up gives you a start point.7) Keep in mind when set up like above, you lost the Data Replication, Data Locality etc, that's why I said it ONLY makes sense if your MR job is CPU intensive. You simple want to use the Mapper/Reducer tasks to process your data, instead of any scalability of IO.
Make sense?
Yong

Date: Fri, 23 Aug 2013 15:43:38 +0530
Subject: Re: running map tasks in remote node
From: rabmdu@gmail.com
To: user@hadoop.apache.org

Thanks for the reply. 
I am basically exploring possible ways to work with hadoop framework for one of my use case. I have my limitations in using hdfs but agree with the fact that using map reduce in conjunction with hdfs makes sense.  

I successfully tested wholeFileInputFormat by some googling. 
Now, coming to my use case. I would like to keep some files in my master node and want to do some processing in the cloud nodes. The policy does not allow us to configure and use cloud nodes as HDFS.  However, I would like to span a map process in those nodes. Hence, I set input path as local file system, for example, $HOME/inputs. I have a file listing filenames (10 lines) in this input directory.  I use NLineInputFormat and span 10 map process. Each map process gets a line. The map process will then do a file transfer and process it.  However, I get an error in the map saying that the FileNotFoundException $HOME/inputs. I am sure this directory is present in my master but not in the slave nodes. When I copy this input directory to slave nodes, it works fine. I am not able to figure out how to fix this and the reason for the error. I am not understand why it complains about the input directory is not present. As far as I know, slave nodes get a map and map method contains contents of the input file. This should be fine for the map logic to work.


with regards
rabmdu



On Thu, Aug 22, 2013 at 4:40 PM, java8964 java8964 <ja...@hotmail.com> wrote:




If you don't plan to use HDFS, what kind of sharing file system you are going to use between cluster? NFS? For what you want to do, even though it doesn't make too much sense, but you need to the first problem as the shared file system.

Second, if you want to process the files file by file, instead of block by block in HDFS, then you need to use the WholeFileInputFormat (google this how to write one). So you don't need a file to list all the files to be processed, just put them into one folder in the sharing file system, then send this folder to your MR job. In this way, as long as each node can access it through some file system URL, each file will be processed in each mapper.

Yong

Date: Wed, 21 Aug 2013 17:39:10 +0530
Subject: running map tasks in remote node
From: rabmdu@gmail.com
To: user@hadoop.apache.org


Hello,

Here is the new bie question of the day. For one of my use cases, I want to use hadoop map reduce without HDFS. Here, I will have a text file containing a list of file names to process. Assume that I have 10 lines (10 files to process) in the input text file and I wish to generate 10 map tasks and execute them in parallel in 10 nodes. I started with basic tutorial on hadoop and could setup single node hadoop cluster and successfully tested wordcount code.

Now, I took two machines A (master) and B (slave). I did the below configuration in these machines to setup a two node cluster.

hdfs-site.xml

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
<property>
          <name>dfs.replication</name>
          <value>1</value>
</property>
<property>
  <name>dfs.name.dir</name>
  <value>/tmp/hadoop-bala/dfs/name</value>
</property>
<property>
  <name>dfs.data.dir</name>
  <value>/tmp/hadoop-bala/dfs/data</value>
</property>
<property>
     <name>mapred.job.tracker</name>
    <value>A:9001</value>
</property>
</configuration>

mapred-site.xml

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
<property>
            <name>mapred.job.tracker</name>
            <value>A:9001</value>
</property>
<property>
          <name>mapreduce.tasktracker.map.tasks.maximum</name>
           <value>1</value>
</property>
</configuration>

core-site.xml

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
         <property>
                <name>fs.default.name</name>
                <value>hdfs://A:9000</value>
        </property>
</configuration>

In A and B, I do have a file named ‘slaves’ with an entry ‘B’ in it and another file called ‘masters’ wherein an entry ‘A’ is there.

I have kept my input file at A. I see the map method process the input file line by line but they are all processed in A. Ideally, I would expect those processing to take place in B.

Can anyone highlight where I am going wrong?

regards
rab


Re: running map tasks in remote node

Posted by Shahab Yunus <sh...@gmail.com>.
You say:
"Each map process gets a line. The map process will then do a file transfer
and process it.  "

Which file is being transferred in the map, and from where to where? Are you
sure the mappers are not complaining about access to 'this' file? It seems to
be separate from the initial data input that each mapper gets (basically your
understanding that the "map method contains contents of the input file").

Regards,
Shahab


On Fri, Aug 23, 2013 at 6:13 AM, rab ra <ra...@gmail.com> wrote:

> Thanks for the reply.
>
> I am basically exploring possible ways to work with hadoop framework for
> one of my use case. I have my limitations in using hdfs but agree with the
> fact that using map reduce in conjunction with hdfs makes sense.
>
> I successfully tested wholeFileInputFormat by some googling.
>
> Now, coming to my use case. I would like to keep some files in my master
> node and want to do some processing in the cloud nodes. The policy does not
> allow us to configure and use cloud nodes as HDFS.  However, I would like
> to span a map process in those nodes. Hence, I set input path as local file
> system, for example, $HOME/inputs. I have a file listing filenames (10
> lines) in this input directory.  I use NLineInputFormat and span 10 map
> process. Each map process gets a line. The map process will then do a file
> transfer and process it.  However, I get an error in the map saying that
> the FileNotFoundException $HOME/inputs. I am sure this directory is present
> in my master but not in the slave nodes. When I copy this input directory
> to slave nodes, it works fine. I am not able to figure out how to fix this
> and the reason for the error. I am not understand why it complains about
> the input directory is not present. As far as I know, slave nodes get a map
> and map method contains contents of the input file. This should be fine for
> the map logic to work.
>
>
> with regards
> rabmdu
>
>
>
>
> On Thu, Aug 22, 2013 at 4:40 PM, java8964 java8964 <ja...@hotmail.com>wrote:
>
>> If you don't plan to use HDFS, what kind of sharing file system you are
>> going to use between cluster? NFS?
>> For what you want to do, even though it doesn't make too much sense, but
>> you need to the first problem as the shared file system.
>>
>> Second, if you want to process the files file by file, instead of block
>> by block in HDFS, then you need to use the WholeFileInputFormat (google
>> this how to write one). So you don't need a file to list all the files to
>> be processed, just put them into one folder in the sharing file system,
>> then send this folder to your MR job. In this way, as long as each node can
>> access it through some file system URL, each file will be processed in each
>> mapper.
>>
>> Yong
>>
>> ------------------------------
>> Date: Wed, 21 Aug 2013 17:39:10 +0530
>> Subject: running map tasks in remote node
>> From: rabmdu@gmail.com
>> To: user@hadoop.apache.org
>>
>>
>> Hello,
>>
>>  Here is the new bie question of the day.
>>
>> For one of my use cases, I want to use hadoop map reduce without HDFS.
>> Here, I will have a text file containing a list of file names to process.
>> Assume that I have 10 lines (10 files to process) in the input text file
>> and I wish to generate 10 map tasks and execute them in parallel in 10
>> nodes. I started with basic tutorial on hadoop and could setup single node
>> hadoop cluster and successfully tested wordcount code.
>>
>> Now, I took two machines A (master) and B (slave). I did the below
>> configuration in these machines to setup a two node cluster.
>>
>> hdfs-site.xml
>>
>> <?xml version="1.0"?>
>> <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
>> <!-- Put site-specific property overrides in this file. -->
>> <configuration>
>> <property>
>>           <name>dfs.replication</name>
>>           <value>1</value>
>> </property>
>> <property>
>>   <name>dfs.name.dir</name>
>>   <value>/tmp/hadoop-bala/dfs/name</value>
>> </property>
>> <property>
>>   <name>dfs.data.dir</name>
>>   <value>/tmp/hadoop-bala/dfs/data</value>
>> </property>
>> <property>
>>      <name>mapred.job.tracker</name>
>>     <value>A:9001</value>
>> </property>
>>
>> </configuration>
>>
>> mapred-site.xml
>>
>> <?xml version="1.0"?>
>> <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
>>
>> <!-- Put site-specific property overrides in this file. -->
>>
>> <configuration>
>> <property>
>>             <name>mapred.job.tracker</name>
>>             <value>A:9001</value>
>> </property>
>> <property>
>>           <name>mapreduce.tasktracker.map.tasks.maximum</name>
>>            <value>1</value>
>> </property>
>> </configuration>
>>
>> core-site.xml
>>
>> <?xml version="1.0"?>
>> <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
>> <!-- Put site-specific property overrides in this file. -->
>> <configuration>
>>          <property>
>>                 <name>fs.default.name</name>
>>                 <value>hdfs://A:9000</value>
>>         </property>
>> </configuration>
>>
>>
>> In A and B, I do have a file named ‘slaves’ with an entry ‘B’ in it and
>> another file called ‘masters’ wherein an entry ‘A’ is there.
>>
>> I have kept my input file at A. I see the map method process the input
>> file line by line but they are all processed in A. Ideally, I would expect
>> those processing to take place in B.
>>
>> Can anyone highlight where I am going wrong?
>>
>>  regards
>> rab
>>
>
>


Re: running map tasks in remote node

Posted by rab ra <ra...@gmail.com>.
Thanks for the reply.

I am basically exploring possible ways to work with the hadoop framework for
one of my use cases. I have limitations on using hdfs, but I agree that using
map reduce in conjunction with hdfs makes sense.

I successfully tested a WholeFileInputFormat after some googling.

Now, coming to my use case. I would like to keep some files on my master
node and do some processing on the cloud nodes. Policy does not allow us to
configure and use the cloud nodes as HDFS. However, I would still like to
spawn map processes on those nodes. Hence, I set the input path to the local
file system, for example $HOME/inputs, which contains a file listing 10
filenames (one per line). I use NLineInputFormat to spawn 10 map processes,
and each map process gets one line. The map process then transfers the named
file and processes it. However, the maps fail with a FileNotFoundException
for $HOME/inputs. This directory is present on my master but not on the
slave nodes; when I copy the input directory to the slave nodes, the job
works fine. I am not able to figure out how to fix this, or why it complains
that the input directory is not present. As far as I know, a slave node just
gets a map task whose map method receives the contents of the input file,
and that should be enough for the map logic to work.
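
For clarity, here is a minimal sketch of the kind of driver and mapper I mean. It assumes the new-API NLineInputFormat is available in my Hadoop version; the output path is a placeholder and the actual file transfer/processing is only marked by a comment:

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.NLineInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class FileListDriver {

  public static class FileNameMapper
      extends Mapper<LongWritable, Text, Text, NullWritable> {
    @Override
    protected void map(LongWritable offset, Text line, Context context)
        throws IOException, InterruptedException {
      String fileName = line.toString().trim();  // one listed file name per map task
      // The real mapper would transfer fileName to this node and process it here.
      context.write(new Text("processed " + fileName), NullWritable.get());
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = new Job(new Configuration(), "one map task per listed file");
    job.setJarByClass(FileListDriver.class);
    // NLineInputFormat defaults to one line per split, so a 10-line listing
    // file produces 10 map tasks.
    job.setInputFormatClass(NLineInputFormat.class);
    job.setMapperClass(FileNameMapper.class);
    job.setNumReduceTasks(0);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(NullWritable.class);
    // The listing sits on the local file system of the node submitting the job;
    // this is the path the slave nodes currently cannot find.
    FileInputFormat.setInputPaths(job, new Path("file://" + System.getenv("HOME") + "/inputs"));
    FileOutputFormat.setOutputPath(job, new Path("file://" + System.getenv("HOME") + "/outputs"));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
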


with regards
rabmdu




On Thu, Aug 22, 2013 at 4:40 PM, java8964 java8964 <ja...@hotmail.com>wrote:

> If you don't plan to use HDFS, what kind of sharing file system you are
> going to use between cluster? NFS?
> For what you want to do, even though it doesn't make too much sense, but
> you need to the first problem as the shared file system.
>
> Second, if you want to process the files file by file, instead of block by
> block in HDFS, then you need to use the WholeFileInputFormat (google this
> how to write one). So you don't need a file to list all the files to be
> processed, just put them into one folder in the sharing file system, then
> send this folder to your MR job. In this way, as long as each node can
> access it through some file system URL, each file will be processed in each
> mapper.
>
> Yong
>
> ------------------------------
> Date: Wed, 21 Aug 2013 17:39:10 +0530
> Subject: running map tasks in remote node
> From: rabmdu@gmail.com
> To: user@hadoop.apache.org
>
>
> Hello,
>
> Here is the new bie question of the day.
>
> For one of my use cases, I want to use hadoop map reduce without HDFS.
> Here, I will have a text file containing a list of file names to process.
> Assume that I have 10 lines (10 files to process) in the input text file
> and I wish to generate 10 map tasks and execute them in parallel in 10
> nodes. I started with basic tutorial on hadoop and could setup single node
> hadoop cluster and successfully tested wordcount code.
>
> Now, I took two machines A (master) and B (slave). I did the below
> configuration in these machines to setup a two node cluster.
>
> hdfs-site.xml
>
> <?xml version="1.0"?>
> <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
> <!-- Put site-specific property overrides in this file. -->
> <configuration>
> <property>
>           <name>dfs.replication</name>
>           <value>1</value>
> </property>
> <property>
>   <name>dfs.name.dir</name>
>   <value>/tmp/hadoop-bala/dfs/name</value>
> </property>
> <property>
>   <name>dfs.data.dir</name>
>   <value>/tmp/hadoop-bala/dfs/data</value>
> </property>
> <property>
>      <name>mapred.job.tracker</name>
>     <value>A:9001</value>
> </property>
>
> </configuration>
>
> mapred-site.xml
>
> <?xml version="1.0"?>
> <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
>
> <!-- Put site-specific property overrides in this file. -->
>
> <configuration>
> <property>
>             <name>mapred.job.tracker</name>
>             <value>A:9001</value>
> </property>
> <property>
>           <name>mapreduce.tasktracker.map.tasks.maximum</name>
>            <value>1</value>
> </property>
> </configuration>
>
> core-site.xml
>
> <?xml version="1.0"?>
> <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
> <!-- Put site-specific property overrides in this file. -->
> <configuration>
>          <property>
>                 <name>fs.default.name</name>
>                 <value>hdfs://A:9000</value>
>         </property>
> </configuration>
>
>
> In A and B, I do have a file named ‘slaves’ with an entry ‘B’ in it and
> another file called ‘masters’ wherein an entry ‘A’ is there.
>
> I have kept my input file at A. I see the map method process the input
> file line by line but they are all processed in A. Ideally, I would expect
> those processing to take place in B.
>
> Can anyone highlight where I am going wrong?
>
>  regards
> rab
>




RE: running map tasks in remote node

Posted by java8964 java8964 <ja...@hotmail.com>.
If you don't plan to use HDFS, what kind of shared file system are you going to use across the cluster? NFS? For what you want to do, even though it doesn't make too much sense to me, the first problem you need to solve is that shared file system.
Second, if you want to process the data file by file instead of block by block as in HDFS, you need a WholeFileInputFormat (google how to write one). Then you don't need a file listing all the files to be processed: just put them into one folder on the shared file system and pass that folder to your MR job. As long as each node can access the folder through some file system URL, each file will be processed by its own mapper.
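
For reference, a minimal sketch of such a WholeFileInputFormat, following the common pattern of an unsplittable FileInputFormat plus a record reader that emits the whole file as one value (a sketch of the usual pattern, not code from any shipped library):

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

public class WholeFileInputFormat
    extends FileInputFormat<NullWritable, BytesWritable> {

  @Override
  protected boolean isSplitable(JobContext context, Path file) {
    return false;  // never split: one file goes to one mapper
  }

  @Override
  public RecordReader<NullWritable, BytesWritable> createRecordReader(
      InputSplit split, TaskAttemptContext context) {
    return new WholeFileRecordReader();
  }

  // Emits exactly one record: the full contents of the file backing the split.
  private static class WholeFileRecordReader
      extends RecordReader<NullWritable, BytesWritable> {

    private FileSplit fileSplit;
    private Configuration conf;
    private final BytesWritable value = new BytesWritable();
    private boolean processed = false;

    @Override
    public void initialize(InputSplit split, TaskAttemptContext context) {
      this.fileSplit = (FileSplit) split;
      this.conf = context.getConfiguration();
    }

    @Override
    public boolean nextKeyValue() throws IOException {
      if (processed) {
        return false;
      }
      byte[] contents = new byte[(int) fileSplit.getLength()];
      Path file = fileSplit.getPath();
      FileSystem fs = file.getFileSystem(conf);
      FSDataInputStream in = null;
      try {
        in = fs.open(file);
        IOUtils.readFully(in, contents, 0, contents.length);
        value.set(contents, 0, contents.length);
      } finally {
        IOUtils.closeStream(in);
      }
      processed = true;
      return true;
    }

    @Override
    public NullWritable getCurrentKey() { return NullWritable.get(); }

    @Override
    public BytesWritable getCurrentValue() { return value; }

    @Override
    public float getProgress() { return processed ? 1.0f : 0.0f; }

    @Override
    public void close() {
      // nothing to close here; the stream is closed in nextKeyValue()
    }
  }
}

Each mapper then receives one (NullWritable, BytesWritable) pair per file.
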
Yong

Date: Wed, 21 Aug 2013 17:39:10 +0530
Subject: running map tasks in remote node
From: rabmdu@gmail.com
To: user@hadoop.apache.org

Hello,

Here is the new bie question of the day. For one of my use cases, I want to use hadoop map reduce without HDFS. Here, I will have a text file containing a list of file names to process. Assume that I have 10 lines (10 files to process) in the input text file and I wish to generate 10 map tasks and execute them in parallel in 10 nodes. I started with basic tutorial on hadoop and could setup single node hadoop cluster and successfully tested wordcount code.

Now, I took two machines A (master) and B (slave). I did the below configuration in these machines to setup a two node cluster.

hdfs-site.xml

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
<property>
          <name>dfs.replication</name>
          <value>1</value>
</property>
<property>
  <name>dfs.name.dir</name>
  <value>/tmp/hadoop-bala/dfs/name</value>
</property>
<property>
  <name>dfs.data.dir</name>
  <value>/tmp/hadoop-bala/dfs/data</value>
</property>
<property>
     <name>mapred.job.tracker</name>
    <value>A:9001</value>
</property>
</configuration>

mapred-site.xml

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
<property>
            <name>mapred.job.tracker</name>
            <value>A:9001</value>
</property>
<property>
          <name>mapreduce.tasktracker.map.tasks.maximum</name>
           <value>1</value>
</property>
</configuration>

core-site.xml

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
         <property>
                <name>fs.default.name</name>
                <value>hdfs://A:9000</value>
        </property>
</configuration>

In A and B, I do have a file named ‘slaves’ with an entry ‘B’ in it and another file called ‘masters’ wherein an entry ‘A’ is there.

I have kept my input file at A. I see the map method process the input file line by line but they are all processed in A. Ideally, I would expect those processing to take place in B.

Can anyone highlight where I am going wrong?

regards
rab


