Posted to user@hadoop.apache.org by "Kartashov, Andy" <An...@mpac.ca> on 2012/10/24 16:23:27 UTC

question on FileInputFormat.addInputPath and data access

Gents,
Two questions:

1.       Say you have 5 folders with input data (fold1, fold2, fold3, ..., fold5) in your HDFS on a pseudo-distributed cluster.
You would write your MR job to access the files by listing them all:
FileInputFormat.addInputPaths(job, "fold1,fold2,fold3,...,fold5");
Q: Is there a way to move the above folders into a parent folder, say "the_folder", so that the directory structure becomes the_folder/fold1, the_folder/fold2, ... Would it then be possible to access the files with something like FileInputFormat.addInputPaths(job, "the_folder/*") or similar?
I am asking in case the list of input folders grows too long. How do you curb that?

2.       Hypothetically speaking, in a fully-distributed cluster your data folders are located as follows: Node1: (fold1, fold2, fold3) and Node2: (fold4, fold5)

Q: Do we change the command below, or will the NN and JT take care of locating those files?
FileInputFormat.addInputPaths(job, "fold1,fold2,fold3,...,fold5");
     2a.     Using the balancer, which splits/moves data across the additional DNs listed in conf/slaves, is it possible to run an "hdfs dfs -ls -R" command on the slave node that runs a DN on a separate machine? I have

Cheers,

AK

NOTICE: This e-mail message and any attachments are confidential, subject to copyright and may be privileged. Any unauthorized use, copying or disclosure is prohibited. If you are not the intended recipient, please delete and contact the sender immediately. Please consider the environment before printing this e-mail. AVIS : le présent courriel et toute pièce jointe qui l'accompagne sont confidentiels, protégés par le droit d'auteur et peuvent être couverts par le secret professionnel. Toute utilisation, copie ou divulgation non autorisée est interdite. Si vous n'êtes pas le destinataire prévu de ce courriel, supprimez-le et contactez immédiatement l'expéditeur. Veuillez penser à l'environnement avant d'imprimer le présent courriel

Re: question on FileInputFormat.addInputPath and data access

Posted by Harsh J <ha...@cloudera.com>.
Hi Andy,

Inline.

On Wed, Oct 24, 2012 at 7:53 PM, Kartashov, Andy <An...@mpac.ca> wrote:
> Gents,
>
> Two questions:
>
> 1.       Say you have 5 folders with input data
> (fold1, fold2, fold3, ..., fold5) in your HDFS on a pseudo-distributed cluster.
>
> You would write your MR job to access the files by listing them all:
>
> FileInputFormat.addInputPaths(job, "fold1,fold2,fold3,...,fold5");
>
> Q: Is there a way to move the above folders into a parent folder, say
> "the_folder", so that the directory structure becomes the_folder/fold1,
> the_folder/fold2, ... Would it then be possible to access the files with
> something like FileInputFormat.addInputPaths(job, "the_folder/*") or similar?
>
> I am asking in case the list of input folders grows too long. How do you
> curb that?

Yes, the FileInputFormat.addInputPath(...) API [1] supports glob
patterns and you can pass it a Path object of "the_folder/*/*" or so.

[1] - http://hadoop.apache.org/docs/current/api/org/apache/hadoop/mapred/FileInputFormat.html#addInputPath(org.apache.hadoop.mapred.JobConf,%20org.apache.hadoop.fs.Path)
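The glob behaviour can be sketched with a plain local-filesystem analogy (this uses java.nio globbing, not the HDFS API itself; the class name and directory layout below are made up for illustration):

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class GlobDemo {
    // Build the_folder/fold1..fold5 under a temp dir, then return the
    // directory names matched by the glob pattern "fold*".
    static List<String> matchedFolders() throws IOException {
        Path base = Files.createTempDirectory("demo");
        for (int i = 1; i <= 5; i++) {
            Files.createDirectories(base.resolve("the_folder").resolve("fold" + i));
        }
        List<String> names = new ArrayList<>();
        try (DirectoryStream<Path> ds =
                 Files.newDirectoryStream(base.resolve("the_folder"), "fold*")) {
            for (Path p : ds) {
                names.add(p.getFileName().toString());
            }
        }
        Collections.sort(names);
        return names;
    }

    public static void main(String[] args) throws IOException {
        // The single pattern matches all five folders, so the job driver
        // never needs to enumerate them one by one.
        System.out.println(matchedFolders());  // [fold1, fold2, fold3, fold4, fold5]
    }
}
```

On the cluster itself, the equivalent would be passing a glob such as new Path("the_folder/*") to FileInputFormat.addInputPath, as described above.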

> 2.       Hypothetically speaking, in a fully-distributed cluster your data
> folders are located as follows: Node1: (fold1, fold2, fold3) and
> Node2: (fold4, fold5)
>
> Q: Do we change the command below, or will the NN and JT take care of
> locating those files?
>
> FileInputFormat.addInputPaths(job, "fold1,fold2,fold3,...,fold5");

JT and NN take care of data locality for you. You need not worry about
that (manually) at all.

>      2a.     Using the balancer, which splits/moves data across the
> additional DNs listed in conf/slaves, is it possible to run an "hdfs dfs
> -ls -R" command on the slave node that runs a DN on a separate machine? I
> have

Yes, you can run regular HDFS client operations (such as ls, cat, or job
submission) from any machine, regardless of whether that machine is a
slave or master node. How a client accesses the cluster is not tied to
the node's role.

> Cheers,
>
> AK



-- 
Harsh J
