Posted to mapreduce-user@hadoop.apache.org by 胡斐 <hu...@gmail.com> on 2014/12/01 17:38:01 UTC

Custom FileInputFormat.class

Hi,

I want to write a custom FileInputFormat. To determine which host a
specific part of a file belongs to, I need to open the file in HDFS and
read some information from it. Opening one file and reading that
information takes nearly 500 ms. I have thousands of files to process, so
handling all of them this way would take a long time.

Is there any better solution to reduce the time when the number of files is
large?

Thanks in advance!
Fei

Re: Custom FileInputFormat.class

Posted by Pradeep Gollakota <pr...@gmail.com>.
Can you expand on your use case a little bit please? It may be that you're
duplicating functionality.

You can take a look at CombineFileInputFormat for inspiration. If this is
indeed taking a long time, one thing that is cheap to implement is to
parallelize the calls that get block locations.
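
As a rough sketch of that idea (not code from this thread), the per-file
lookups can be submitted to a thread pool so the ~500 ms waits overlap
instead of running one after another. The class ParallelLookup, the method
lookupAll, and the probe function are hypothetical names used only for
illustration; in a real input format the probe would be whatever opens the
HDFS file and reads the placement information.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

public class ParallelLookup {

    /**
     * Runs probe once per path on a fixed-size thread pool and returns the
     * results in the same order as the input paths.
     *
     * probe is a stand-in (an assumption of this sketch) for the slow
     * per-file step, e.g. opening the file in HDFS and reading the
     * information needed to pick a host.
     */
    public static <T> List<T> lookupAll(List<String> paths,
                                        Function<String, T> probe,
                                        int threads) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            // One task per file; each task performs a single slow lookup.
            List<Callable<T>> tasks = new ArrayList<>();
            for (String path : paths) {
                tasks.add(() -> probe.apply(path));
            }
            // invokeAll blocks until every task is done and returns the
            // futures in the same order as the submitted tasks.
            List<T> results = new ArrayList<>();
            for (Future<T> future : pool.invokeAll(tasks)) {
                // get() rethrows a lookup failure as ExecutionException.
                results.add(future.get());
            }
            return results;
        } finally {
            pool.shutdown();
        }
    }
}
```

Something like this could be called from the input format's getSplits() with
the list of input paths. The pool size is a tuning choice: too many
concurrent opens may simply move the bottleneck to the NameNode or the
DataNodes, so it is worth measuring with a modest number of threads first.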

Another question to ask yourself is whether it is worth optimizing this
portion. In many use cases (certainly mine), the bottleneck is the running
job itself, so the launch overhead is comparatively minimal.

Hope this helps.
Pradeep

On Mon Dec 01 2014 at 8:38:30 AM 胡斐 <hu...@gmail.com> wrote:

> Hi,
>
> I want to custom FileInputFormat.class. In order to determine which host
> the specific part of a file belongs to, I need to open the file in HDFS and
> read some information. It will take me nearly 500ms to open a file and get
> the information I need. But now I have thousands of files to deal with, so
> it would be a long time if I deal with all of them as the above.
>
> Is there any better solution to reduce the time when the number of files
> is large?
>
> Thanks in advance!
> Fei
>
>
