Posted to common-user@hadoop.apache.org by elton sky <el...@gmail.com> on 2010/07/29 16:20:34 UTC

how io.file.buffer.size works?

I think my question was missed, so I am posting it again:

I am a bit confused about how this attribute is used.

My understanding is that it's related to file reads/writes. I can see that in
LineReader.java it's used as the default buffer size for reading each line; in
BlockReader.newBlockReader() it's used as the internal buffer size of the
BufferedInputStream; and in the compression-related classes it's used as the
default buffer size. However, when creating a file (the write path), bufferSize
does not seem to be used at all.

E.g.
DFSClient.DFSOutputStream(
String src, int buffersize, Progressable progress, LocatedBlock lastBlock,
FileStatus stat,int bytesPerChecksum);
It has a buffersize parameter, but it is never used in the method body. In other
words, it's not used for writing at all?

Is this right?
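
For reference, here is a minimal sketch of how the read path typically consumes
this setting; the file path is made up, and the 4096 fallback mirrors the
documented default for io.file.buffer.size:

import java.io.BufferedInputStream;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReadWithConfiguredBuffer {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // io.file.buffer.size defaults to 4096 (core-default.xml)
    int bufferSize = conf.getInt("io.file.buffer.size", 4096);
    FileSystem fs = FileSystem.get(conf);
    // FileSystem.open takes the buffer size explicitly; readers such as
    // LineReader and BlockReader size their internal buffers the same way.
    BufferedInputStream in = new BufferedInputStream(
        fs.open(new Path("/tmp/sample.txt"), bufferSize), bufferSize);
    in.close();
  }
}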

Re: reuse cached files

Posted by Hemanth Yamijala <yh...@gmail.com>.
Hi,

> I am actually doing some tests to measure performance. I want to eliminate the
> interference of the distributed cache. I found there is a method in the API to purge
> the cache. That might be what I want.

So, you want to run multiple versions of a job (possibly with different job
parameters) and compare their performance. Is that correct?

I can think of some options:
- Is it possible not to use the distributed cache at all? You could
bundle the files along with the job jar.
- You could run the job on fresh cluster instances (a more costly
option, nevertheless).
- You could change the timestamps of the distributed cache files on
DFS somehow before each invocation of the job. This will make Hadoop
believe that the files have changed, and the distributed cache will
fetch the files again (see the sketch below).
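
A minimal sketch of that third option, assuming the cache file already lives on
DFS and that FileSystem.setTimes is available in your version (the path
/cache/lookup.dat is hypothetical):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class TouchCacheFile {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    Path cacheFile = new Path("/cache/lookup.dat"); // hypothetical cache file on DFS
    // Bump the modification time (-1 leaves the access time unchanged).
    // A changed timestamp makes the distributed cache treat the file as
    // modified and fetch it again for the next job.
    fs.setTimes(cacheFile, System.currentTimeMillis(), -1);
  }
}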


The purgeCache API you are seeing is very specific to the mapreduce
framework. It is *not* meant to be used by client code and is not
guaranteed to work. In later versions of Hadoop (0.21 and trunk),
these methods have been deprecated in the public API and will be
removed altogether.

Thanks
hemanth

>
> Thanks,
> -Gang
>
>
>
> ----- Original Message ----
> From: Hemanth Yamijala <yh...@gmail.com>
> To: common-user@hadoop.apache.org
> Sent: 2010/8/2 (Mon) 12:56:25 AM
> Subject: Re: reuse cached files
>
> Hi,
>
>> Thanks Hemanth. Is there any way to invalidate the reuse and ask Hadoop to
>> resend exactly the same files to the cache for every job?
>
> I may be able to answer this better if I understand the use case. If
> you need the same files for every job, why would you need to send them
> afresh each time? If something is cached, it can be reused, no? I am
> sure I must be missing something in your requirement...
>
> Thanks
> Hemanth
>
>
>
>
>

Re: reuse cached files

Posted by Gang Luo <lg...@yahoo.com.cn>.
I am actually doing some tests to measure performance. I want to eliminate the
interference of the distributed cache. I found there is a method in the API to purge
the cache. That might be what I want.

Thanks,
-Gang



----- Original Message ----
From: Hemanth Yamijala <yh...@gmail.com>
To: common-user@hadoop.apache.org
Sent: 2010/8/2 (Mon) 12:56:25 AM
Subject: Re: reuse cached files

Hi,

> Thanks Hemanth. Is there any way to invalidate the reuse and ask Hadoop to
> resend exactly the same files to the cache for every job?

I may be able to answer this better if I understand the use case. If
you need the same files for every job, why would you need to send them
afresh each time? If something is cached, it can be reused, no? I am
sure I must be missing something in your requirement...

Thanks
Hemanth



      

Re: reuse cached files

Posted by Hemanth Yamijala <yh...@gmail.com>.
Hi,

> Thanks Hemanth. Is there any way to invalidate the reuse and ask Hadoop to
> resend exactly the same files to the cache for every job?

> I may be able to answer this better if I understand the use case. If
> you need the same files for every job, why would you need to send them
> afresh each time? If something is cached, it can be reused, no? I am
> sure I must be missing something in your requirement...

Thanks
Hemanth

Re: reuse cached files

Posted by Gang Luo <lg...@yahoo.com.cn>.
Thanks Hemanth. Is there any way to invalidate the reuse and ask Hadoop to
resend exactly the same files to the cache for every job?

Thanks,
-Gang



      

Re: reuse cached files

Posted by Hemanth Yamijala <yh...@gmail.com>.
Hi,

> if I use the distributed cache to send some files to all the nodes in one MR job,
> can I reuse these cached files locally in my next job, or will Hadoop re-send
> these files again?

Cache files are reused across jobs. From trunk onwards, they will be
restricted to reuse across jobs of the same user, unless they are
marked 'public', in which case they can be reused by jobs across all
users.
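
For illustration, a minimal sketch of registering such a file with the
distributed cache at job setup time (the HDFS path below is hypothetical):

import java.net.URI;
import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.mapred.JobConf;

public class CacheJobSetup {
  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(CacheJobSetup.class);
    // The file is copied to each task node once; later jobs can reuse the
    // local copy as long as the file on HDFS remains unchanged.
    DistributedCache.addCacheFile(new URI("/cache/lookup.dat"), conf);
    // ... configure mapper/reducer and input/output paths, then submit ...
  }
}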

Thanks
hemanth

reuse cached files

Posted by Gang Luo <lg...@yahoo.com.cn>.
Hi all,
if I use the distributed cache to send some files to all the nodes in one MR job,
can I reuse these cached files locally in my next job, or will Hadoop re-send
these files again?

Thanks,
-Gang