Posted to common-user@hadoop.apache.org by Shimi K <sh...@gmail.com> on 2008/02/10 15:51:40 UTC

Caching frequently map input files

Does Hadoop cache frequently used map input files (LRU/MRU)? Or does it load
files from disk each time a file is needed, even if the same file was
required by the last job on the same node?

I am currently using version 0.14.4

- Shimi

Re: Caching frequently map input files

Posted by Amar Kamat <am...@yahoo-inc.com>.
Shimi K wrote:
> Is Hadoop cache frequently/LRU/MRU map input files? Or does it upload files
> from the disk each time a file is needed no matter if it was the same file
> that was required by the last job on the same node?
>   
Hadoop uses data locality for scheduling the tasks. Most of the hits we 
get are through data locality; for the rest we simply transfer the file 
over the network. If the data is huge then, as Arun mentioned, there is 
no caching. What could be done is to make the copy local and somehow 
inform the namenode, but I suspect the complexity of that handshake and 
its effect on re-balancing would outweigh the benefit. Also, that amount 
of replication would lead to a lot of redundancy and hence wasted space, 
which may be one reason it is not done. If the data size is small then it 
could be cached, but keeping that kind of state adds a lot of complexity 
compared to the current design, where jobs are fairly independent and 
hence easy to handle. It is also cheaper to simply copy such files over 
the network (since they are small and the chance of this happening is 
low). And then there are the questions of how much data you would keep in 
the cache, what policy you would use to free it, and how you would inform 
the JobTracker about cache deletions. It's good to keep it simple.
Amar
> I am currently using version 0.14.4
>
> - Shimi
>
>   


Re: Caching frequently map input files

Posted by Jason Venner <ja...@attributor.com>.
Joydeep's ramfs solution is even simpler :)

Jason Venner wrote:
> I would propose either store the files in hbase, which will keep an 
> active copy available, or replicate the files manually to all of your 
> machines, and have a task that mmaps the file in to shared memory. The 
> mmap can lock the pages in and fault them in to ensure they are resident.
>
> Then have your jobs attach the shared memory, or simply read the files 
> normally.
>
> Shimi K wrote:
>> Is Hadoop cache frequently/LRU/MRU map input files? Or does it upload 
>> files
>> from the disk each time a file is needed no matter if it was the same 
>> file
>> that was required by the last job on the same node?
>>
>> I am currently using version 0.14.4
>>
>> - Shimi
>>
>>   
-- 
Jason Venner
Attributor - Publish with Confidence <http://www.attributor.com/>
Attributor is hiring Hadoop Wranglers, contact if interested

Re: Caching frequently map input files

Posted by Jason Venner <ja...@attributor.com>.
I would propose either storing the files in HBase, which will keep an 
active copy available, or replicating the files manually to all of your 
machines and having a task that mmaps each file into shared memory. The 
mmap can lock the pages and fault them in to ensure they stay resident.

Then have your jobs attach to that shared memory, or simply read the files 
normally.
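
Roughly, the mmap-and-warm part looks like this in Java (a minimal,
untested sketch; MappedByteBuffer.load() faults the pages in, but actually
locking them resident would need mlock via JNI or an OS-level tool):

import java.io.File;
import java.io.RandomAccessFile;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;

public class MmapWarmer {
    // Map a local replica of the file read-only and touch every page so it
    // is sitting in the OS page cache before queries arrive.
    public static MappedByteBuffer mapAndWarm(File f) throws Exception {
        RandomAccessFile raf = new RandomAccessFile(f, "r");
        FileChannel ch = raf.getChannel();
        MappedByteBuffer buf = ch.map(FileChannel.MapMode.READ_ONLY, 0, ch.size());
        buf.load();   // fault every page in; the mapping stays valid for readers
        return buf;
    }
}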

Shimi K wrote:
> Is Hadoop cache frequently/LRU/MRU map input files? Or does it upload files
> from the disk each time a file is needed no matter if it was the same file
> that was required by the last job on the same node?
>
> I am currently using version 0.14.4
>
> - Shimi
>
>   
-- 
Jason Venner
Attributor - Publish with Confidence <http://www.attributor.com/>
Attributor is hiring Hadoop Wranglers, contact if interested

Re: Caching frequently map input files

Posted by Ted Dunning <td...@veoh.com>.
Actually, no.

You need to write something that never exits.  More like a web server never
exits.  Something that handles many requests without exiting.  If you want
<100ms response times, that is going to be the only way.

There are many reasons for the slow startup of map-reduce jobs.  One
prominent reason is that the executable code for each program has to be
copied into HDFS and then each task node has to copy that code back out of
HDFS.  Then the tasktrackers in the cluster have to launch multiple JVMs.
This all takes time.  The amount of time is not very important if you are
running a job that takes even a few minutes, but it is a complete
show-stopper for your application where, as you say, milliseconds matter.

If, however, your jobs are already launched and already have a file loaded,
then your requests should have less than 1 ms of overhead and will get the
parallelism and redundancy that you desire.


On 2/11/08 6:31 AM, "Shimi K" <sh...@gmail.com> wrote:

> To do such a thing I will need to implement something which is
> very similar to Hadoop map reduce but with faster startup job time. Why does
> it takes Hadoop so long to start the job?


Re: Caching frequently map input files

Posted by Shimi K <sh...@gmail.com>.
If I understood correctly, except for the front-end part and the caching at
the nodes, Hadoop already does most of what you described. In any case, if I
use Hadoop I will need a front end that sends sub-queries to it (I have
already written that part). To do such a thing I would need to implement
something very similar to Hadoop map reduce but with a faster job startup
time. Why does it take Hadoop so long to start a job?

On Feb 11, 2008 3:20 PM, Ted Dunning <td...@veoh.com> wrote:

>
>
> You should be looking at HDFS (part of hadoop) plus hbase or code that you
> write yourself.
>
> Hadoop is built in two parts.  One part is the distributed file system
> that
> provides replication and similar functions.  You can access this file
> system
> pretty easily from Java.  Your requirements are pretty easy to meet using
> HDFS.  HDFS provides the capabilities to write files (once without
> modification after creation) and read them.  You can also tell where
> pieces
> of a file are.  Your files are small enough so that they will fit in a
> single HDFS block so their location will be simple.  HDFS is highly
> reliable
> and can provide very high read bandwidth, especially for processes that
> run
> on the same nodes as the storage nodes.
>
> The second part is the map-reduce framework.  This handles the invocation
> of
> map-reduce jobs which are batch oriented and take several seconds to
> start.
> Hadoop's map reduce framework is not suitable for your application because
> of your real-time query requirement.
>
> Hbase is a database layer similar to Google's bigtables.  It might be
> helpful to you, but it sounds like your queries are more complex than what
> hbase provides.
>
> My guess is that to satisfy your needs, you need to have a farm of query
> servers that read files from HDFS, but which run essentially forever.  You
> will need a front end server that accepts queries and forwards sub-queries
> on to machines which read data from HDFS at startup time and then just
> handle sub-queries.  Your front end machine should handle sending
> sub-queries to nodes that have the data in question in local storage if
> possible and should handle sending sub-queries to a different machine if
> an
> answer doesn't come back when expected.  At 10 per second, you probably
> don't need anything fancy for your front end machine, but a second machine
> for fail-over would be good practice.
>
> Look at the Hadoop source code for examples of writing simple servers
> using
> Jetty (which is pretty sweet for embedding HTTP accessible services).  All
> of the service components in Hadoop use that style.
>
> More comments are in-line below.
>
>
>
>
> On 2/11/08 4:36 AM, "Shimi K" <sh...@gmail.com> wrote:
>
> > Here is the information based on your questions and some more
> information
> > about what I am trying to do.
> >
> >> A) first and most importantly, is your program batch oriented, or is it
> >> supposed to respond quickly to random requests?  If it is batch
> oriented,
> >> then it is likely that map-reduce will help.  If it is intended to
> respond
> >> to random requests, then it is unlikely to be a match.
> >
> >
> > My program process ad hoc real time queries.
>
> This means you should not use map-reduce.
>
> >> B) would do you intend to have a very large number of small files
> (large
> >> is  1 million files, very large is greater than 10 million) or are your
> files
> >> very small (small is less than 10MB or so, very small is less than
> 1MB).
> >
> >
> > Most of my files will be small files (around 10 MB). The number of files
> > will be around 200 k. Is HDFS not suitable for this amount of files? Is
> > there a better alternative?
>
> HDFS will work just fine for this.  There are alternative systems, but it
> is
> unlikely that any will work better for you than HDFS.
>
> > C) how long a program startup time can you allow?  Hadoop's map-reduce
> is
> >> oriented mostly around the batch processing of very large data sets
> which
> >> means that a fairly lengthy startup time is acceptable and even
> desirable
> >> if
> >> it allows faster overall throughput.  If you can't stand a startup time
> of
> >> 10 seconds or so, then you need a non-map-reduce design.
> >
> >
> > If by program startup you mean cluster startup then I don't mind if it
> will
> > take seconds or even minutes. If you mean job startup then every
> millisecond
> > is important. The system response time needs to be in milliseconds.
>
> I meant jobs startup.  You need to have query processors that run forever.
>
> > D) if you need real-time queries, can you use hbase?
> >>
> > What exactly do you mean? Do you mean Hadoop map reduce with Hbase
> instead
> > of Hadoop map reduce with HDFS?
>
> I mean hbase with storage in HDFS and no map-reduce except possibly to
> load
> the hbase tables.
>
> >> If you have such a program, then keeping track of what data is already
> in
> >> memory is pretty easy and
> >> using HDFS as a file store could be really good for your application.
> >
> > How can I keep track of what data is already in memory?
>
> If you have many servers running on HDFS nodes, then these servers can
> read
> files that relate to queries that they get and keep those files in memory
> until they get queries for different files and can't keep them any more.
>  If
> the front-end server passes requests to servers that probably have the
> files
> on local disk, then things should move very fast.
>
> > I have a code that does a complex search on binary files (non text
> files). I
> > need to build a system around this code that will meet the following
> > requirements:
> > * The system will get requests for search against all the files in the
> > system.
> > * The system will have 200 k of files. File size is around 10 Mb.
> > * The response time should be in milliseconds.
> > * The system needs to be able to response to multiple requests at the
> same
> > time. (I am not sure how much but I assume it will be around 10 per
> second).
>
> This should be easy to implement with the front-end/sub-query architecture
> that I described.  It will be hard (impossible) to implement using
> map-reduce.
>
> > I figured that I can use Hadoop for this purpose. I know that Hadoop was
> > built for batch processing but I thought I can use it anyway for my
> purpose.
>
> This was a good idea.
>
> > My plan was to do the search in the map part. I can split a file in the
> > cluster and then reduce the search results that came from the same file.
>
> This is a fine idea in principle, but the startup time for map reduce will
> kill you.
>
> > About the tmpfs suggestion that was mentioned here, Is it possible to to
> > upload the HDFS files from each node HDFS to the node ramfs and then to
> do
> > the map reduce on the ramfs?
>
> This is a non-starter because of the map-reduce startup time.  If you have
> custom worker nodes on each HDFS storage node, then this is irrelevant.
>
> > I hardly need to update the file system but
> > what will happen if I will want to delete, update or add file to the
> system?
>
> Make sure that your front end knows about the updates and then have it ask
> the workers to access the new versions of the files.
>
> HDFS is a write-once file system.  You can delete an old file, but you
> can't
> update.  You just write a new copy and delete the old one when you are
> sure
> that nobody is still using it.  That leads to nice update semantics.
>
> > Does any of you think that Hadoop is not the right choice for this kind
> of
> > job? Can you suggest something better?
>
> Hadoop is great for this.  Just not all of it!
>
>

Re: Caching frequently map input files

Posted by Ted Dunning <td...@veoh.com>.

You should be looking at HDFS (part of hadoop) plus hbase or code that you
write yourself.

Hadoop is built in two parts.  One part is the distributed file system that
provides replication and similar functions.  You can access this file system
pretty easily from Java.  Your requirements are pretty easy to meet using
HDFS.  HDFS provides the capabilities to write files (once without
modification after creation) and read them.  You can also tell where pieces
of a file are.  Your files are small enough so that they will fit in a
single HDFS block so their location will be simple.  HDFS is highly reliable
and can provide very high read bandwidth, especially for processes that run
on the same nodes as the storage nodes.
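
For example, pulling a whole file out of HDFS from a plain Java program is
roughly this (untested sketch; the Configuration has to point at your
namenode via fs.default.name):

import java.io.ByteArrayOutputStream;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsRead {
    // Read an entire (small, ~10 MB) HDFS file into memory.
    public static byte[] readFully(String path) throws Exception {
        Configuration conf = new Configuration();   // picks up hadoop-site.xml
        FileSystem fs = FileSystem.get(conf);       // the configured HDFS
        FSDataInputStream in = fs.open(new Path(path));
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[64 * 1024];
        int n;
        while ((n = in.read(buf)) != -1) {
            out.write(buf, 0, n);
        }
        in.close();
        return out.toByteArray();
    }
}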

The second part is the map-reduce framework.  This handles the invocation of
map-reduce jobs which are batch oriented and take several seconds to start.
Hadoop's map reduce framework is not suitable for your application because
of your real-time query requirement.

Hbase is a database layer similar to Google's Bigtable.  It might be
helpful to you, but it sounds like your queries are more complex than what
hbase provides.

My guess is that to satisfy your needs, you need to have a farm of query
servers that read files from HDFS, but which run essentially forever.  You
will need a front end server that accepts queries and forwards sub-queries
on to machines which read data from HDFS at startup time and then just
handle sub-queries.  Your front end machine should handle sending
sub-queries to nodes that have the data in question in local storage if
possible and should handle sending sub-queries to a different machine if an
answer doesn't come back when expected.  At 10 per second, you probably
don't need anything fancy for your front end machine, but a second machine
for fail-over would be good practice.

Look at the Hadoop source code for examples of writing simple servers using
Jetty (which is pretty sweet for embedding HTTP accessible services).  All
of the service components in Hadoop use that style.
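
The shape of such a worker is roughly the following (a minimal sketch that
uses the JDK's built-in HttpServer rather than Jetty, just to show the
idea; the port, URL, and runSearch() are made up):

import com.sun.net.httpserver.HttpExchange;
import com.sun.net.httpserver.HttpHandler;
import com.sun.net.httpserver.HttpServer;
import java.io.IOException;
import java.io.OutputStream;
import java.net.InetSocketAddress;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class QueryWorker {
    // filename -> contents, loaded once from HDFS at startup (loading omitted)
    static final Map<String, byte[]> cache = new ConcurrentHashMap<String, byte[]>();

    static byte[] runSearch(String query) {
        // placeholder: run the real search code against the cached bytes
        return ("no results for " + query).getBytes();
    }

    public static void main(String[] args) throws Exception {
        HttpServer server = HttpServer.create(new InetSocketAddress(8642), 0);
        server.createContext("/query", new HttpHandler() {
            public void handle(HttpExchange ex) throws IOException {
                byte[] answer = runSearch(ex.getRequestURI().getQuery());
                ex.sendResponseHeaders(200, answer.length);
                OutputStream os = ex.getResponseBody();
                os.write(answer);
                os.close();
            }
        });
        server.start();   // runs forever; no per-request job startup cost
    }
}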

More comments are in-line below.




On 2/11/08 4:36 AM, "Shimi K" <sh...@gmail.com> wrote:

> Here is the information based on your questions and some more information
> about what I am trying to do.
> 
>> A) first and most importantly, is your program batch oriented, or is it
>> supposed to respond quickly to random requests?  If it is batch oriented,
>> then it is likely that map-reduce will help.  If it is intended to respond
>> to random requests, then it is unlikely to be a match.
> 
> 
> My program process ad hoc real time queries.

This means you should not use map-reduce.

>> B) would do you intend to have a very large number of small files (large
>> is  1 million files, very large is greater than 10 million) or are your files
>> very small (small is less than 10MB or so, very small is less than 1MB).
> 
> 
> Most of my files will be small files (around 10 MB). The number of files
> will be around 200 k. Is HDFS not suitable for this amount of files? Is
> there a better alternative?

HDFS will work just fine for this.  There are alternative systems, but it is
unlikely that any will work better for you than HDFS.

> C) how long a program startup time can you allow?  Hadoop's map-reduce is
>> oriented mostly around the batch processing of very large data sets which
>> means that a fairly lengthy startup time is acceptable and even desirable
>> if
>> it allows faster overall throughput.  If you can't stand a startup time of
>> 10 seconds or so, then you need a non-map-reduce design.
> 
> 
> If by program startup you mean cluster startup then I don't mind if it will
> take seconds or even minutes. If you mean job startup then every millisecond
> is important. The system response time needs to be in milliseconds.

I meant job startup.  You need to have query processors that run forever.

> D) if you need real-time queries, can you use hbase?
>> 
> What exactly do you mean? Do you mean Hadoop map reduce with Hbase instead
> of Hadoop map reduce with HDFS?

I mean hbase with storage in HDFS and no map-reduce except possibly to load
the hbase tables.

>> If you have such a program, then keeping track of what data is already in
>> memory is pretty easy and
>> using HDFS as a file store could be really good for your application.
> 
> How can I keep track of what data is already in memory?

If you have many servers running on HDFS nodes, then these servers can read
files that relate to queries that they get and keep those files in memory
until they get queries for different files and can't keep them any more.  If
the front-end server passes requests to servers that probably have the files
on local disk, then things should move very fast.

> I have a code that does a complex search on binary files (non text files). I
> need to build a system around this code that will meet the following
> requirements:
> * The system will get requests for search against all the files in the
> system.
> * The system will have 200 k of files. File size is around 10 Mb.
> * The response time should be in milliseconds.
> * The system needs to be able to response to multiple requests at the same
> time. (I am not sure how much but I assume it will be around 10 per second).

This should be easy to implement with the front-end/sub-query architecture
that I described.  It will be hard (impossible) to implement using
map-reduce.

> I figured that I can use Hadoop for this purpose. I know that Hadoop was
> built for batch processing but I thought I can use it anyway for my purpose.

This was a good idea.

> My plan was to do the search in the map part. I can split a file in the
> cluster and then reduce the search results that came from the same file.

This is a fine idea in principle, but the startup time for map reduce will
kill you.

> About the tmpfs suggestion that was mentioned here, Is it possible to to
> upload the HDFS files from each node HDFS to the node ramfs and then to do
> the map reduce on the ramfs?

This is a non-starter because of the map-reduce startup time.  If you have
custom worker nodes on each HDFS storage node, then this is irrelevant.

> I hardly need to update the file system but
> what will happen if I will want to delete, update or add file to the system?

Make sure that your front end knows about the updates and then have it ask
the workers to access the new versions of the files.

HDFS is a write-once file system.  You can delete an old file, but you can't
update.  You just write a new copy and delete the old one when you are sure
that nobody is still using it.  That leads to nice update semantics.
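
In code the update pattern is simply (rough sketch; the paths and version
suffix are made up, and the exact delete() signature varies a bit between
Hadoop versions):

import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplaceFile {
    // Write a new immutable copy, repoint readers at it, then retire the old one.
    public static void update(FileSystem fs, String name, byte[] contents,
                              long version) throws Exception {
        Path newPath = new Path("/data/" + name + "." + version);
        Path oldPath = new Path("/data/" + name + "." + (version - 1));

        FSDataOutputStream out = fs.create(newPath);
        out.write(contents);
        out.close();

        // ... tell the front end to send queries to newPath from now on ...

        fs.delete(oldPath);   // only after no worker is still reading the old copy
    }
}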

> Does any of you think that Hadoop is not the right choice for this kind of
> job? Can you suggest something better?

Hadoop is great for this.  Just not all of it!


Re: Caching frequently map input files

Posted by Shimi K <sh...@gmail.com>.
Here is the information based on your questions and some more information
about what I am trying to do.

> A) first and most importantly, is your program batch oriented, or is it
> supposed to respond quickly to random requests?  If it is batch oriented,
> then it is likely that map-reduce will help.  If it is intended to respond
> to random requests, then it is unlikely to be a match.


My program processes ad hoc real-time queries.


> B) would do you intend to have a very large number of small files (large
> is
> >1 million files, very large is greater than 10 million) or are your files
> very small (small is less than 10MB or so, very small is less than 1MB).
>  If
> you need a very large number of files, you need to redesign your problem
> or
> look for a different file store.  If you are working with very small files
> that nevertheless fit into memory, then you may need to concatenate files
> together to get larger files.


Most of my files will be small files (around 10 MB each). The number of
files will be around 200,000. Is HDFS not suitable for this number of
files? Is there a better alternative?

C) how long a program startup time can you allow?  Hadoop's map-reduce is
> oriented mostly around the batch processing of very large data sets which
> means that a fairly lengthy startup time is acceptable and even desirable
> if
> it allows faster overall throughput.  If you can't stand a startup time of
> 10 seconds or so, then you need a non-map-reduce design.


If by program startup you mean cluster startup, then I don't mind if it
takes seconds or even minutes. If you mean job startup, then every
millisecond is important. The system response time needs to be in
milliseconds; I can't waste seconds on a search that itself takes
milliseconds.

D) if you need real-time queries, can you use hbase?
>
What exactly do you mean? Do you mean Hadoop map reduce with Hbase instead
of Hadoop map reduce with HDFS?


> Based on what you have said so far, it sounds like you either have batch
> oriented input for relatively small batches of inputs (less than 100,000
> or
> so) or you have a real-time query requirement.
>
I have real-time query requirements.


> In either case, you may need to have a program that runs semi-permanently.

What do you mean by "program that runs semi-permanently"?

If you have such a program, then keeping track of what data is already in
> memory is pretty easy and
> using HDFS as a file store could be really good for your application.

How can I keep track of what data is already in memory?

One way to get such a long-lived program is to simply use hbase (if you
> can).

I am not sure if I can, because my knowledge of HBase is limited. Do you
mean HBase instead of HDFS?


I have code that does a complex search on binary (non-text) files. I need
to build a system around this code that meets the following requirements:
* The system will get requests to search against all the files in the
system.
* The system will have about 200,000 files. File size is around 10 MB.
* The response time should be in milliseconds.
* The system needs to be able to respond to multiple requests at the same
time (I am not sure how many, but I assume around 10 per second).

I figured that I could use Hadoop for this purpose. I know that Hadoop was
built for batch processing, but I thought I could use it anyway. My plan
was to do the search in the map part. I can split a file across the
cluster and then reduce the search results that came from the same file.
Or, since I need to do the search on all the files, I can tell Hadoop not
to split my files and each node will do the search on a different file. If
we leave the replication factor out of this discussion, each node will
always do the search on the same files. Since the files can fit into the
machine's RAM, I do not want Hadoop to waste time loading those files from
disk over and over again.
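
(The way I understand it, telling Hadoop not to split the input comes down
to an input format that overrides isSplitable. A rough sketch is below; it
extends TextInputFormat only for the override, the exact base class may
differ in 0.14, and a binary record reader would still be needed for my
files.)

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.TextInputFormat;

// An input format that never splits its files, so each map gets a whole file.
public class WholeFileInputFormat extends TextInputFormat {
    protected boolean isSplitable(FileSystem fs, Path file) {
        return false;
    }
}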

About the tmpfs suggestion that was mentioned here: is it possible to copy
the HDFS files on each node into that node's ramfs and then do the map
reduce on the ramfs? I hardly ever need to update the file system, but
what will happen if I want to delete, update, or add a file to the system?

Do any of you think that Hadoop is not the right choice for this kind of
job? Can you suggest something better?

On Feb 11, 2008 10:03 AM, Ted Dunning <td...@veoh.com> wrote:

>
> This begins to sound like you are trying to do something that is a very
> nice
> match to hadoop's map-reduce framework at all.  It may be that HDFS would
> be
> very helpful to you, but map-reduce may not be so much help.
>
> Here are a few question about your application that will help define the
> answer whether your application is a good map-reduce candidate:
>
> A) first and most importantly, is your program batch oriented, or is it
> supposed to respond quickly to random requests?  If it is batch oriented,
> then it is likely that map-reduce will help.  If it is intended to respond
> to random requests, then it is unlikely to be a match.
>
> B) would do you intend to have a very large number of small files (large
> is
> >1 million files, very large is greater than 10 million) or are your files
> very small (small is less than 10MB or so, very small is less than 1MB).
>  If
> you need a very large number of files, you need to redesign your problem
> or
> look for a different file store.  If you are working with very small files
> that nevertheless fit into memory, then you may need to concatenate files
> together to get larger files.
>
> C) how long a program startup time can you allow?  Hadoop's map-reduce is
> oriented mostly around the batch processing of very large data sets which
> means that a fairly lengthy startup time is acceptable and even desirable
> if
> it allows faster overall throughput.  If you can't stand a startup time of
> 10 seconds or so, then you need a non-map-reduce design.
>
> D) if you need real-time queries, can you use hbase?
>
> Based on what you have said so far, it sounds like you either have batch
> oriented input for relatively small batches of inputs (less than 100,000
> or
> so) or you have a real-time query requirement.  In either case, you may
> need
> to have a program that runs semi-permanently. If you have such a program,
> then keeping track of what data is already in memory is pretty easy and
> using HDFS as a file store could be really good for your application.  One
> way to get such a long-lived program is to simply use hbase (if you can).
> If that doesn't work for you, you might try using the map-file structure
> or
> lucene to implement your own long-running distributed search system.
>
> If you can be more specific about what you are trying to do, you are
> likely
> to get better answers.
>
> On 2/10/08 10:05 PM, "Shimi K" <sh...@gmail.com> wrote:
>
> > I choose Hadoop more for the distributed calculation then the support
> for
> > huge files and my files do fit into memory.
> > I have a lot of small files and my system needs to search for something
> in
> > those files very fast. I figured I can distribute the files on a Hadoop
> > cluster and then uses the distributed calculation to do the search in
> > parallel on many files as possible. This way I would be able to return a
> > result faster then if I would have used one machine.
> >
> > Is there a way to tell which files are in memory?
> >
> >
> > On Feb 10, 2008 10:33 PM, Ted Dunning <td...@veoh.com> wrote:
> >
> >>
> >> But if your files DO fit into memory then the datanodes that have
> copies
> >> of
> >> the blocks of your file will probably still have them in memory and
> since
> >> maps are typically data local, you will benefit as much as possible.
> >>
> >>
> >> On 2/10/08 11:17 AM, "Arun C Murthy" <ac...@yahoo-inc.com> wrote:
> >>
> >>>> Is Hadoop cache frequently/LRU/MRU map input files? Or does it
> >>>> upload files
> >>>> from the disk each time a file is needed no matter if it was the
> >>>> same file
> >>>> that was required by the last job on the same node?
> >>>>
> >>>
> >>> There is no concept of caching input files across jobs.
> >>>
> >>> Hadoop is geared towards dealing with _huge_ amounts of data which
> >>> don't fit into memory anyway... and hence doing it across jobs is
> moot.
> >>
> >>
>
>

Re: Caching frequently map input files

Posted by Ted Dunning <td...@veoh.com>.
Are you trying to do ad hoc real-time queries or batches of queries?


On 2/11/08 12:03 AM, "Shimi K" <sh...@gmail.com> wrote:

> You misunderstood me. I do not want maps across different nodes to share
> their cache. I am not looking for data replication across nodes. Right now
> the amount of data I have on those files can fit into RAM. I want to use
> Hadoop to search those files in parallel. This way I will reduce the amount
> of time it will take me to search all the files on one machine. Even if the
> amount of data will grow beyond the RAM of a single machine, I have no
> problem adding additional machine to the cluster in order to make the search
> faster.
> 
> 90% of my jobs will do a search on all the files in the cluster! I want to
> make sure that each node will not waste time uploading (from the disk) the
> same file which it already uploaded for the previous job.
> 
> Shimi
> 
> 
> On Feb 11, 2008 9:10 AM, Amar Kamat <am...@yahoo-inc.com> wrote:
> 
>> Hi,
>> I totally missed what you wanted to convey. What you want is that the
>> maps(the tasks) should be able to share their caches across jobs. In
>> hadoop each task is separate JVM. So sharing caches across tasks is
>> sharing across JVM's and that too over time (i.e to make cache a
>> separate higher level entity) which I think I would not be possible.
>> What you can do is to increase the filesize i.e before uploading to the
>> DFS, concatenate the files. I dont think that will affect your algorithm
>> in any sense. Wherever you need to maintain the file boundary you could
>> do something like
>> concatenated file :
>> filename1 : data1
>> filename2 : data2
>> ..
>> And the map will take care of such hidden structures.
>> So the only way I think is to increase the filesize through concatenation.
>> Amar
>> Shimi K wrote:
>>> I choose Hadoop more for the distributed calculation then the support
>> for
>>> huge files and my files do fit into memory.
>>> I have a lot of small files and my system needs to search for something
>> in
>>> those files very fast. I figured I can distribute the files on a Hadoop
>>> cluster and then uses the distributed calculation to do the search in
>>> parallel on many files as possible. This way I would be able to return a
>>> result faster then if I would have used one machine.
>>> 
>>> Is there a way to tell which files are in memory?
>>> 
>>> 
>>> On Feb 10, 2008 10:33 PM, Ted Dunning <td...@veoh.com> wrote:
>>> 
>>> 
>>>> But if your files DO fit into memory then the datanodes that have
>> copies
>>>> of
>>>> the blocks of your file will probably still have them in memory and
>> since
>>>> maps are typically data local, you will benefit as much as possible.
>>>> 
>>>> 
>>>> On 2/10/08 11:17 AM, "Arun C Murthy" <ac...@yahoo-inc.com> wrote:
>>>> 
>>>> 
>>>>>> Is Hadoop cache frequently/LRU/MRU map input files? Or does it
>>>>>> upload files
>>>>>> from the disk each time a file is needed no matter if it was the
>>>>>> same file
>>>>>> that was required by the last job on the same node?
>>>>>> 
>>>>>> 
>>>>> There is no concept of caching input files across jobs.
>>>>> 
>>>>> Hadoop is geared towards dealing with _huge_ amounts of data which
>>>>> don't fit into memory anyway... and hence doing it across jobs is
>> moot.
>>>>> 
>>>> 
>>> 
>>> 
>> 
>> 


Re: Caching frequently map input files

Posted by Shimi K <sh...@gmail.com>.
You misunderstood me. I do not want maps across different nodes to share
their cache, and I am not looking for data replication across nodes. Right
now the amount of data I have in those files can fit into RAM. I want to
use Hadoop to search those files in parallel; this way I will reduce the
time it would take to search all the files on one machine. Even if the
amount of data grows beyond the RAM of a single machine, I have no problem
adding additional machines to the cluster to make the search faster.

90% of my jobs will do a search on all the files in the cluster! I want to
make sure that each node does not waste time loading (from disk) the same
file that it already loaded for the previous job.

Shimi


On Feb 11, 2008 9:10 AM, Amar Kamat <am...@yahoo-inc.com> wrote:

> Hi,
> I totally missed what you wanted to convey. What you want is that the
> maps(the tasks) should be able to share their caches across jobs. In
> hadoop each task is separate JVM. So sharing caches across tasks is
> sharing across JVM's and that too over time (i.e to make cache a
> separate higher level entity) which I think I would not be possible.
> What you can do is to increase the filesize i.e before uploading to the
> DFS, concatenate the files. I dont think that will affect your algorithm
> in any sense. Wherever you need to maintain the file boundary you could
> do something like
> concatenated file :
> filename1 : data1
> filename2 : data2
> ..
> And the map will take care of such hidden structures.
> So the only way I think is to increase the filesize through concatenation.
> Amar
> Shimi K wrote:
> > I choose Hadoop more for the distributed calculation then the support
> for
> > huge files and my files do fit into memory.
> > I have a lot of small files and my system needs to search for something
> in
> > those files very fast. I figured I can distribute the files on a Hadoop
> > cluster and then uses the distributed calculation to do the search in
> > parallel on many files as possible. This way I would be able to return a
> > result faster then if I would have used one machine.
> >
> > Is there a way to tell which files are in memory?
> >
> >
> > On Feb 10, 2008 10:33 PM, Ted Dunning <td...@veoh.com> wrote:
> >
> >
> >> But if your files DO fit into memory then the datanodes that have
> copies
> >> of
> >> the blocks of your file will probably still have them in memory and
> since
> >> maps are typically data local, you will benefit as much as possible.
> >>
> >>
> >> On 2/10/08 11:17 AM, "Arun C Murthy" <ac...@yahoo-inc.com> wrote:
> >>
> >>
> >>>> Is Hadoop cache frequently/LRU/MRU map input files? Or does it
> >>>> upload files
> >>>> from the disk each time a file is needed no matter if it was the
> >>>> same file
> >>>> that was required by the last job on the same node?
> >>>>
> >>>>
> >>> There is no concept of caching input files across jobs.
> >>>
> >>> Hadoop is geared towards dealing with _huge_ amounts of data which
> >>> don't fit into memory anyway... and hence doing it across jobs is
> moot.
> >>>
> >>
> >
> >
>
>

Re: Caching frequently map input files

Posted by Amar Kamat <am...@yahoo-inc.com>.
Hi,
I totally missed what you wanted to convey. What you want is for the maps 
(the tasks) to be able to share their caches across jobs. In Hadoop each 
task is a separate JVM, so sharing caches across tasks means sharing 
across JVMs, and across time as well (i.e. making the cache a separate, 
higher-level entity), which I don't think is possible. 
What you can do is increase the file size, i.e. concatenate the files 
before uploading them to the DFS. I don't think that will affect your 
algorithm in any way. Wherever you need to maintain a file boundary, you 
could structure the concatenated file like
filename1 : data1
filename2 : data2
...
and have the map take care of such hidden structure. So the only way I 
see is to increase the file size through concatenation.
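
Roughly, the packing side could look like this (untested sketch; length
prefixes instead of the ':' separators, since the payloads are binary):

import java.io.DataInputStream;
import java.io.File;
import java.io.FileInputStream;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class SmallFilePacker {
    // Concatenate many small local files into one big DFS file, keeping the
    // per-file boundaries as (name, length, bytes) records for the map to walk.
    public static void pack(File[] localFiles, String dfsTarget) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        FSDataOutputStream out = fs.create(new Path(dfsTarget));
        for (File f : localFiles) {
            byte[] data = new byte[(int) f.length()];
            DataInputStream in = new DataInputStream(new FileInputStream(f));
            in.readFully(data);
            in.close();
            out.writeUTF(f.getName());   // record: name, payload length, payload
            out.writeInt(data.length);
            out.write(data);
        }
        out.close();
    }
}
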
Amar
Shimi K wrote:
> I choose Hadoop more for the distributed calculation then the support for
> huge files and my files do fit into memory.
> I have a lot of small files and my system needs to search for something in
> those files very fast. I figured I can distribute the files on a Hadoop
> cluster and then uses the distributed calculation to do the search in
> parallel on many files as possible. This way I would be able to return a
> result faster then if I would have used one machine.
>
> Is there a way to tell which files are in memory?
>
>
> On Feb 10, 2008 10:33 PM, Ted Dunning <td...@veoh.com> wrote:
>
>   
>> But if your files DO fit into memory then the datanodes that have copies
>> of
>> the blocks of your file will probably still have them in memory and since
>> maps are typically data local, you will benefit as much as possible.
>>
>>
>> On 2/10/08 11:17 AM, "Arun C Murthy" <ac...@yahoo-inc.com> wrote:
>>
>>     
>>>> Is Hadoop cache frequently/LRU/MRU map input files? Or does it
>>>> upload files
>>>> from the disk each time a file is needed no matter if it was the
>>>> same file
>>>> that was required by the last job on the same node?
>>>>
>>>>         
>>> There is no concept of caching input files across jobs.
>>>
>>> Hadoop is geared towards dealing with _huge_ amounts of data which
>>> don't fit into memory anyway... and hence doing it across jobs is moot.
>>>       
>>     
>
>   


Re: Caching frequently map input files

Posted by Ted Dunning <td...@veoh.com>.
This begins to sound like you are trying to do something that is not a very
nice match to hadoop's map-reduce framework at all.  It may be that HDFS
would be very helpful to you, but map-reduce may not be so much help.

Here are a few questions about your application that will help determine
whether your application is a good map-reduce candidate:

A) first and most importantly, is your program batch oriented, or is it
supposed to respond quickly to random requests?  If it is batch oriented,
then it is likely that map-reduce will help.  If it is intended to respond
to random requests, then it is unlikely to be a match.

B) do you intend to have a very large number of small files (large is
>1 million files, very large is greater than 10 million) or are your files
very small (small is less than 10MB or so, very small is less than 1MB).  If
you need a very large number of files, you need to redesign your problem or
look for a different file store.  If you are working with very small files
that nevertheless fit into memory, then you may need to concatenate files
together to get larger files.

C) how long a program startup time can you allow?  Hadoop's map-reduce is
oriented mostly around the batch processing of very large data sets which
means that a fairly lengthy startup time is acceptable and even desirable if
it allows faster overall throughput.  If you can't stand a startup time of
10 seconds or so, then you need a non-map-reduce design.

D) if you need real-time queries, can you use hbase?

Based on what you have said so far, it sounds like you either have batch
oriented input for relatively small batches of inputs (less than 100,000 or
so) or you have a real-time query requirement.  In either case, you may need
to have a program that runs semi-permanently. If you have such a program,
then keeping track of what data is already in memory is pretty easy and
using HDFS as a file store could be really good for your application.  One
way to get such a long-lived program is to simply use hbase (if you can).
If that doesn't work for you, you might try using the MapFile structure or
Lucene to implement your own long-running distributed search system.

If you can be more specific about what you are trying to do, you are likely
to get better answers.
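
(For reference, Hadoop's MapFile is a sorted, indexed key/value file on
HDFS that supports random lookups by key. A rough, untested sketch of
writing and reading one keyed by filename follows; keys must be appended
in sorted order, and the old BytesWritable accessors are get()/getSize():)

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.MapFile;
import org.apache.hadoop.io.Text;

public class MapFileSketch {
    public static void write(Configuration conf, String dir) throws Exception {
        FileSystem fs = FileSystem.get(conf);
        MapFile.Writer writer =
            new MapFile.Writer(conf, fs, dir, Text.class, BytesWritable.class);
        writer.append(new Text("file-000001"), new BytesWritable(new byte[] { 1, 2, 3 }));
        writer.close();
    }

    public static byte[] lookup(Configuration conf, String dir, String name)
            throws Exception {
        FileSystem fs = FileSystem.get(conf);
        MapFile.Reader reader = new MapFile.Reader(fs, dir, conf);
        try {
            BytesWritable value = new BytesWritable();
            if (reader.get(new Text(name), value) == null) {
                return null;                  // key not present
            }
            byte[] copy = new byte[value.getSize()];
            System.arraycopy(value.get(), 0, copy, 0, copy.length);
            return copy;
        } finally {
            reader.close();
        }
    }
}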

On 2/10/08 10:05 PM, "Shimi K" <sh...@gmail.com> wrote:

> I choose Hadoop more for the distributed calculation then the support for
> huge files and my files do fit into memory.
> I have a lot of small files and my system needs to search for something in
> those files very fast. I figured I can distribute the files on a Hadoop
> cluster and then uses the distributed calculation to do the search in
> parallel on many files as possible. This way I would be able to return a
> result faster then if I would have used one machine.
> 
> Is there a way to tell which files are in memory?
> 
> 
> On Feb 10, 2008 10:33 PM, Ted Dunning <td...@veoh.com> wrote:
> 
>> 
>> But if your files DO fit into memory then the datanodes that have copies
>> of
>> the blocks of your file will probably still have them in memory and since
>> maps are typically data local, you will benefit as much as possible.
>> 
>> 
>> On 2/10/08 11:17 AM, "Arun C Murthy" <ac...@yahoo-inc.com> wrote:
>> 
>>>> Is Hadoop cache frequently/LRU/MRU map input files? Or does it
>>>> upload files
>>>> from the disk each time a file is needed no matter if it was the
>>>> same file
>>>> that was required by the last job on the same node?
>>>> 
>>> 
>>> There is no concept of caching input files across jobs.
>>> 
>>> Hadoop is geared towards dealing with _huge_ amounts of data which
>>> don't fit into memory anyway... and hence doing it across jobs is moot.
>> 
>> 


RE: Caching frequently map input files

Posted by Joydeep Sen Sarma <js...@facebook.com>.
You could make a ramfs file system (totally in-memory) on each node and configure HDFS to use that.


-----Original Message-----
From: Shimi K [mailto:shimi.eng@gmail.com]
Sent: Sun 2/10/2008 10:05 PM
To: core-user@hadoop.apache.org
Subject: Re: Caching frequently map input files
 
I chose Hadoop more for the distributed calculation than for the support
for huge files, and my files do fit into memory.
I have a lot of small files and my system needs to search for something in
those files very fast. I figured I could distribute the files on a Hadoop
cluster and then use the distributed calculation to do the search in
parallel on as many files as possible. This way I would be able to return
a result faster than if I used one machine.

Is there a way to tell which files are in memory?


On Feb 10, 2008 10:33 PM, Ted Dunning <td...@veoh.com> wrote:

>
> But if your files DO fit into memory then the datanodes that have copies
> of
> the blocks of your file will probably still have them in memory and since
> maps are typically data local, you will benefit as much as possible.
>
>
> On 2/10/08 11:17 AM, "Arun C Murthy" <ac...@yahoo-inc.com> wrote:
>
> >> Is Hadoop cache frequently/LRU/MRU map input files? Or does it
> >> upload files
> >> from the disk each time a file is needed no matter if it was the
> >> same file
> >> that was required by the last job on the same node?
> >>
> >
> > There is no concept of caching input files across jobs.
> >
> > Hadoop is geared towards dealing with _huge_ amounts of data which
> > don't fit into memory anyway... and hence doing it across jobs is moot.
>
>


Re: Caching frequently map input files

Posted by Shimi K <sh...@gmail.com>.
I chose Hadoop more for the distributed calculation than for the support
for huge files, and my files do fit into memory.
I have a lot of small files and my system needs to search for something in
those files very fast. I figured I could distribute the files on a Hadoop
cluster and then use the distributed calculation to do the search in
parallel on as many files as possible. This way I would be able to return
a result faster than if I used one machine.

Is there a way to tell which files are in memory?


On Feb 10, 2008 10:33 PM, Ted Dunning <td...@veoh.com> wrote:

>
> But if your files DO fit into memory then the datanodes that have copies
> of
> the blocks of your file will probably still have them in memory and since
> maps are typically data local, you will benefit as much as possible.
>
>
> On 2/10/08 11:17 AM, "Arun C Murthy" <ac...@yahoo-inc.com> wrote:
>
> >> Is Hadoop cache frequently/LRU/MRU map input files? Or does it
> >> upload files
> >> from the disk each time a file is needed no matter if it was the
> >> same file
> >> that was required by the last job on the same node?
> >>
> >
> > There is no concept of caching input files across jobs.
> >
> > Hadoop is geared towards dealing with _huge_ amounts of data which
> > don't fit into memory anyway... and hence doing it across jobs is moot.
>
>

Re: Caching frequently map input files

Posted by Ted Dunning <td...@veoh.com>.
But if your files DO fit into memory, then the datanodes that have copies of
the blocks of your file will probably still have them in memory, and since
maps are typically data-local, you will benefit as much as possible.


On 2/10/08 11:17 AM, "Arun C Murthy" <ac...@yahoo-inc.com> wrote:

>> Is Hadoop cache frequently/LRU/MRU map input files? Or does it
>> upload files
>> from the disk each time a file is needed no matter if it was the
>> same file
>> that was required by the last job on the same node?
>> 
> 
> There is no concept of caching input files across jobs.
> 
> Hadoop is geared towards dealing with _huge_ amounts of data which
> don't fit into memory anyway... and hence doing it across jobs is moot.


Re: Caching frequently map input files

Posted by Arun C Murthy <ac...@yahoo-inc.com>.
Shimi,

On Feb 10, 2008, at 10:32 AM, Shimi K wrote:

> Is Hadoop cache frequently/LRU/MRU map input files? Or does it  
> upload files
> from the disk each time a file is needed no matter if it was the  
> same file
> that was required by the last job on the same node?
>

There is no concept of caching input files across jobs.

Hadoop is geared towards dealing with _huge_ amounts of data which  
don't fit into memory anyway... and hence doing it across jobs is moot.

Arun

> I am currently using version 0.14.4
>
> - Shimi

