Posted to common-user@hadoop.apache.org by Pierre Antoine Du Bois De Naurois <pa...@gmail.com> on 2012/05/17 22:38:48 UTC

is hadoop suitable for us?

Hello,

We have about 50 VMs and we want to distribute processing across them.
However, these VMs share a huge data storage system, and thus their "virtual"
HDDs are all located on the same machine. Would Hadoop be useful for such a
configuration? Could we use Hadoop without HDFS, so that we can retrieve
and store everything on the same storage?

Thanks,
PA

Re: is hadoop suitable for us?

Posted by Michael Segel <mi...@hotmail.com>.
You are going to have to put HDFS on top of your SAN.

The issue is that you introduce overhead and latencies by having attached storage rather than drives that sit physically on the bus inside the case.

Also, I'm going to assume that your SAN is using RAID.
One of the side effects of using a SAN is that you could reduce your replication factor from 3 to 2
(the SAN already protects you from disk failures if you're using RAID).
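
If you do put HDFS on the SAN, a minimal sketch of that setting (assuming a Hadoop 1.x-era hdfs-site.xml; not a tuned configuration):

    <!-- hdfs-site.xml: lower block replication, since the SAN's RAID
         already protects against disk failure -->
    <property>
      <name>dfs.replication</name>
      <value>2</value>
    </property>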




Re: is hadoop suitable for us?

Posted by Luca Pireddu <pi...@crs4.it>.
We're using a multi-user Hadoop MapReduce installation with up to 100
computing nodes, without HDFS.  Since we have a shared cluster and not
all apps use Hadoop, we grow/shrink the Hadoop cluster as the load
changes.  It's working, and because of our hardware setup, performance is
quite close to what we had with HDFS.  We're storing everything directly
on the SAN.

The only problem so far has been trying to get the system to work
without running the JobTracker as root (I posted yesterday about that problem).


Luca
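
A minimal sketch of what pointing a Hadoop 1.x cluster at a shared POSIX mount
instead of HDFS could look like (the JobTracker host and mount path below are
hypothetical):

    <!-- core-site.xml: use the local/posix filesystem instead of HDFS -->
    <property>
      <name>fs.default.name</name>
      <value>file:///</value>
    </property>

    <!-- mapred-site.xml: the JobTracker/TaskTrackers still run as usual -->
    <property>
      <name>mapred.job.tracker</name>
      <value>jobtracker-host:9001</value>
    </property>

Job input and output paths then resolve against the shared mount (e.g.
/san/data/input), provided every node mounts the storage at the same location.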





-- 
Luca Pireddu
CRS4 - Distributed Computing Group
Loc. Pixina Manna Edificio 1
09010 Pula (CA), Italy
Tel: +39 0709250452

Re: is hadoop suitable for us?

Posted by Pierre Antoine DuBoDeNa <pa...@gmail.com>.
Did you use HDFS too, or do you store everything directly on the SAN?

I don't have an exact figure in GB/TB (it is probably around 2TB, so not really that
"huge"), but there are more than 100 million documents to be processed. On a
single machine we can currently process about 200,000 docs/day (several
parsing, indexing, and metadata extraction steps have to be done). So in the worst
case we want to use the 50 VMs to distribute the processing.


RE: is hadoop suitable for us?

Posted by Sagar Shukla <sa...@persistent.co.in>.
Hi PA,
     In my environment we had SAN storage and I/O was pretty good, so if you have a similar environment then I don't see any performance issues.

Just out of curiosity - how much data are you looking to process?

Regards,
Sagar



Re: is hadoop suitable for us?

Posted by Pierre Antoine Du Bois De Naurois <pa...@gmail.com>.
Thanks Sagar, Mathias and Michael for your replies.

It seems we will have to go with Hadoop even if I/O will be slow due to
our configuration.

I will try to post an update on how it works out for our case.

Best,
PA




Re: is hadoop suitable for us?

Posted by Michael Segel <mi...@hotmail.com>.
The short answer is yes. 
The longer answer is that you will have to account for the latencies.

There is more, but you get the idea.

Sent from my iPhone


RE: is hadoop suitable for us?

Posted by Sagar Shukla <sa...@persistent.co.in>.
Hi PA,
       Thanks for the detailed explanation of your environment.

Based on my experience with Hadoop so far, the following is my recommendation:
If you plan to process huge numbers of documents regularly and generate indexes of their metadata, then Hadoop is the way to go. I am not sure about the frequency and the size of the data that you are talking about; generally, Hadoop is used where you need to process GBs or TBs of data at regular intervals.

As far as storage is concerned, it can be used in multiple ways. It is not necessary to process the data and store it only in HDFS; you should be able to output the indexes/metadata and store them on the regular filesystem as well. If you intend to use HDFS for Hadoop's distributed redundancy capabilities and you have SAN storage, then you can create LUNs for each of the VMs and mount them, so that although the data sits on a single storage system, it is presented to the VMs as if it were distributed. Even though it is a single storage system, this still gives you distributed, fast processing through the VMs.

Hope this helps.

Thanks,
Sagar
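
If you do go the LUN-per-VM route, a minimal sketch of how each VM's DataNode
could point at its own SAN-backed mount (Hadoop 1.x property name; the mount
point is hypothetical):

    <!-- hdfs-site.xml on each VM: store HDFS blocks on the LUN mounted for that VM -->
    <property>
      <name>dfs.data.dir</name>
      <value>/mnt/san-lun-01/hdfs/data</value>
    </property>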



Re: is hadoop suitable for us?

Posted by Pierre Antoine Du Bois De Naurois <pa...@gmail.com>.
We have a large number of text files that we want to process and index (plus
apply other algorithms to).

The problem is that our configuration is shared-everything, while Hadoop
assumes a shared-nothing architecture.

We have 50 VMs rather than actual servers, and these share a huge central
storage system. So using HDFS might not be very useful: replication will not
help, and distributing the files is meaningless since all the files will end
up on the same disks anyway. I am afraid that I/O will be very slow with or
without HDFS. So I am wondering whether it will really help us to use
Hadoop/HBase/Pig etc. to distribute the work and run several parallel tasks,
or whether it is "better" to install something different (though I am not
sure what). We heard myHadoop is better for this kind of configuration; do
you have any experience with it?

For example, we now have a central MySQL database to check whether we have
already processed a document, and we keep several pieces of metadata there.
Soon we will have to distribute it, as there is not enough space in one VM.
But would Hadoop/HBase be useful? We don't want to do any complex joins/sorts
of the data; we just want to run queries to check whether a document was
already processed, and if not, add it along with several of its metadata fields.

We heard Sun Grid, for example, is another way to go, but it's commercial. We
are somewhat lost, so any help/ideas/suggestions are appreciated.

Best,
PA
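
For the "have we already processed this document?" check described above, a
minimal sketch of what that lookup could look like with the HBase Java client
(the table name, column family, and qualifiers are hypothetical):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.util.Bytes;

    public class DocTracker {
      // Returns true if the document was new and has now been recorded.
      public static boolean markIfNew(HTable docs, String docId, String metadata) throws Exception {
        byte[] row = Bytes.toBytes(docId);
        byte[] fam = Bytes.toBytes("m");
        if (!docs.get(new Get(row)).isEmpty()) {
          return false;  // already processed
        }
        Put put = new Put(row);
        put.add(fam, Bytes.toBytes("status"), Bytes.toBytes("processed"));
        put.add(fam, Bytes.toBytes("meta"), Bytes.toBytes(metadata));
        // checkAndPut writes only if the status cell is still absent,
        // avoiding a race between the get above and this write.
        return docs.checkAndPut(row, fam, Bytes.toBytes("status"), null, put);
      }

      public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HTable docs = new HTable(conf, "documents");
        System.out.println(markIfNew(docs, "doc-00001", "title=example"));
        docs.close();
      }
    }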




Re: is hadoop suitable for us?

Posted by Abhishek Pratap Singh <ma...@gmail.com>.
Hi,

For your question of whether Hadoop can be used without HDFS, the answer is yes.
Hadoop can be used with any kind of distributed file system.
But I'm not able to understand the problem statement clearly enough to offer my
point of view.
Are you processing text files and saving them in a distributed database?

Regards,
Abhishek
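
A minimal sketch of a job driver that reads and writes a non-HDFS path
(Hadoop 1.x "new" API; the file:// paths are hypothetical and the
mapper/reducer classes are omitted):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class SharedStorageJob {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = new Job(conf, "process-docs");
        job.setJarByClass(SharedStorageJob.class);
        // A real job would set Mapper/Reducer classes here; this sketch
        // falls back to the identity defaults.
        // Paths on the shared mount, bypassing HDFS entirely.
        FileInputFormat.addInputPath(job, new Path("file:///san/docs/input"));
        FileOutputFormat.setOutputPath(job, new Path("file:///san/docs/output"));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }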


Re: is hadoop suitable for us?

Posted by Pierre Antoine Du Bois De Naurois <pa...@gmail.com>.
We want to distribute the processing of text files, run large machine learning
tasks, and have a distributed database, since we have a big amount of data.

The problem is that each VM can hold up to 2TB of data (a limitation of the VM),
and we have 20TB of data. So we have to distribute the processing, the
database, etc. But all of that data will live on a huge shared central file
system.

We heard about myHadoop, but we are not sure how it differs from Hadoop.

Can we run Hadoop/MapReduce without using HDFS? Is that an option?

best,
PA



Re: is hadoop suitable for us?

Posted by Mathias Herberts <ma...@gmail.com>.
Hadoop does not perform well with shared storage and VMs.

The question should first be about what you're trying to achieve,
not about your infrastructure.