Posted to common-user@hadoop.apache.org by Sam Seigal <se...@yahoo.com> on 2011/10/01 01:02:01 UTC

incremental loads into hadoop

Hi,

I am relatively new to Hadoop and was wondering how to do incremental
loads into HDFS.

I have a continuous stream of data flowing into a service which is
writing to an OLTP store. Due to the high volume of data, we cannot do
aggregations on the OLTP store, since this starts affecting the write
performance.

We would like to offload this processing into a Hadoop cluster, mainly
for doing aggregations/analytics.

The question is how can this continuous stream of data be
incrementally loaded and processed into Hadoop ?

Thank you,

Sam

Re: incremental loads into hadoop

Posted by Mohit Anchlia <mo...@gmail.com>.
This process of managing files looks like more pain in the long term. Would
it be easier to store the data in HBase, which has a smaller block size?

What's the avg. file size?

On Sun, Oct 2, 2011 at 7:34 PM, Vitthal "Suhas" Gogate
<go...@hortonworks.com> wrote:
> Agree with Bejoy, although to minimize the processing latency you can still
> choose to write more frequently to HDFS, resulting in a larger number of
> smaller files on HDFS, rather than waiting to accumulate a large amount of
> data before writing to HDFS. As you may have many smaller files, it may be
> good to use CombineFileInputFormat so that you do not end up with a large
> number of very small map tasks (one per file, if a file is smaller than the
> block size). After you process the input data, you may not want to leave
> this large number of small files on HDFS, so you can use the Hadoop Archive
> (HAR) tool to combine and store them as a small number of bigger files. You
> can run this tool periodically in the background to archive input that has
> already been processed; the archive tool itself is implemented as an M/R job.
>
> Also, to get some level of atomicity, you may copy the data to HDFS at a
> temporary location before moving it to the final source partition (or
> directory). Existing data loading tools may be doing that already.
>
> --Suhas Gogate
>
>
> On Sun, Oct 2, 2011 at 11:12 AM, <be...@gmail.com> wrote:
>
>> Sam
>>     Your understanding is right, Hadoop definitely works great with large
>> volumes of data. But not every file has to be in the range of giga-, tera-
>> or petabytes. Usually, when it is said that Hadoop processes terabytes of
>> data, that refers to the total data processed by a map reduce job (or
>> rather jobs; most use cases use more than one map reduce job for
>> processing). It can be 10K files that make up the whole data. Why not a
>> large number of small files? The overhead on the name node of housekeeping
>> all that metadata (file-block information) would be huge, and there are
>> definite limits to it. But you can store smaller files together in
>> splittable compressed formats. In general it is better to keep your file
>> sizes at least the same as, or larger than, your HDFS block size. By
>> default it is 64 MB, but larger clusters have higher values, in multiples
>> of 64. If your HDFS block size or your file sizes are smaller than the map
>> reduce input split size, then it is better to use InputFormats like
>> CombineFileInputFormat for MR jobs. Usually the MR input split size is
>> equal to your HDFS block size. In short, as a better practice your single
>> file size should be at least equal to one HDFS block size.
>>
>> As for the approach of keeping a file open for writing for a long time and
>> reading the same file in parallel with a map reduce job, I fear it won't
>> work; AFAIK it won't. While a write is going on, some blocks or the file
>> itself would be locked; I am not really sure whether it is the full file
>> that gets locked or not. In short, some blocks wouldn't be available to
>> the concurrent map reduce program during its processing.
>>       In your case, a quick solution that comes to my mind is to keep your
>> real-time data writing into a Flume queue/buffer. Set it to a desired
>> size; once the queue gets full, the data would be dumped into HDFS. Then,
>> as per your requirement, you can kick off your jobs. If you are running MR
>> jobs at very high frequency, make sure that for every run you have enough
>> data to process, and choose your maximum number of mappers and reducers
>> effectively and efficiently.
>>   As the last point, I don't think that in normal cases you need to dump
>> your large volume of data into the local file system and then do a
>> copyFromLocal into HDFS. Tools like Flume are built for those purposes, I
>> guess. I'm not an expert on Flume, so you may need to do more reading on
>> it before implementing.
>>
>> This is what I feel about your use case. But let's leave it open for the
>> experts to comment.
>>
>> Hope it helps.
>> Regards
>> Bejoy K S
>>
>> -----Original Message-----
>> From: Sam Seigal <se...@yahoo.com>
>> Sender: saurabh.r.s@gmail.com
>> Date: Sat, 1 Oct 2011 15:50:46
>> To: <co...@hadoop.apache.org>
>> Reply-To: common-user@hadoop.apache.org
>> Subject: Re: incremental loads into hadoop
>>
>> Hi Bejoy,
>>
>> Thanks for the response.
>>
>> While reading about Hadoop, I have come across threads where people
>> claim that Hadoop is not a good fit for a large number of small files.
>> It is good for files that are gigabytes/petabytes in size.
>>
>> If I am doing incremental loads, let's say every hour, do I need to
>> wait until maybe the end of the day, when enough data has been
>> collected, to start off a MapReduce job? I am wondering if an open
>> file that is continuously being written to can at the same time be
>> used as an input to an M/R job ...
>>
>> Also, let's say I did not want to do a load straight off the DB. The
>> service, when committing a transaction to the OLTP system, sends a
>> message for that transaction to a Hadoop service that then writes the
>> transaction into HDFS (the services are connected to each other via a
>> persisted queue, hence are eventually consistent, but that is not a
>> big deal). What should I keep in mind while designing a service like
>> this?
>>
>> Should the files first be written to local disk and, when they reach a
>> large enough size (let us say the cut-off is 100G), then be uploaded
>> into the cluster using put? Or can they be written directly into an
>> HDFS file as the data is streaming in?
>>
>> Thank you for your help.
>>
>>
>> Sam
>>
>> Thank you,
>>
>> Saurabh
>>
>>
>>
>>
>> On Sat, Oct 1, 2011 at 12:19 PM, Bejoy KS <be...@gmail.com> wrote:
>> > Sam
>> >      Try looking into Flume if you need to load incremental data into
>> > HDFS. If the source data is present in some JDBC-compliant databases,
>> > then you can use Sqoop to get the data directly into HDFS or Hive
>> > incrementally. For big data aggregation and analytics, Hadoop is
>> > definitely a good choice, as you can use Map Reduce or optimized tools
>> > on top of map reduce, like Hive or Pig, that cater to the purpose very
>> > well. So in short, for the two steps you can go with the following:
>> > 1. Load into Hadoop/HDFS - use Flume or Sqoop, as per your source.
>> > 2. Process within Hadoop/HDFS - use Hive or Pig. These tools are well
>> > optimised, so go for a custom map reduce job if and only if you feel
>> > these tools don't fit some complex processing.
>> >
>> > There may be other tools as well to get the source data into HDFS. Let
>> > us leave it open for others to comment.
>> >
>> > Hope it helps.
>> >
>> > Thanks and Regards
>> > Bejoy.K.S
>> >
>> >
>> > On Sat, Oct 1, 2011 at 4:32 AM, Sam Seigal <se...@yahoo.com> wrote:
>> >
>> >> Hi,
>> >>
>> >> I am relatively new to Hadoop and was wondering how to do incremental
>> >> loads into HDFS.
>> >>
>> >> I have a continuous stream of data flowing into a service which is
>> >> writing to an OLTP store. Due to the high volume of data, we cannot do
>> >> aggregations on the OLTP store, since this starts affecting the write
>> >> performance.
>> >>
>> >> We would like to offload this processing into a Hadoop cluster, mainly
>> >> for doing aggregations/analytics.
>> >>
>> >> The question is how can this continuous stream of data be
>> >> incrementally loaded and processed into Hadoop ?
>> >>
>> >> Thank you,
>> >>
>> >> Sam
>> >>
>> >
>>
>

Re: incremental loads into hadoop

Posted by "Vitthal \"Suhas\" Gogate" <go...@hortonworks.com>.
Agree with Bejoy, although to minimize the processing latency you can still
choose to write more frequently to HDFS, resulting in a larger number of
smaller files on HDFS, rather than waiting to accumulate a large amount of
data before writing to HDFS. As you may have many smaller files, it may be
good to use CombineFileInputFormat so that you do not end up with a large
number of very small map tasks (one per file, if a file is smaller than the
block size). After you process the input data, you may not want to leave
this large number of small files on HDFS, so you can use the Hadoop Archive
(HAR) tool to combine and store them as a small number of bigger files. You
can run this tool periodically in the background to archive input that has
already been processed; the archive tool itself is implemented as an M/R job.
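
A rough sketch of that periodic archiving step, with invented directory
names (the same thing can be run from the shell as
hadoop archive -archiveName <name>.har -p <parent> <dirs> <dest>):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.tools.HadoopArchives;
import org.apache.hadoop.util.ToolRunner;

// Hedged sketch: pack already-processed small files for one day into a HAR.
// Paths are hypothetical; HadoopArchives ships in the Hadoop tools jar.
public class ArchiveProcessedInput {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Equivalent CLI: hadoop archive -archiveName 2011-10-01.har \
    //                   -p /data/incoming 2011-10-01 /data/archive
    String[] harArgs = {
        "-archiveName", "2011-10-01.har",
        "-p", "/data/incoming",   // parent of the directory being archived
        "2011-10-01",             // source directory, relative to the parent
        "/data/archive"           // where the .har archive is written
    };
    int rc = ToolRunner.run(conf, new HadoopArchives(conf), harArgs);
    System.exit(rc);
  }
}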

Also, to get some level of atomicity, you may copy the data to HDFS at a
temporary location before moving it to the final source partition (or
directory). Existing data loading tools may be doing that already.
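
A minimal sketch of that copy-then-rename pattern with the FileSystem API
(the paths are made up for illustration):

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Hedged sketch: land a batch in a temporary directory first, then publish it
// with a single rename so readers never see a half-copied partition.
public class StageAndPublish {
  public static void main(String[] args) throws IOException {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    Path staging  = new Path("/data/incoming/_tmp/2011-10-01-15");
    Path finalDir = new Path("/data/incoming/2011-10-01-15");

    fs.mkdirs(staging);
    // Copy the hourly extract from the local filesystem into the staging dir.
    fs.copyFromLocalFile(new Path("/var/exports/txn-batch-15.log"),
                         new Path(staging, "txn-batch-15.log"));

    // rename() is a metadata operation inside HDFS, so the partition becomes
    // visible to downstream jobs all at once.
    if (!fs.rename(staging, finalDir)) {
      throw new IOException("Could not publish " + finalDir);
    }
  }
}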

--Suhas Gogate


On Sun, Oct 2, 2011 at 11:12 AM, <be...@gmail.com> wrote:

> Sam
>     Your understanding is right, Hadoop definitely works great with large
> volumes of data. But not every file has to be in the range of giga-, tera-
> or petabytes. Usually, when it is said that Hadoop processes terabytes of
> data, that refers to the total data processed by a map reduce job (or
> rather jobs; most use cases use more than one map reduce job for
> processing). It can be 10K files that make up the whole data. Why not a
> large number of small files? The overhead on the name node of housekeeping
> all that metadata (file-block information) would be huge, and there are
> definite limits to it. But you can store smaller files together in
> splittable compressed formats. In general it is better to keep your file
> sizes at least the same as, or larger than, your HDFS block size. By
> default it is 64 MB, but larger clusters have higher values, in multiples
> of 64. If your HDFS block size or your file sizes are smaller than the map
> reduce input split size, then it is better to use InputFormats like
> CombineFileInputFormat for MR jobs. Usually the MR input split size is
> equal to your HDFS block size. In short, as a better practice your single
> file size should be at least equal to one HDFS block size.
>
> As for the approach of keeping a file open for writing for a long time and
> reading the same file in parallel with a map reduce job, I fear it won't
> work; AFAIK it won't. While a write is going on, some blocks or the file
> itself would be locked; I am not really sure whether it is the full file
> that gets locked or not. In short, some blocks wouldn't be available to
> the concurrent map reduce program during its processing.
>       In your case, a quick solution that comes to my mind is to keep your
> real-time data writing into a Flume queue/buffer. Set it to a desired
> size; once the queue gets full, the data would be dumped into HDFS. Then,
> as per your requirement, you can kick off your jobs. If you are running MR
> jobs at very high frequency, make sure that for every run you have enough
> data to process, and choose your maximum number of mappers and reducers
> effectively and efficiently.
>   As the last point, I don't think that in normal cases you need to dump
> your large volume of data into the local file system and then do a
> copyFromLocal into HDFS. Tools like Flume are built for those purposes, I
> guess. I'm not an expert on Flume, so you may need to do more reading on
> it before implementing.
>
> This is what I feel about your use case. But let's leave it open for the
> experts to comment.
>
> Hope it helps.
> Regards
> Bejoy K S
>
> -----Original Message-----
> From: Sam Seigal <se...@yahoo.com>
> Sender: saurabh.r.s@gmail.com
> Date: Sat, 1 Oct 2011 15:50:46
> To: <co...@hadoop.apache.org>
> Reply-To: common-user@hadoop.apache.org
> Subject: Re: incremental loads into hadoop
>
> Hi Bejoy,
>
> Thanks for the response.
>
> While reading about Hadoop, I have come across threads where people
> claim that Hadoop is not a good fit for a large number of small files.
> It is good for files that are gigabytes/petabytes in size.
>
> If I am doing incremental loads, let's say every hour, do I need to
> wait until maybe the end of the day, when enough data has been
> collected, to start off a MapReduce job? I am wondering if an open
> file that is continuously being written to can at the same time be
> used as an input to an M/R job ...
>
> Also, let's say I did not want to do a load straight off the DB. The
> service, when committing a transaction to the OLTP system, sends a
> message for that transaction to a Hadoop service that then writes the
> transaction into HDFS (the services are connected to each other via a
> persisted queue, hence are eventually consistent, but that is not a
> big deal). What should I keep in mind while designing a service like
> this?
>
> Should the files first be written to local disk and, when they reach a
> large enough size (let us say the cut-off is 100G), then be uploaded
> into the cluster using put? Or can they be written directly into an
> HDFS file as the data is streaming in?
>
> Thank you for your help.
>
>
> Sam
>
> Thank you,
>
> Saurabh
>
>
>
>
> On Sat, Oct 1, 2011 at 12:19 PM, Bejoy KS <be...@gmail.com> wrote:
> > Sam
> >      Try looking into Flume if you need to load incremental data into
> > HDFS. If the source data is present in some JDBC-compliant databases,
> > then you can use Sqoop to get the data directly into HDFS or Hive
> > incrementally. For big data aggregation and analytics, Hadoop is
> > definitely a good choice, as you can use Map Reduce or optimized tools
> > on top of map reduce, like Hive or Pig, that cater to the purpose very
> > well. So in short, for the two steps you can go with the following:
> > 1. Load into Hadoop/HDFS - use Flume or Sqoop, as per your source.
> > 2. Process within Hadoop/HDFS - use Hive or Pig. These tools are well
> > optimised, so go for a custom map reduce job if and only if you feel
> > these tools don't fit some complex processing.
> >
> > There may be other tools as well to get the source data into HDFS. Let
> > us leave it open for others to comment.
> >
> > Hope it helps.
> >
> > Thanks and Regards
> > Bejoy.K.S
> >
> >
> > On Sat, Oct 1, 2011 at 4:32 AM, Sam Seigal <se...@yahoo.com> wrote:
> >
> >> Hi,
> >>
> >> I am relatively new to Hadoop and was wondering how to do incremental
> >> loads into HDFS.
> >>
> >> I have a continuous stream of data flowing into a service which is
> >> writing to an OLTP store. Due to the high volume of data, we cannot do
> >> aggregations on the OLTP store, since this starts affecting the write
> >> performance.
> >>
> >> We would like to offload this processing into a Hadoop cluster, mainly
> >> for doing aggregations/analytics.
> >>
> >> The question is how can this continuous stream of data be
> >> incrementally loaded and processed into Hadoop ?
> >>
> >> Thank you,
> >>
> >> Sam
> >>
> >
>

Re: incremental loads into hadoop

Posted by be...@gmail.com.
Sam
     Your understanding is right, Hadoop definitely works great with large volumes of data. But not every file has to be in the range of giga-, tera- or petabytes. Usually, when it is said that Hadoop processes terabytes of data, that refers to the total data processed by a map reduce job (or rather jobs; most use cases use more than one map reduce job for processing). It can be 10K files that make up the whole data. Why not a large number of small files? The overhead on the name node of housekeeping all that metadata (file-block information) would be huge, and there are definite limits to it. But you can store smaller files together in splittable compressed formats. In general it is better to keep your file sizes at least the same as, or larger than, your HDFS block size. By default it is 64 MB, but larger clusters have higher values, in multiples of 64. If your HDFS block size or your file sizes are smaller than the map reduce input split size, then it is better to use InputFormats like CombineFileInputFormat for MR jobs. Usually the MR input split size is equal to your HDFS block size. In short, as a better practice your single file size should be at least equal to one HDFS block size.
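
A minimal driver sketch for that combine-small-files idea, assuming a Hadoop release that ships CombineTextInputFormat (on the older 0.20-style API you would subclass CombineFileInputFormat yourself); class names and paths are made up:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.CombineTextInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Hedged sketch: group many small hourly files into ~128 MB splits instead of
// spawning one tiny map task per file.
public class HourlyAggregateDriver {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "hourly-aggregate");
    job.setJarByClass(HourlyAggregateDriver.class);

    job.setInputFormatClass(CombineTextInputFormat.class);
    // Cap each combined split at roughly one or two HDFS blocks.
    FileInputFormat.setMaxInputSplitSize(job, 128L * 1024 * 1024);

    // With no mapper/reducer set, the identity classes simply regroup the raw
    // lines; plug in your own aggregation mapper and reducer here.
    job.setOutputKeyClass(LongWritable.class);
    job.setOutputValueClass(Text.class);

    FileInputFormat.addInputPath(job, new Path("/data/incoming/2011-10-01"));
    FileOutputFormat.setOutputPath(job, new Path("/data/aggregated/2011-10-01"));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}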

As for the approach of keeping a file open for writing for a long time and reading the same file in parallel with a map reduce job, I fear it won't work; AFAIK it won't. While a write is going on, some blocks or the file itself would be locked; I am not really sure whether it is the full file that gets locked or not. In short, some blocks wouldn't be available to the concurrent map reduce program during its processing.
       In your case, a quick solution that comes to my mind is to keep your real-time data writing into a Flume queue/buffer. Set it to a desired size; once the queue gets full, the data would be dumped into HDFS. Then, as per your requirement, you can kick off your jobs. If you are running MR jobs at very high frequency, make sure that for every run you have enough data to process, and choose your maximum number of mappers and reducers effectively and efficiently.
   As the last point, I don't think that in normal cases you need to dump your large volume of data into the local file system and then do a copyFromLocal into HDFS. Tools like Flume are built for those purposes, I guess. I'm not an expert on Flume, so you may need to do more reading on it before implementing.
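
A small sketch of the buffer-and-roll idea, with invented class and path names; note that on 2011-era HDFS the last block of a still-open file is generally not visible to readers, so MR jobs should only pick up files that have already been closed:

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Hedged sketch: stream records straight into HDFS and roll to a fresh file
// once the current one passes a size threshold; closed files can then be
// processed by the next MR run.
public class RollingHdfsWriter {
  private static final long ROLL_BYTES = 128L * 1024 * 1024; // ~1-2 HDFS blocks

  private final FileSystem fs;
  private final Path dir;
  private FSDataOutputStream out;
  private int fileIndex = 0;

  public RollingHdfsWriter(Configuration conf, String directory) throws IOException {
    this.fs = FileSystem.get(conf);
    this.dir = new Path(directory);
    fs.mkdirs(dir);
    roll();
  }

  public synchronized void append(String record) throws IOException {
    out.write(record.getBytes("UTF-8"));
    out.write('\n');
    if (out.getPos() >= ROLL_BYTES) {
      roll(); // close the full file and start the next one
    }
  }

  public synchronized void close() throws IOException {
    out.close();
  }

  // In a production setup you would write under a temporary name and rename
  // on close, so jobs never pick up the file that is still open; omitted here.
  private void roll() throws IOException {
    if (out != null) {
      out.close();
    }
    out = fs.create(new Path(dir, "events-" + (fileIndex++) + ".log"));
  }
}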

This is what I feel about your use case. But let's leave it open for the experts to comment.

Hope it helps. 
Regards
Bejoy K S

-----Original Message-----
From: Sam Seigal <se...@yahoo.com>
Sender: saurabh.r.s@gmail.com
Date: Sat, 1 Oct 2011 15:50:46 
To: <co...@hadoop.apache.org>
Reply-To: common-user@hadoop.apache.org
Subject: Re: incremental loads into hadoop

Hi Bejoy,

Thanks for the response.

While reading about Hadoop, I have come across threads where people
claim that Hadoop is not a good fit for a large amount of small files.
It is good for files that are gigabyes/petabytes in size.

If I am doing incremental loads, let's say every hour. Do I need to
wait until maybe at the end of the day when enough data has been
collected to start off a MapReduce job ? I am wondering if an open
file that is continuously being written to can at the same time be
used as an input to an M/R job ...

Also, let's say I did not want to do a load straight off the DB. The
service, when committing a transaction to the OLTP system, sends a
message for that transaction to  a Hadoop Service that then writes the
transaction into HDFS  (the services are connected to each other via a
persisted queue, hence are eventually consistent, but that is not a
big deal) .. What should I keep in mind while designing a service like
this ?

Should the file be first written to local disk, and when they reach a
large enough size (let us say the cut off is 100G), and then be
uploaded into the cluster using put ? or these can be directly written
into an HDFS file as the data is streaming in.

Thank you for your help.


Sam

Thank you,

Saurabh




On Sat, Oct 1, 2011 at 12:19 PM, Bejoy KS <be...@gmail.com> wrote:
> Sam
>      Try looking into Flume if you need to load incremental data into
> HDFS. If the source data is present in some JDBC-compliant databases,
> then you can use Sqoop to get the data directly into HDFS or Hive
> incrementally. For big data aggregation and analytics, Hadoop is
> definitely a good choice, as you can use Map Reduce or optimized tools
> on top of map reduce, like Hive or Pig, that cater to the purpose very
> well. So in short, for the two steps you can go with the following:
> 1. Load into Hadoop/HDFS - use Flume or Sqoop, as per your source.
> 2. Process within Hadoop/HDFS - use Hive or Pig. These tools are well
> optimised, so go for a custom map reduce job if and only if you feel
> these tools don't fit some complex processing.
>
> There may be other tools as well to get the source data into HDFS. Let
> us leave it open for others to comment.
>
> Hope it helps.
>
> Thanks and Regards
> Bejoy.K.S
>
>
> On Sat, Oct 1, 2011 at 4:32 AM, Sam Seigal <se...@yahoo.com> wrote:
>
>> Hi,
>>
>> I am relatively new to Hadoop and was wondering how to do incremental
>> loads into HDFS.
>>
>> I have a continuous stream of data flowing into a service which is
>> writing to an OLTP store. Due to the high volume of data, we cannot do
>> aggregations on the OLTP store, since this starts affecting the write
>> performance.
>>
>> We would like to offload this processing into a Hadoop cluster, mainly
>> for doing aggregations/analytics.
>>
>> The question is how can this continuous stream of data be
>> incrementally loaded and processed into Hadoop ?
>>
>> Thank you,
>>
>> Sam
>>
>

Re: incremental loads into hadoop

Posted by Sam Seigal <se...@yahoo.com>.
Hi Bejoy,

Thanks for the response.

While reading about Hadoop, I have come across threads where people
claim that Hadoop is not a good fit for a large number of small files.
It is good for files that are gigabytes/petabytes in size.

If I am doing incremental loads, let's say every hour, do I need to
wait until maybe the end of the day, when enough data has been
collected, to start off a MapReduce job? I am wondering if an open
file that is continuously being written to can at the same time be
used as an input to an M/R job ...

Also, let's say I did not want to do a load straight off the DB. The
service, when committing a transaction to the OLTP system, sends a
message for that transaction to a Hadoop service that then writes the
transaction into HDFS (the services are connected to each other via a
persisted queue, hence are eventually consistent, but that is not a
big deal). What should I keep in mind while designing a service like
this?

Should the files first be written to local disk and, when they reach a
large enough size (let us say the cut-off is 100G), then be uploaded
into the cluster using put? Or can they be written directly into an
HDFS file as the data is streaming in?

Thank you for your help.


Sam

Thank you,

Saurabh




On Sat, Oct 1, 2011 at 12:19 PM, Bejoy KS <be...@gmail.com> wrote:
> Sam
>      Try looking into Flume if you need to load incremental data into hdfs
> . If the source data is present on some JDBC compliant data bases then you
> can use SQOOP to get in the data directly into hdfs or hive incrementally.
> For Big Data Aggregation and Analytics Hadoop is definitely a good choice,
> as you can use Map Reduce or optimized tools on top of map reduce like hive
> or pig that would cater the purpose very well. So in short for the two steps
> you can go in with the following
> 1. Load into hadoop/hdfs - Use Flume or SQOOP as per your source
> 2. Process within hadoop/hdfs - Use Hive or Pig. These tools are well
> optimised so go in for a custom map reduce if and only if you feel these
> tools don't fit into some complex processing.
>
> There may be other tools as well to get the source data into hdfs. Let us
> leave it open for others to comment.
>
> Hope It helps.
>
> Thanks and Regards
> Bejoy.K.S
>
>
> On Sat, Oct 1, 2011 at 4:32 AM, Sam Seigal <se...@yahoo.com> wrote:
>
>> Hi,
>>
>> I am relatively new to Hadoop and was wondering how to do incremental
>> loads into HDFS.
>>
>> I have a continuous stream of data flowing into a service which is
>> writing to an OLTP store. Due to the high volume of data, we cannot do
>> aggregations on the OLTP store, since this starts affecting the write
>> performance.
>>
>> We would like to offload this processing into a Hadoop cluster, mainly
>> for doing aggregations/analytics.
>>
>> The question is how can this continuous stream of data be
>> incrementally loaded and processed into Hadoop ?
>>
>> Thank you,
>>
>> Sam
>>
>

Re: incremental loads into hadoop

Posted by Bejoy KS <be...@gmail.com>.
Sam
      Try looking into Flume if you need to load incremental data into HDFS.
If the source data is present in some JDBC-compliant databases, then you
can use Sqoop to get the data directly into HDFS or Hive incrementally.
For big data aggregation and analytics, Hadoop is definitely a good choice,
as you can use Map Reduce or optimized tools on top of map reduce, like Hive
or Pig, that cater to the purpose very well. So in short, for the two steps
you can go with the following:
1. Load into Hadoop/HDFS - use Flume or Sqoop, as per your source (a rough
Sqoop sketch follows below).
2. Process within Hadoop/HDFS - use Hive or Pig. These tools are well
optimised, so go for a custom map reduce job if and only if you feel these
tools don't fit some complex processing.
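
For step 1 against a JDBC source, a hedged sketch of an incremental Sqoop import; the connect string, table, and column names are invented, and the Sqoop.runTool entry point has moved between packages across Sqoop 1 releases, so check your version:

import com.cloudera.sqoop.Sqoop;   // org.apache.sqoop.Sqoop in later releases

// Hedged sketch: embed Sqoop 1 to pull only rows newer than the last load.
// Equivalent CLI: sqoop import --connect ... --table transactions \
//   --check-column txn_id --incremental append --last-value 1200042 \
//   --target-dir /data/incoming/transactions
public class IncrementalImport {
  public static void main(String[] args) {
    String[] sqoopArgs = {
        "import",
        "--connect", "jdbc:mysql://dbhost/sales",
        "--username", "etl",
        "--table", "transactions",
        "--check-column", "txn_id",   // monotonically increasing key column
        "--incremental", "append",
        "--last-value", "1200042",    // highest txn_id loaded so far
        "--target-dir", "/data/incoming/transactions"
    };
    System.exit(Sqoop.runTool(sqoopArgs));
  }
}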

There may be other tools as well to get the source data into HDFS. Let us
leave it open for others to comment.

Hope it helps.

Thanks and Regards
Bejoy.K.S


On Sat, Oct 1, 2011 at 4:32 AM, Sam Seigal <se...@yahoo.com> wrote:

> Hi,
>
> I am relatively new to Hadoop and was wondering how to do incremental
> loads into HDFS.
>
> I have a continuous stream of data flowing into a service which is
> writing to an OLTP store. Due to the high volume of data, we cannot do
> aggregations on the OLTP store, since this starts affecting the write
> performance.
>
> We would like to offload this processing into a Hadoop cluster, mainly
> for doing aggregations/analytics.
>
> The question is how can this continuous stream of data be
> incrementally loaded and processed into Hadoop ?
>
> Thank you,
>
> Sam
>

Re: incremental loads into hadoop

Posted by Sam Seigal <se...@yahoo.com>.
I have given HBase a fair amount of thought, and I am looking for
input. Instead of managing incremental loads myself, why not just
set up an HBase cluster? What are some of the trade-offs?
My primary use for this cluster would still be data
analysis/aggregation and not so much random access. Random access
would be a nice-to-have in case there are problems
and we want to examine the data ad hoc.
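
For comparison, a rough sketch of landing each transaction in HBase with the 2011-era client API (the table, column family, and row-key layout here are made up); aggregation then becomes a scan or an MR job over the table rather than a file-loading pipeline:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

// Hedged sketch: one row per transaction, keyed so that time-range scans for
// hourly/daily aggregation jobs stay cheap.
public class TransactionSink {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HTable table = new HTable(conf, "transactions");

    // A purely time-ordered key would hot-spot one region; prefixing with an
    // account/bucket id spreads the write load.
    String rowKey = "acct0042-20111001T150203-txn998877";
    Put put = new Put(Bytes.toBytes(rowKey));
    put.add(Bytes.toBytes("d"), Bytes.toBytes("amount"), Bytes.toBytes("19.99"));
    put.add(Bytes.toBytes("d"), Bytes.toBytes("currency"), Bytes.toBytes("USD"));
    table.put(put);

    table.close();
  }
}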


On Sat, Oct 1, 2011 at 12:31 PM, in.abdul <in...@gmail.com> wrote:
> There are two methods for processing OLTP data:
>
>   1. HStreaming or Scribe - these are the usual methods
>   2. If not, use Chukwa to store the data, so that once you have a
>   decent volume you can move it to HDFS
>
>            Thanks and Regards,
>        S SYED ABDUL KATHER
>                9731841519
>
>
> On Sat, Oct 1, 2011 at 4:32 AM, Sam Seigal [via Lucene] <
> ml-node+s472066n3383949h45@n3.nabble.com> wrote:
>
>> Hi,
>>
>> I am relatively new to Hadoop and was wondering how to do incremental
>> loads into HDFS.
>>
>> I have a continuous stream of data flowing into a service which is
>> writing to an OLTP store. Due to the high volume of data, we cannot do
>> aggregations on the OLTP store, since this starts affecting the write
>> performance.
>>
>> We would like to offload this processing into a Hadoop cluster, mainly
>> for doing aggregations/analytics.
>>
>> The question is how can this continuous stream of data be
>> incrementally loaded and processed into Hadoop ?
>>
>> Thank you,
>>
>> Sam
>>
>>
>>
>
>
> -----
> THANKS AND REGARDS,
> SYED ABDUL KATHER

Re: incremental loads into hadoop

Posted by "in.abdul" <in...@gmail.com>.
There are two methods for processing OLTP data:

   1. HStreaming or Scribe - these are the usual methods
   2. If not, use Chukwa to store the data, so that once you have a
   decent volume you can move it to HDFS

            Thanks and Regards,
        S SYED ABDUL KATHER
                9731841519


On Sat, Oct 1, 2011 at 4:32 AM, Sam Seigal [via Lucene] <
ml-node+s472066n3383949h45@n3.nabble.com> wrote:

> Hi,
>
> I am relatively new to Hadoop and was wondering how to do incremental
> loads into HDFS.
>
> I have a continuous stream of data flowing into a service which is
> writing to an OLTP store. Due to the high volume of data, we cannot do
> aggregations on the OLTP store, since this starts affecting the write
> performance.
>
> We would like to offload this processing into a Hadoop cluster, mainly
> for doing aggregations/analytics.
>
> The question is how can this continuous stream of data be
> incrementally loaded and processed into Hadoop ?
>
> Thank you,
>
> Sam
>
>
>


-----
THANKS AND REGARDS,
SYED ABDUL KATHER