Posted to mapreduce-user@hadoop.apache.org by Shashidhar Rao <ra...@gmail.com> on 2014/07/19 19:29:48 UTC

Merging small files

Hi ,

Has anybody worked on a retail use case? My production Hadoop cluster
block size is 256 MB, but each retail invoice we have to process is only
about 4 KB. Do we merge the invoice data into one large file of, say, 1 GB?
What is the best practice in this scenario?


Regards
Shashi

Re: Merging small files

Posted by Edward Capriolo <ed...@gmail.com>.
Don't have time to read the thread, but in case it has not been mentioned....

Unleash filecrusher!
https://github.com/edwardcapriolo/filecrush


On Sun, Jul 20, 2014 at 4:47 AM, Kilaru, Sambaiah <
Sambaiah_Kilaru@intuit.com> wrote:

>  This is not the place to discuss the merits or demerits of MapR; small
> files behave very badly with MapR. Small files go into one container (to
> fill up 256 MB or whatever the container size is), and with locality most
> of the mappers go to only three datanodes.
>
>  You should be looking into the SequenceFile format (a minimal writer
> sketch follows after this message).
>
>  Thanks,
> Sam
>
>   From: "M. C. Srivas" <mc...@gmail.com>
> Reply-To: "user@hadoop.apache.org" <us...@hadoop.apache.org>
> Date: Sunday, July 20, 2014 at 8:01 AM
> To: "user@hadoop.apache.org" <us...@hadoop.apache.org>
> Subject: Re: Merging small files
>
>   You should look at MapR .... a few hundreds of billions of small files is
> absolutely no problem. (Disclosure: I work for MapR.)
>
>
> On Sat, Jul 19, 2014 at 10:29 AM, Shashidhar Rao <
> raoshashidhar123@gmail.com> wrote:
>
>>   Hi ,
>>
>>  Has anybody worked on a retail use case? My production Hadoop cluster
>> block size is 256 MB, but each retail invoice we have to process is only
>> about 4 KB. Do we merge the invoice data into one large file of, say, 1 GB?
>> What is the best practice in this scenario?
>>
>>
>>  Regards
>>  Shashi
>>
>
>
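
A minimal sketch of the SequenceFile approach suggested above, assuming the invoices land in one HDFS directory; the paths and class name are hypothetical. It packs each small file into a single large, splittable SequenceFile keyed by the original file name:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class InvoiceMerger {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    Path input = new Path("/data/invoices/incoming");    // hypothetical directory of ~4 KB files
    Path output = new Path("/data/invoices/merged.seq"); // one large output file

    try (SequenceFile.Writer writer = SequenceFile.createWriter(conf,
        SequenceFile.Writer.file(output),
        SequenceFile.Writer.keyClass(Text.class),
        SequenceFile.Writer.valueClass(BytesWritable.class))) {
      for (FileStatus status : fs.listStatus(input)) {
        if (status.isDirectory()) {
          continue;                                       // skip subdirectories
        }
        byte[] content = new byte[(int) status.getLen()];
        try (FSDataInputStream in = fs.open(status.getPath())) {
          in.readFully(content);                          // files are tiny, so read them whole
        }
        // key = original file name, value = raw invoice bytes
        writer.append(new Text(status.getPath().getName()), new BytesWritable(content));
      }
    }
  }
}

Edward's filecrush tool linked above does essentially this as a MapReduce job, with compression and output-size handling built in, so at scale it is usually the easier route.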

Re: Data cleansing in modern data architecture

Posted by Peyman Mohajerian <mo...@gmail.com>.
If your data is in different partitions in HDFS, you can simply use tools
like Hive or Pig to read the data in a given partition, filter out the bad
data, and overwrite the partition. This kind of data cleansing is common
practice; I'm not sure why there is such a back and forth on this topic. Of
course the HBase approach works too, but I think that only makes sense if
you frequently have a large number of bad records. Otherwise, running a
weekly or nightly scan over your data and rewriting it, typically with
MapReduce, is the conventional way to do it in HDFS.
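
A minimal sketch of the partition-overwrite pattern described above, submitted through the Hive JDBC driver. The table, partition column, and the bad_invoice_ids lookup table are hypothetical, and the NOT IN subquery assumes a Hive version that supports subqueries in WHERE (0.13 or later):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class PartitionCleanse {
  public static void main(String[] args) throws Exception {
    // Requires the Hive JDBC driver (org.apache.hive.jdbc.HiveDriver) on the classpath.
    // HiveServer2 endpoint; adjust host, port, database, and credentials.
    try (Connection conn = DriverManager.getConnection(
             "jdbc:hive2://localhost:10000/default", "analyst", "");
         Statement stmt = conn.createStatement()) {
      // Rewrite one day's partition without the known-bad invoice ids.
      // Hive replaces the partition's files wholesale, which is how you
      // "delete" rows in HDFS without editing files in place.
      stmt.execute(
          "INSERT OVERWRITE TABLE invoices PARTITION (ds='2014-08-01') "
          + "SELECT invoice_id, customer_id, amount FROM invoices "
          + "WHERE ds='2014-08-01' "
          + "AND invoice_id NOT IN (SELECT invoice_id FROM bad_invoice_ids)");
    }
  }
}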


On Mon, Aug 18, 2014 at 3:06 PM, Adaryl "Bob" Wakefield, MBA <
adaryl.wakefield@hotmail.com> wrote:

>   Exception files would only work in the case where a known error is
> thrown. The specific case I was trying to find a solution for is when data
> is the result of bugs in the transactional system or some other system that
> generates data based on human interaction. Here is an example:
>
> Customer Service Reps record interactions with clients through a web
> application.
> There is a bug in the web application such that invoices get double
> entered.
> This double entering goes on for days until it’s discovered by someone in
> accounting.
> We now have to go in and remove those double entries because it’s messing
> up every SUM() function result.
>
> In the old world, it was simply a matter of going in the warehouse and
> blowing away those records. I think the solution we came up with is instead
> of dropping that data into a file, drop it into HBASE where you can do row
> level deletes.
>
> Adaryl "Bob" Wakefield, MBA
> Principal
> Mass Street Analytics
> 913.938.6685
> www.linkedin.com/in/bobwakefieldmba
> Twitter: @BobLovesData
>
>  *From:* Jens Scheidtmann <je...@gmail.com>
> *Sent:* Monday, August 18, 2014 12:53 PM
> *To:* user@hadoop.apache.org
> *Subject:* Re: Data cleansing in modern data architecture
>
>     Hi Bob,
>
> the answer to your original question depends entirely on the procedures
> and conventions set forth for your data warehouse. So only you can answer
> it.
>
> If you're asking for best practices, it still depends:
> - How large are your files?
> - Have you enough free space for recoding?
> - Are you better off writing an "exception" file?
> - How do you make sure it is always respected?
> - etc.
>
> Best regards,
>
> Jens
>
>

Re: Data cleansing in modern data architecture

Posted by "Adaryl \"Bob\" Wakefield, MBA" <ad...@hotmail.com>.
Exception files would only work in the case where a known error is thrown. The specific case I was trying to find a solution for is when data is the result of bugs in the transactional system or some other system that generates data based on human interaction. Here is an example:

Customer Service Reps record interactions with clients through a web application.
There is a bug in the web application such that invoices get double entered. 
This double entering goes on for days until it’s discovered by someone in accounting.
We now have to go in and remove those double entries because it’s messing up every SUM() function result.

In the old world, it was simply a matter of going into the warehouse and blowing away those records. I think the solution we came up with is, instead of dropping that data into a file, to drop it into HBase, where you can do row-level deletes (a minimal delete sketch follows below).

Adaryl "Bob" Wakefield, MBA
Principal
Mass Street Analytics
913.938.6685
www.linkedin.com/in/bobwakefieldmba
Twitter: @BobLovesData
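
A minimal sketch of the row-level delete described above, using the HBase 1.x client API; the table name and the assumption that the invoice id is the row key are hypothetical:

import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Delete;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class DoubleEntryCleanup {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    try (Connection conn = ConnectionFactory.createConnection(conf);
         Table invoices = conn.getTable(TableName.valueOf("invoices"))) {
      // args = row keys (invoice ids) of the double-entered invoices.
      List<Delete> deletes = new ArrayList<>();
      for (String invoiceId : args) {
        deletes.add(new Delete(Bytes.toBytes(invoiceId)));
      }
      // Row-level delete: the cells are tombstoned immediately and removed
      // from disk at the next major compaction.
      invoices.delete(deletes);
    }
  }
}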

From: Jens Scheidtmann 
Sent: Monday, August 18, 2014 12:53 PM
To: user@hadoop.apache.org 
Subject: Re: Data cleansing in modern data architecture

Hi Bob, 


the answer to your original question depends entirely on the procedures and conventions set forth for your data warehouse. So only you can answer it.


If you're asking for best practices, it still depends:

- How large are your files?

- Have you enough free space for recoding?

- Are you better off writing an "exception" file?

- How do you make sure it is always respected?

- etc.


Best regards,

Jens


Re: Data cleansing in modern data architecture

Posted by Jens Scheidtmann <je...@gmail.com>.
Hi Bob,

the answer to your original question depends entirely on the procedures and
conventions set forth for your data warehouse. So only you can answer it.

If you're asking for best practices, it still depends:
- How large are your files?
- Have you enough free space for recoding?
- Are you better off writing an "exception" file?
- How do you make sure it is always respected?
- etc.

Best regards,

Jens

Re: Data cleansing in modern data architecture

Posted by "Adaryl \"Bob\" Wakefield, MBA" <ad...@hotmail.com>.
I quickly went over the wikipedia page for temporal databases. Just sounds like a slowly changing dimension. Data being valid at different points in time isn’t weird to me. Most where clauses off a data warehouse are going to use the date dimension. What we’re talking about is data that was never correct at any time. After the application has been debugged, there is zero reason to retain invoices (or any other object) that should have never existed in the first place. If you were to just create a view that didn’t expose the bad records, it seems to me that the where clause of that view would grow to unmanageable proportions as various data bugs popped up over the life of the application.

I should put an asterisk by all of this and say that I haven’t mastered all of these various applications yet. My current understanding of Hive is that while it’s a “warehouse solution” it’s not actually separate data. It just stores table metadata and the data itself is stored on HDFS correct? That’s confusing because it talks about loading data into tables but then there is also no row level delete functionality.

Side note: There are a lot of books out there on the mechanics of Hadoop and the individual projects, but there is very little on how to actually manage data in Hadoop. 

Adaryl "Bob" Wakefield, MBA
Principal
Mass Street Analytics
913.938.6685
www.linkedin.com/in/bobwakefieldmba
Twitter: @BobLovesData

From: Bertrand Dechoux 
Sent: Sunday, August 10, 2014 10:04 AM
To: user@hadoop.apache.org 
Subject: Re: Data cleansing in modern data architecture

Well, keeping bad data has its use too. I assume you know about temporal databases. 

Back to your use case, if you only need to remove a few records from HDFS files, the easiest might be during the reading. It is a view, of course, but it doesn't mean you need to write it back to HDFS. All your data analysis can be represented as a dataflow. Whether you need to save the data at a given point of the flow is your call. The tradeoff is of course between the time gained for later analysis versus the time required to write the state durably (to HDFS).

Bertrand Dechoux



On Sun, Aug 10, 2014 at 4:32 PM, Sriram Ramachandrasekaran <sr...@gmail.com> wrote:

  Ok. If you think the noise level in the data is going to be that low, creating the view is probably costly and meaningless.
  HDFS is append-only, so there's no point writing the transactions as HDFS files and trying to perform analytics on top of them directly.

  Instead, you could go with HBase, with row keys being the keys with which you identify and resolve these transactional errors.
  i.e., if faulty transaction data were logged, it would have a transaction ID along with the associated data.
  In case you found out that a particular invoice is offending, you could just remove it from HBase using the transaction ID (row key).

  But if you want to use the same table for running different reports, it might not work out.
  Because most HBase operations depend on the row key, and in this table the transaction ID is the row key,
  there's no way your reports would leverage it. So you need to decide which is the costlier operation, removing noise or running
  _more_ ad hoc reports, and decide to use a different table (which is a view again) for reports, etc.





  On Sun, Aug 10, 2014 at 2:02 PM, Adaryl "Bob" Wakefield, MBA <ad...@hotmail.com> wrote:

    It’s a lot of theory right now, so let me give you the full background and see if we can refine the answer further.

    I’ve had a lot of clients with data warehouses that just weren’t functional for various reasons. I’m researching Hadoop to try and figure out a way to totally eliminate traditional data warehouses. I know all the arguments for keeping them around and I’m not impressed with any of them. I’ve noticed for a while that traditional data storage methods just aren’t up to the task for the things we’re asking data to do these days.

    I’ve got MOST of it figured out. I know how to store and deliver analytics using all the various tools within the Apache project (and some NOT in the Apache project). What I haven’t figured out is how to do data cleansing or master data management both of which are hard to do if you can’t change anything.

    So let’s say there is a transactional system. It’s a web application that is the business’s main source of revenue. All the activity of the user on the website is easily structured (so basically we’re not dealing with unstructured data). The nature of the data is financial.

    The pipeline is fairly straightforward. The data is extracted from the transactional system and placed into a Hadoop environment. From there, it’s exposed by Hive so non-technical business analysts with SQL skills can do what they need to do. Pretty typical, right?

    The problem is the web app is not perfect and occasionally produces junk data. Nothing obvious. It may be a few days before the error is noticed. An example would be phantom invoices. Those invoices get in Hadoop. A few days later an analyst notices that the invoice figures for some period are inflated. 

    Once we identify the offending records there is NO reason for them to remain in the system; it’s meaningless junk data. Those records are of zero value. I encounter this scenario in the real world quite often. In the old world, we would just blow away the offending records. Just writing a view to skip over a couple of records or exclude a few dozen doesn’t make much sense. It’s better to just blow these records away; I’m just not certain what the best way to accomplish that is in the new world.

    Adaryl "Bob" Wakefield, MBA
    Principal
    Mass Street Analytics
    913.938.6685
    www.linkedin.com/in/bobwakefieldmba
    Twitter: @BobLovesData

    From: Sriram Ramachandrasekaran 
    Sent: Saturday, August 09, 2014 11:55 PM
    To: user@hadoop.apache.org 
    Subject: Re: Data cleansing in modern data architecture

    While, I may not have enough context to your entire processing pipeline, here are my thoughts.

    1. It's always useful to have the raw data, irrespective of whether it was right or wrong. The way to look at it is: it's the source of truth at timestamp t.
    2. Note that you only know the data at timestamp t for an id X was wrong because subsequent info about X seems to conflict with the one at t, or some manual debugging finds it out.

    All systems that do reporting/analytics are better off not meddling with the raw data. There should be processed or computed views of this data that massage it, get rid of noisy data, merge duplicate entries, etc., and finally produce an output that's suitable for your reports/analytics. So your idea to write transaction logs to HDFS is fine (unless you are twisting your systems to get it that way), but you just need to introduce one more layer of indirection, which has the business logic to handle noise/errors like this.

    For your specific case, you could have a transaction-processing job which produces a view that takes care of squashing transactions based on an id (something that makes sense in your system) and then handles the business logic of dealing with the bugs/discrepancies in them. Your views could be loaded into a nice columnar store for faster query retrieval (if you have pointed queries based on a key); else, a different store would be needed. Yes, this has the overhead of running the view-creation job, but I think the ability to go back to the raw data and investigate what happened there is worth it.

    Your approach of structuring it and storing it in HBase is also fine as long as you keep the concerns separate (if your write/read workloads are poles apart).

    Hope this helps.






    On Sun, Aug 10, 2014 at 9:36 AM, Adaryl "Bob" Wakefield, MBA <ad...@hotmail.com> wrote:

      Or... as an alternative, since HBase uses HDFS to store its data, can we get around the no-editing-files rule by dropping structured data into HBase? That way, we have data in HDFS that can be deleted. Any real problem with that idea?

      Adaryl "Bob" Wakefield, MBA
      Principal
      Mass Street Analytics
      913.938.6685
      www.linkedin.com/in/bobwakefieldmba
      Twitter: @BobLovesData

      From: Adaryl "Bob" Wakefield, MBA 
      Sent: Saturday, August 09, 2014 8:55 PM
      To: user@hadoop.apache.org 
      Subject: Re: Data cleansing in modern data architecture

      Answer: No, we can’t get rid of bad records; we have to go back and rebuild the entire file. We can’t edit records, but we can get rid of entire files, right? This would suggest that appending data to files isn’t that great of an idea. It sounds like it would be more appropriate to cut a Hadoop data load up into periodic files (days, months, etc.) that can easily be rebuilt should errors occur....

      Adaryl "Bob" Wakefield, MBA
      Principal
      Mass Street Analytics
      913.938.6685
      www.linkedin.com/in/bobwakefieldmba
      Twitter: @BobLovesData

      From: Adaryl "Bob" Wakefield, MBA 
      Sent: Saturday, August 09, 2014 4:01 AM
      To: user@hadoop.apache.org 
      Subject: Re: Data cleansing in modern data architecture

      I’m sorry but I have to revisit this again. Going through the reply below I realized that I didn’t quite get my question answered. Let me be more explicit with the scenario.

      There is a bug in the transactional system.
      The data gets written to HDFS where it winds up in Hive.
      Somebody notices that their report is off/the numbers don’t look right.
      We investigate and find the bug in the transactional system.

      Question: Can we then go back into HDFS and rid ourselves of the bad records? If not, what is the recommended course of action?

      Adaryl "Bob" Wakefield, MBA
      Principal
      Mass Street Analytics
      913.938.6685
      www.linkedin.com/in/bobwakefieldmba

      From: Shahab Yunus 
      Sent: Sunday, July 20, 2014 4:20 PM
      To: user@hadoop.apache.org 
      Subject: Re: Data cleansing in modern data architecture

      I am assuming you meant the batch jobs that are/were used in the old world for data cleansing. 

      As far as I understand, there is no hard and fast rule for it; it depends on the functional and system requirements of the use case. 

      It is also dependent on the technology being used and how it manages 'deletion'.

      E.g. in HBase or Cassandra, you can write batch jobs which clean, correct, or remove unwanted or incorrect data, and then the underlying stores usually have a concept of compaction, which not only defragments data files but also, at that point, removes from disk all the entries marked as deleted.

      But there are considerations to be aware of given that compaction is a heavy process and in some cases (e.g. Cassandra) there can be problems when there are too much data to be removed. Not only that, in some cases, marked-to-be-deleted data, until it is deleted/compacted can slow down normal operations of the data store as well.

      One can also leverage, in HBase's case, the versioning mechanism: the afore-mentioned batch job can simply overwrite the same row key, and the previous version would no longer be the latest. If the max-versions parameter is configured as 1, then no previous version would be maintained (physically it would be, until it is removed at compaction time, but it would not be query-able). A minimal configuration sketch follows at the end of this message.

      In the end, basically cleansing can be done after or before loading but given the append-only and no hard-delete design approaches of most nosql stores, I would say it would be easier to do cleaning before data is loaded in the nosql store. Of course, it bears repeating that it depends on the use case.

      Having said that, on a side note and a bit off-topic, it reminds me of the Lambda Architecture, which combines batch and real-time computation for big data using various technologies. It uses the idea of constant periodic refreshes to reload the data, and within this periodic refresh the expectation is that any invalid older data would be corrected and overwritten by the new refresh load. Thus, basically, the 'batch part' of the LA takes care of data cleansing by reloading everything. But LA is mostly for those systems which are OK with eventually consistent behavior, and it might not be suitable for some systems.

      Regards,
      Shahab



      On Sun, Jul 20, 2014 at 2:36 PM, Adaryl "Bob" Wakefield, MBA <ad...@hotmail.com> wrote:

        In the old world, data cleaning used to be a large part of the data warehouse load. Now that we’re working in a schemaless environment, I’m not sure where data cleansing is supposed to take place. NoSQL sounds fun because theoretically you just drop everything in but transactional systems that generate the data are still full of bugs and create junk data. 

        My question is, where does data cleaning/master data management/CDI belong in a modern data architecture? Before it hits Hadoop? After?

        B.





    -- 
    It's just about how deep your longing is!





  -- 
  It's just about how deep your longing is!
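
A minimal sketch of the max-versions setting mentioned in Shahab's message above, using the HBase 1.x admin API; the table and column family names are hypothetical:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;

public class SingleVersionTable {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    try (Connection conn = ConnectionFactory.createConnection(conf);
         Admin admin = conn.getAdmin()) {
      // Keep only the latest cell per row/column: a cleansing job that
      // overwrites a row replaces the old value, and the older version
      // disappears at the next major compaction.
      HColumnDescriptor family = new HColumnDescriptor("d").setMaxVersions(1);
      HTableDescriptor table = new HTableDescriptor(TableName.valueOf("invoices_clean"));
      table.addFamily(family);
      admin.createTable(table);
    }
  }
}

With max versions set to 1, the cleansing batch job Shahab describes only needs to overwrite the row; no explicit delete is required.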


Re: Data cleansing in modern data architecture

Posted by "Adaryl \"Bob\" Wakefield, MBA" <ad...@hotmail.com>.
I quickly went over the wikipedia page for temporal databases. Just sounds like a slowly changing dimension. Data being valid at different points in time isn’t weird to me. Most where clauses off a data warehouse are going to use the date dimension. What we’re talking about is data that was never correct at any time. After the application has been debugged, there is zero reason to retain invoices (or any other object) that should have never existed in the first place. If you were to just create a view that didn’t expose the bad records, it seems to me that the where clause of that view would grow to unmanageable proportions as various data bugs popped up over the life of the application.

I should put an asterisk by all of this and say that I haven’t mastered all of these various applications yet. My current understanding of Hive is that while it’s a “warehouse solution” it’s not actually separate data. It just stores table metadata and the data itself is stored on HDFS correct? That’s confusing because it talks about loading data into tables but then there is also no row level delete functionality.

Side note: There are a lot of books out there on the mechanics of Hadoop and the individual projects, but there is very little on how to actually manage data in Hadoop. 

Adaryl "Bob" Wakefield, MBA
Principal
Mass Street Analytics
913.938.6685
www.linkedin.com/in/bobwakefieldmba
Twitter: @BobLovesData

From: Bertrand Dechoux 
Sent: Sunday, August 10, 2014 10:04 AM
To: user@hadoop.apache.org 
Subject: Re: Data cleansing in modern data architecture

Well, keeping bad data has its use too. I assume you know about temporal database. 

Back to your use case, if you only need to remove a few records from HDFS files, the easiest might be during the reading. It is a view, of course, but it doesn't mean you need to write it back to HDFS. All your data analysis can be represented as a dataflow. Wether you need to save the data at a given point of the flow is your call. The tradeoff is of course between the time gained for later analysis versus the time required to write the state durably (to HDFS).

Bertrand Dechoux



On Sun, Aug 10, 2014 at 4:32 PM, Sriram Ramachandrasekaran <sr...@gmail.com> wrote:

  Ok. If you think, the noise levels in the data is going to be so less, doing the view creation is probably costly and meaningless. 
  HDFS is append-only. So, there's no point writing the transactions as HDFS files and trying to perform analytics on top of it directly.

  Instead, you could go with HBase, with rowkeys being the keys with which you identify and resolve these transactional errors. 
  i.e., if you had a faulty transaction data being logged, it would have and transaction ID, along with associated data. 
  In case, you found out that a particular invoice is offending, you could just remove it from HBase using the transaction ID (rowkey).

  But, if you want to use the same table for running different reports, it might not work out. 
  Because, most of the HBase operations depend on the rowkey and in this table containing transaction ID as the rowkey 
  there's no way your reports would leverage it. So, you need to decide which is a costly operation, removing noise or running _more_ adhoc

  reports and decide to use a different table(which is a view again) for reports, etc.





  On Sun, Aug 10, 2014 at 2:02 PM, Adaryl "Bob" Wakefield, MBA <ad...@hotmail.com> wrote:

    It’s a lot of theory right now so let me give you the full background and see if we can more refine the answer.

    I’ve had a lot of clients with data warehouses that just weren’t functional for various reasons. I’m researching Hadoop to try and figure out a way to totally eliminate traditional data warehouses. I know all the arguments for keeping them around and I’m not impressed with any of them. I’ve noticed for a while that traditional data storage methods just aren’t up to the task for the things we’re asking data to do these days.

    I’ve got MOST of it figured out. I know how to store and deliver analytics using all the various tools within the Apache project (and some NOT in the Apache project). What I haven’t figured out is how to do data cleansing or master data management both of which are hard to do if you can’t change anything.

    So let’s say there is a transactional system. It’s a web application that is the businesses main source of revenue. All the activity of the user on the website is easily structured (so basically we’re not dealing with un-structured data). The nature of the data is financial.

    The pipeline is fairly straight forward. The data is extracted from the transactional system and placed into a Hadoop environment. From there, it’s exposed by Hive so non technical business analyst with SQL skills can  do what they need to do. Pretty typical right?

    The problem is the web app is not perfect and occasionally produces junk data. Nothing obvious. It may be a few days before the error is noticed. An example would be phantom invoices. Those invoices get in Hadoop. A few days later an analyst notices that the invoice figures for some period are inflated. 

    Once we identify the offending records there is NO reason for them to remain in the system; it’s meaningless junk data. Those records are of zero value. I encounter this scenario in the real world quite often. In the old world, we would just blow away the offending records. Just write a view to skip over a couple of records or exclude a few dozen doesn’t make much sense. It’s better to just blow these records away, I’m just not certain what the best way to accomplish that is in the new world.

    Adaryl "Bob" Wakefield, MBA
    Principal
    Mass Street Analytics
    913.938.6685
    www.linkedin.com/in/bobwakefieldmba
    Twitter: @BobLovesData

    From: Sriram Ramachandrasekaran 
    Sent: Saturday, August 09, 2014 11:55 PM
    To: user@hadoop.apache.org 
    Subject: Re: Data cleansing in modern data architecture

    While, I may not have enough context to your entire processing pipeline, here are my thoughts.

    1. It's always useful to have raw data, irrespective of if it was right or wrong. The way to look at it is, it's the source of truth at timestamp t.
    2. Note that, You only know that the data at timestamp t for an id X was wrong because, subsequent info about X seem to conflict with the one at t or some manual debugging finds it out.

    All systems that does reporting/analytics is better off by not meddling with the raw data. There should be processed or computed views of this data, that massages it, gets rids of noisy data, merges duplicate entries, etc and then finally produces an output that's suitable for your reports/analytics. So, your idea to write transaction logs to HDFS is fine(unless, you are twisting your systems to get it that way), but, you just need to introduce one more layer of indirection, which has the business logic to handle noise/errors like this. 

    For your specific case, you could've a transaction processor up job which produces a view, that takes care of squashing transactions based on id(something that makes sense in your system) and then handles the business logic of how to handle the bugs/discrepancies in them. Your views could be loaded into a nice columnar store for faster query retrieval(if you have pointed queries - based on a key), else, a different store would be needed. Yes, this has the overhead of running the view creation job, but, I think, the ability to go back to raw data and investigate what happened there is worth it. 

    Your approach of structuring it and storing it in HBase is also fine as long as you keep the concerns separate(if your write/read workloads are poles apart).

    Hope this helps.






    On Sun, Aug 10, 2014 at 9:36 AM, Adaryl "Bob" Wakefield, MBA <ad...@hotmail.com> wrote:

      Or...as an alternative, since HBASE uses HDFS to store it’s data, can we get around the no editing file rule by dropping structured data into HBASE? That way, we have data in HDFS that can be deleted. Any real problem with that idea?

      Adaryl "Bob" Wakefield, MBA
      Principal
      Mass Street Analytics
      913.938.6685
      www.linkedin.com/in/bobwakefieldmba
      Twitter: @BobLovesData

      From: Adaryl "Bob" Wakefield, MBA 
      Sent: Saturday, August 09, 2014 8:55 PM
      To: user@hadoop.apache.org 
      Subject: Re: Data cleansing in modern data architecture

      Answer: No we can’t get rid of bad records. We have to go back and rebuild the entire file. We can’t edit records but we can get rid of entire files right? This would suggest that appending data to files isn’t that great of an idea. It sounds like it would be more appropriate to cut a hadoop data load up into periodic files (days, months, etc.) that can easily be rebuilt should errors occur....

      Adaryl "Bob" Wakefield, MBA
      Principal
      Mass Street Analytics
      913.938.6685
      www.linkedin.com/in/bobwakefieldmba
      Twitter: @BobLovesData

      From: Adaryl "Bob" Wakefield, MBA 
      Sent: Saturday, August 09, 2014 4:01 AM
      To: user@hadoop.apache.org 
      Subject: Re: Data cleansing in modern data architecture

      I’m sorry but I have to revisit this again. Going through the reply below I realized that I didn’t quite get my question answered. Let me be more explicit with the scenario.

      There is a bug in the transactional system.
      The data gets written to HDFS where it winds up in Hive.
      Somebody notices that their report is off/the numbers don’t look right.
      We investigate and find the bug in the transactional system.

      Question: Can we then go back into HDFS and rid ourselves of the bad records? If not, what is the recommended course of action?

      Adaryl "Bob" Wakefield, MBA
      Principal
      Mass Street Analytics
      913.938.6685
      www.linkedin.com/in/bobwakefieldmba

      From: Shahab Yunus 
      Sent: Sunday, July 20, 2014 4:20 PM
      To: user@hadoop.apache.org 
      Subject: Re: Data cleansing in modern data architecture

      I am assuming you meant the batch jobs that are/were used in old world for data cleansing. 

      As far as I understand there is no hard and fast rule for it and it depends functional and system requirements of the usecase. 

      It is also dependent on the technology being used and how it manages 'deletion'.

      E.g. in HBase or Cassandra, you can write batch jobs which clean or correct or remove unwanted or incorrect data and than the underlying stores usually have a concept of compaction which not only defragments data files but also at this point removes from disk all the entries marked as deleted.

      But there are considerations to be aware of given that compaction is a heavy process and in some cases (e.g. Cassandra) there can be problems when there are too much data to be removed. Not only that, in some cases, marked-to-be-deleted data, until it is deleted/compacted can slow down normal operations of the data store as well.

      One can also leverage in HBase's case the versioning mechanism and the afore-mentioned batch job can simply overwrite the same row key and the previous version would no longer be the latest. If max-version parameter is configured as 1 then no previous version would be maintained (physically it would be and would be removed at compaction time but would not be query-able.)

      In the end, basically cleansing can be done after or before loading but given the append-only and no hard-delete design approaches of most nosql stores, I would say it would be easier to do cleaning before data is loaded in the nosql store. Of course, it bears repeating that it depends on the use case.

      Having said that, on a side-note and a bit off-topic, it reminds me of the Lamda Architecture that combines batch and real-time computation for big data using various technologies and it uses the idea of constant periodic refreshes to reload the data and within this periodic refresh, the expectations are that any invalid older data would be corrected and overwritten by the new refresh load. Those basically the 'batch part' of the LA takes care of data cleansing by reloading everything. But LA is mostly for thouse systems which are ok with eventually consistent behavior and might not be suitable for some systems.

      Regards,
      Shahab



      On Sun, Jul 20, 2014 at 2:36 PM, Adaryl "Bob" Wakefield, MBA <ad...@hotmail.com> wrote:

        In the old world, data cleaning used to be a large part of the data warehouse load. Now that we’re working in a schemaless environment, I’m not sure where data cleansing is supposed to take place. NoSQL sounds fun because theoretically you just drop everything in but transactional systems that generate the data are still full of bugs and create junk data. 

        My question is, where does data cleaning/master data management/CDI belong in a modern data architecture? Before it hit hits Hadoop? After?

        B.





    -- 
    It's just about how deep your longing is!





  -- 
  It's just about how deep your longing is!


Re: Data cleansing in modern data architecture

Posted by "Adaryl \"Bob\" Wakefield, MBA" <ad...@hotmail.com>.
I quickly went over the wikipedia page for temporal databases. Just sounds like a slowly changing dimension. Data being valid at different points in time isn’t weird to me. Most where clauses off a data warehouse are going to use the date dimension. What we’re talking about is data that was never correct at any time. After the application has been debugged, there is zero reason to retain invoices (or any other object) that should have never existed in the first place. If you were to just create a view that didn’t expose the bad records, it seems to me that the where clause of that view would grow to unmanageable proportions as various data bugs popped up over the life of the application.

I should put an asterisk by all of this and say that I haven’t mastered all of these various applications yet. My current understanding of Hive is that while it’s a “warehouse solution” it’s not actually separate data. It just stores table metadata and the data itself is stored on HDFS correct? That’s confusing because it talks about loading data into tables but then there is also no row level delete functionality.

Side note: There are a lot of books out there on the mechanics of Hadoop and the individual projects, but there is very little on how to actually manage data in Hadoop. 

Adaryl "Bob" Wakefield, MBA
Principal
Mass Street Analytics
913.938.6685
www.linkedin.com/in/bobwakefieldmba
Twitter: @BobLovesData

From: Bertrand Dechoux 
Sent: Sunday, August 10, 2014 10:04 AM
To: user@hadoop.apache.org 
Subject: Re: Data cleansing in modern data architecture

Well, keeping bad data has its use too. I assume you know about temporal database. 

Back to your use case, if you only need to remove a few records from HDFS files, the easiest might be during the reading. It is a view, of course, but it doesn't mean you need to write it back to HDFS. All your data analysis can be represented as a dataflow. Wether you need to save the data at a given point of the flow is your call. The tradeoff is of course between the time gained for later analysis versus the time required to write the state durably (to HDFS).

Bertrand Dechoux



On Sun, Aug 10, 2014 at 4:32 PM, Sriram Ramachandrasekaran <sr...@gmail.com> wrote:

  Ok. If you think, the noise levels in the data is going to be so less, doing the view creation is probably costly and meaningless. 
  HDFS is append-only. So, there's no point writing the transactions as HDFS files and trying to perform analytics on top of it directly.

  Instead, you could go with HBase, with rowkeys being the keys with which you identify and resolve these transactional errors. 
  i.e., if you had a faulty transaction data being logged, it would have and transaction ID, along with associated data. 
  In case, you found out that a particular invoice is offending, you could just remove it from HBase using the transaction ID (rowkey).

  But, if you want to use the same table for running different reports, it might not work out. 
  Because, most of the HBase operations depend on the rowkey and in this table containing transaction ID as the rowkey 
  there's no way your reports would leverage it. So, you need to decide which is a costly operation, removing noise or running _more_ adhoc

  reports and decide to use a different table(which is a view again) for reports, etc.





  On Sun, Aug 10, 2014 at 2:02 PM, Adaryl "Bob" Wakefield, MBA <ad...@hotmail.com> wrote:

    It’s a lot of theory right now so let me give you the full background and see if we can more refine the answer.

    I’ve had a lot of clients with data warehouses that just weren’t functional for various reasons. I’m researching Hadoop to try and figure out a way to totally eliminate traditional data warehouses. I know all the arguments for keeping them around and I’m not impressed with any of them. I’ve noticed for a while that traditional data storage methods just aren’t up to the task for the things we’re asking data to do these days.

    I’ve got MOST of it figured out. I know how to store and deliver analytics using all the various tools within the Apache project (and some NOT in the Apache project). What I haven’t figured out is how to do data cleansing or master data management both of which are hard to do if you can’t change anything.

    So let’s say there is a transactional system. It’s a web application that is the businesses main source of revenue. All the activity of the user on the website is easily structured (so basically we’re not dealing with un-structured data). The nature of the data is financial.

    The pipeline is fairly straight forward. The data is extracted from the transactional system and placed into a Hadoop environment. From there, it’s exposed by Hive so non technical business analyst with SQL skills can  do what they need to do. Pretty typical right?

    The problem is the web app is not perfect and occasionally produces junk data. Nothing obvious. It may be a few days before the error is noticed. An example would be phantom invoices. Those invoices get in Hadoop. A few days later an analyst notices that the invoice figures for some period are inflated. 

    Once we identify the offending records there is NO reason for them to remain in the system; it’s meaningless junk data. Those records are of zero value. I encounter this scenario in the real world quite often. In the old world, we would just blow away the offending records. Just write a view to skip over a couple of records or exclude a few dozen doesn’t make much sense. It’s better to just blow these records away, I’m just not certain what the best way to accomplish that is in the new world.

    Adaryl "Bob" Wakefield, MBA
    Principal
    Mass Street Analytics
    913.938.6685
    www.linkedin.com/in/bobwakefieldmba
    Twitter: @BobLovesData

    From: Sriram Ramachandrasekaran 
    Sent: Saturday, August 09, 2014 11:55 PM
    To: user@hadoop.apache.org 
    Subject: Re: Data cleansing in modern data architecture

    While I may not have enough context on your entire processing pipeline, here are my thoughts.

    1. It's always useful to have raw data, irrespective of whether it was right or wrong. The way to look at it is, it's the source of truth at timestamp t.
    2. Note that you only know the data at timestamp t for an id X was wrong because subsequent info about X seems to conflict with the one at t, or some manual debugging finds it out.

    All systems that do reporting/analytics are better off not meddling with the raw data. There should be processed or computed views of this data that massage it, get rid of noisy data, merge duplicate entries, etc. and then finally produce an output that's suitable for your reports/analytics. So, your idea to write transaction logs to HDFS is fine (unless you are twisting your systems to get it that way), but you just need to introduce one more layer of indirection, which has the business logic to handle noise/errors like this.

    For your specific case, you could have a transaction processing job which produces a view, one that takes care of squashing transactions based on id (something that makes sense in your system) and then handles the business logic of how to deal with the bugs/discrepancies in them. Your views could be loaded into a nice columnar store for faster query retrieval (if you have pointed queries - based on a key); else, a different store would be needed. Yes, this has the overhead of running the view-creation job, but, I think, the ability to go back to raw data and investigate what happened there is worth it.
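
    As a sketch only, such a view-creation job could be a map-only MapReduce pass that copies the raw data while dropping flagged records; the tab-delimited layout, the transaction ID in the first column, and the way the offending IDs are passed in are all assumptions:

      import java.io.IOException;
      import java.util.Arrays;
      import java.util.HashSet;
      import java.util.Set;

      import org.apache.hadoop.conf.Configuration;
      import org.apache.hadoop.fs.Path;
      import org.apache.hadoop.io.LongWritable;
      import org.apache.hadoop.io.NullWritable;
      import org.apache.hadoop.io.Text;
      import org.apache.hadoop.mapreduce.Job;
      import org.apache.hadoop.mapreduce.Mapper;
      import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
      import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

      public class BuildCleanView {

        // Map-only pass: copy raw records to a "clean view" directory,
        // dropping any record whose transaction ID is known to be bad.
        public static class FilterMapper
            extends Mapper<LongWritable, Text, NullWritable, Text> {
          private Set<String> badIds;

          @Override
          protected void setup(Context ctx) {
            // Offending IDs arrive as a comma-separated list in the job conf.
            badIds = new HashSet<String>(Arrays.asList(
                ctx.getConfiguration().get("bad.txn.ids", "").split(",")));
          }

          @Override
          protected void map(LongWritable key, Text line, Context ctx)
              throws IOException, InterruptedException {
            String txnId = line.toString().split("\t", 2)[0];  // ID assumed in column 1
            if (!badIds.contains(txnId)) {
              ctx.write(NullWritable.get(), line);
            }
          }
        }

        public static void main(String[] args) throws Exception {
          Configuration conf = new Configuration();
          conf.set("bad.txn.ids", args[2]);                    // e.g. "INV-123,INV-456"
          Job job = Job.getInstance(conf, "build-clean-view");
          job.setJarByClass(BuildCleanView.class);
          job.setMapperClass(FilterMapper.class);
          job.setNumReduceTasks(0);
          job.setOutputKeyClass(NullWritable.class);
          job.setOutputValueClass(Text.class);
          FileInputFormat.addInputPath(job, new Path(args[0]));    // raw data
          FileOutputFormat.setOutputPath(job, new Path(args[1]));  // cleaned view
          System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
      }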

    Your approach of structuring it and storing it in HBase is also fine as long as you keep the concerns separate(if your write/read workloads are poles apart).

    Hope this helps.






    On Sun, Aug 10, 2014 at 9:36 AM, Adaryl "Bob" Wakefield, MBA <ad...@hotmail.com> wrote:

      Or...as an alternative, since HBase uses HDFS to store its data, can we get around the no-editing-of-files rule by dropping structured data into HBase? That way, we have data in HDFS that can be deleted. Any real problem with that idea?

      Adaryl "Bob" Wakefield, MBA
      Principal
      Mass Street Analytics
      913.938.6685
      www.linkedin.com/in/bobwakefieldmba
      Twitter: @BobLovesData

      From: Adaryl "Bob" Wakefield, MBA 
      Sent: Saturday, August 09, 2014 8:55 PM
      To: user@hadoop.apache.org 
      Subject: Re: Data cleansing in modern data architecture

      Answer: No, we can’t get rid of bad records. We have to go back and rebuild the entire file. We can’t edit records, but we can get rid of entire files, right? This would suggest that appending data to files isn’t that great of an idea. It sounds like it would be more appropriate to cut a Hadoop data load up into periodic files (days, months, etc.) that can easily be rebuilt should errors occur....
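
      If the loads are laid out that way, rebuilding a bad day is just a directory swap. A small sketch with the HDFS FileSystem API, assuming a hypothetical /data/invoices/dt=YYYY-MM-DD layout:

        import org.apache.hadoop.conf.Configuration;
        import org.apache.hadoop.fs.FileSystem;
        import org.apache.hadoop.fs.Path;

        public class DropBadDay {
          public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());
            // Remove only the day that contained the phantom invoices...
            fs.delete(new Path("/data/invoices/dt=2014-08-05"), true);
            // ...then re-export that day from the (fixed) transactional system
            // and copy it back in, e.g. with 'hadoop fs -put' or a reload job.
            fs.close();
          }
        }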

      Adaryl "Bob" Wakefield, MBA
      Principal
      Mass Street Analytics
      913.938.6685
      www.linkedin.com/in/bobwakefieldmba
      Twitter: @BobLovesData

      From: Adaryl "Bob" Wakefield, MBA 
      Sent: Saturday, August 09, 2014 4:01 AM
      To: user@hadoop.apache.org 
      Subject: Re: Data cleansing in modern data architecture

      I’m sorry but I have to revisit this again. Going through the reply below I realized that I didn’t quite get my question answered. Let me be more explicit with the scenario.

      There is a bug in the transactional system.
      The data gets written to HDFS where it winds up in Hive.
      Somebody notices that their report is off/the numbers don’t look right.
      We investigate and find the bug in the transactional system.

      Question: Can we then go back into HDFS and rid ourselves of the bad records? If not, what is the recommended course of action?

      Adaryl "Bob" Wakefield, MBA
      Principal
      Mass Street Analytics
      913.938.6685
      www.linkedin.com/in/bobwakefieldmba

      From: Shahab Yunus 
      Sent: Sunday, July 20, 2014 4:20 PM
      To: user@hadoop.apache.org 
      Subject: Re: Data cleansing in modern data architecture

      I am assuming you meant the batch jobs that are/were used in old world for data cleansing. 

      As far as I understand, there is no hard and fast rule for it; it depends on the functional and system requirements of the use case.

      It is also dependent on the technology being used and how it manages 'deletion'.

      E.g. in HBase or Cassandra, you can write batch jobs which clean or correct or remove unwanted or incorrect data, and then the underlying stores usually have a concept of compaction, which not only defragments data files but also, at this point, removes from disk all the entries marked as deleted.

      But there are considerations to be aware of, given that compaction is a heavy process and in some cases (e.g. Cassandra) there can be problems when there is too much data to be removed. Not only that, in some cases, marked-to-be-deleted data, until it is deleted/compacted, can slow down normal operations of the data store as well.

      One can also leverage, in HBase's case, the versioning mechanism: the aforementioned batch job can simply overwrite the same row key and the previous version would no longer be the latest. If the max-versions parameter is configured as 1 then no previous version would be maintained (physically it would be, and would be removed at compaction time, but would not be query-able).
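
      A small sketch of that overwrite pattern with the HBase Java client, assuming a hypothetical 'invoices' table with a single column family 'd' kept at one version:

        import org.apache.hadoop.conf.Configuration;
        import org.apache.hadoop.hbase.HBaseConfiguration;
        import org.apache.hadoop.hbase.HColumnDescriptor;
        import org.apache.hadoop.hbase.HTableDescriptor;
        import org.apache.hadoop.hbase.TableName;
        import org.apache.hadoop.hbase.client.HBaseAdmin;
        import org.apache.hadoop.hbase.client.HTable;
        import org.apache.hadoop.hbase.client.Put;
        import org.apache.hadoop.hbase.util.Bytes;

        public class OverwriteLatestVersion {
          public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();

            // Table keeps only the latest version of each cell.
            HBaseAdmin admin = new HBaseAdmin(conf);
            HTableDescriptor desc = new HTableDescriptor(TableName.valueOf("invoices"));
            HColumnDescriptor family = new HColumnDescriptor("d");
            family.setMaxVersions(1);
            desc.addFamily(family);
            admin.createTable(desc);
            admin.close();

            // A correction job re-puts the same row key; the old values stop
            // being visible and are physically dropped at the next compaction.
            HTable table = new HTable(conf, "invoices");
            Put put = new Put(Bytes.toBytes("INV-123"));
            put.add(Bytes.toBytes("d"), Bytes.toBytes("amount"), Bytes.toBytes("0.00"));
            table.put(put);
            table.close();
          }
        }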

      In the end, basically, cleansing can be done after or before loading, but given the append-only and no-hard-delete design approach of most NoSQL stores, I would say it is easier to do cleaning before data is loaded into the NoSQL store. Of course, it bears repeating that it depends on the use case.

      Having said that, on a side note and a bit off-topic, it reminds me of the Lambda Architecture, which combines batch and real-time computation for big data using various technologies. It uses the idea of constant periodic refreshes to reload the data, and within this periodic refresh the expectation is that any invalid older data would be corrected and overwritten by the new refresh load. Thus, basically, the 'batch part' of the LA takes care of data cleansing by reloading everything. But the LA is mostly for those systems which are OK with eventually consistent behavior and might not be suitable for some systems.

      Regards,
      Shahab



      On Sun, Jul 20, 2014 at 2:36 PM, Adaryl "Bob" Wakefield, MBA <ad...@hotmail.com> wrote:

        In the old world, data cleaning used to be a large part of the data warehouse load. Now that we’re working in a schemaless environment, I’m not sure where data cleansing is supposed to take place. NoSQL sounds fun because theoretically you just drop everything in, but transactional systems that generate the data are still full of bugs and create junk data.

        My question is, where does data cleaning/master data management/CDI belong in a modern data architecture? Before it hits Hadoop? After?

        B.





    -- 
    It's just about how deep your longing is!





  -- 
  It's just about how deep your longing is!


Re: Data cleansing in modern data architecture

Posted by "Adaryl \"Bob\" Wakefield, MBA" <ad...@hotmail.com>.
I quickly went over the Wikipedia page for temporal databases. It just sounds like a slowly changing dimension. Data being valid at different points in time isn’t weird to me. Most WHERE clauses against a data warehouse are going to use the date dimension. What we’re talking about is data that was never correct at any time. After the application has been debugged, there is zero reason to retain invoices (or any other object) that should never have existed in the first place. If you were to just create a view that didn’t expose the bad records, it seems to me that the WHERE clause of that view would grow to unmanageable proportions as various data bugs popped up over the life of the application.

I should put an asterisk by all of this and say that I haven’t mastered all of these various applications yet. My current understanding of Hive is that while it’s a “warehouse solution,” it doesn’t actually hold separate data. It just stores table metadata, and the data itself is stored on HDFS, correct? That’s confusing, because it talks about loading data into tables but then there is also no row-level delete functionality.
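
That matches how Hive works: it keeps table metadata in its metastore while the rows live in plain HDFS files, and the releases current at the time had no row-level DELETE. The usual workaround is to partition by load date and drop/reload the affected partition wholesale. A hedged sketch over the HiveServer2 JDBC driver, with table and path names that are purely illustrative:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.Statement;

    public class RebuildBadPartition {
      public static void main(String[] args) throws Exception {
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        Connection con = DriverManager.getConnection(
            "jdbc:hive2://localhost:10000/default", "hive", "");
        Statement stmt = con.createStatement();
        // No row-level delete, but a whole partition can be dropped and reloaded.
        stmt.execute("ALTER TABLE invoices DROP IF EXISTS PARTITION (dt='2014-08-05')");
        stmt.execute("LOAD DATA INPATH '/staging/invoices/2014-08-05' " +
                     "INTO TABLE invoices PARTITION (dt='2014-08-05')");
        con.close();
      }
    }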

Side note: There are a lot of books out there on the mechanics of Hadoop and the individual projects, but there is very little on how to actually manage data in Hadoop. 

Adaryl "Bob" Wakefield, MBA
Principal
Mass Street Analytics
913.938.6685
www.linkedin.com/in/bobwakefieldmba
Twitter: @BobLovesData

From: Bertrand Dechoux 
Sent: Sunday, August 10, 2014 10:04 AM
To: user@hadoop.apache.org 
Subject: Re: Data cleansing in modern data architecture

Well, keeping bad data has its use too. I assume you know about temporal databases.

Back to your use case, if you only need to remove a few records from HDFS files, the easiest might be during the reading. It is a view, of course, but it doesn't mean you need to write it back to HDFS. All your data analysis can be represented as a dataflow. Whether you need to save the data at a given point of the flow is your call. The tradeoff is of course between the time gained for later analysis versus the time required to write the state durably (to HDFS).

Bertrand Dechoux



On Sun, Aug 10, 2014 at 4:32 PM, Sriram Ramachandrasekaran <sr...@gmail.com> wrote:

  OK. If you think the noise level in the data is going to be that low, creating the view is probably costly and unnecessary.
  HDFS is append-only. So, there's no point writing the transactions as HDFS files and trying to perform analytics on top of them directly.

  Instead, you could go with HBase, with rowkeys being the keys with which you identify and resolve these transactional errors.
  i.e., if faulty transaction data were logged, it would have a transaction ID, along with associated data.
  If you found out that a particular invoice is offending, you could just remove it from HBase using the transaction ID (rowkey).

  But, if you want to use the same table for running different reports, it might not work out.
  Because most HBase operations depend on the rowkey, and in this table the transaction ID is the rowkey,
  there's no way your reports would leverage it. So, you need to decide which is the costlier operation, removing noise or running _more_ ad hoc
  reports, and whether to use a different table (which is a view again) for reports, etc.





  On Sun, Aug 10, 2014 at 2:02 PM, Adaryl "Bob" Wakefield, MBA <ad...@hotmail.com> wrote:

    It’s a lot of theory right now so let me give you the full background and see if we can refine the answer further.

    I’ve had a lot of clients with data warehouses that just weren’t functional for various reasons. I’m researching Hadoop to try and figure out a way to totally eliminate traditional data warehouses. I know all the arguments for keeping them around and I’m not impressed with any of them. I’ve noticed for a while that traditional data storage methods just aren’t up to the task for the things we’re asking data to do these days.

    I’ve got MOST of it figured out. I know how to store and deliver analytics using all the various tools within the Apache project (and some NOT in the Apache project). What I haven’t figured out is how to do data cleansing or master data management both of which are hard to do if you can’t change anything.

    So let’s say there is a transactional system. It’s a web application that is the business’s main source of revenue. All the activity of the user on the website is easily structured (so basically we’re not dealing with unstructured data). The nature of the data is financial.

    The pipeline is fairly straightforward. The data is extracted from the transactional system and placed into a Hadoop environment. From there, it’s exposed through Hive so non-technical business analysts with SQL skills can do what they need to do. Pretty typical, right?

    The problem is the web app is not perfect and occasionally produces junk data. Nothing obvious. It may be a few days before the error is noticed. An example would be phantom invoices. Those invoices get into Hadoop. A few days later an analyst notices that the invoice figures for some period are inflated.

    Once we identify the offending records, there is NO reason for them to remain in the system; it’s meaningless junk data. Those records are of zero value. I encounter this scenario in the real world quite often. In the old world, we would just blow away the offending records. Just writing a view to skip over a couple of records or excluding a few dozen doesn’t make much sense. It’s better to just blow these records away, I’m just not certain what the best way to accomplish that is in the new world.

    Adaryl "Bob" Wakefield, MBA
    Principal
    Mass Street Analytics
    913.938.6685
    www.linkedin.com/in/bobwakefieldmba
    Twitter: @BobLovesData

    From: Sriram Ramachandrasekaran 
    Sent: Saturday, August 09, 2014 11:55 PM
    To: user@hadoop.apache.org 
    Subject: Re: Data cleansing in modern data architecture

    While I may not have enough context on your entire processing pipeline, here are my thoughts.

    1. It's always useful to have raw data, irrespective of whether it was right or wrong. The way to look at it is, it's the source of truth at timestamp t.
    2. Note that you only know the data at timestamp t for an id X was wrong because subsequent info about X seems to conflict with the one at t, or some manual debugging finds it out.

    All systems that do reporting/analytics are better off not meddling with the raw data. There should be processed or computed views of this data that massage it, get rid of noisy data, merge duplicate entries, etc. and then finally produce an output that's suitable for your reports/analytics. So, your idea to write transaction logs to HDFS is fine (unless you are twisting your systems to get it that way), but you just need to introduce one more layer of indirection, which has the business logic to handle noise/errors like this.

    For your specific case, you could have a transaction processing job which produces a view, one that takes care of squashing transactions based on id (something that makes sense in your system) and then handles the business logic of how to deal with the bugs/discrepancies in them. Your views could be loaded into a nice columnar store for faster query retrieval (if you have pointed queries - based on a key); else, a different store would be needed. Yes, this has the overhead of running the view-creation job, but, I think, the ability to go back to raw data and investigate what happened there is worth it.

    Your approach of structuring it and storing it in HBase is also fine as long as you keep the concerns separate(if your write/read workloads are poles apart).

    Hope this helps.






    On Sun, Aug 10, 2014 at 9:36 AM, Adaryl "Bob" Wakefield, MBA <ad...@hotmail.com> wrote:

      Or...as an alternative, since HBase uses HDFS to store its data, can we get around the no-editing-of-files rule by dropping structured data into HBase? That way, we have data in HDFS that can be deleted. Any real problem with that idea?

      Adaryl "Bob" Wakefield, MBA
      Principal
      Mass Street Analytics
      913.938.6685
      www.linkedin.com/in/bobwakefieldmba
      Twitter: @BobLovesData

      From: Adaryl "Bob" Wakefield, MBA 
      Sent: Saturday, August 09, 2014 8:55 PM
      To: user@hadoop.apache.org 
      Subject: Re: Data cleansing in modern data architecture

      Answer: No, we can’t get rid of bad records. We have to go back and rebuild the entire file. We can’t edit records, but we can get rid of entire files, right? This would suggest that appending data to files isn’t that great of an idea. It sounds like it would be more appropriate to cut a Hadoop data load up into periodic files (days, months, etc.) that can easily be rebuilt should errors occur....

      Adaryl "Bob" Wakefield, MBA
      Principal
      Mass Street Analytics
      913.938.6685
      www.linkedin.com/in/bobwakefieldmba
      Twitter: @BobLovesData

      From: Adaryl "Bob" Wakefield, MBA 
      Sent: Saturday, August 09, 2014 4:01 AM
      To: user@hadoop.apache.org 
      Subject: Re: Data cleansing in modern data architecture

      I’m sorry but I have to revisit this again. Going through the reply below I realized that I didn’t quite get my question answered. Let me be more explicit with the scenario.

      There is a bug in the transactional system.
      The data gets written to HDFS where it winds up in Hive.
      Somebody notices that their report is off/the numbers don’t look right.
      We investigate and find the bug in the transactional system.

      Question: Can we then go back into HDFS and rid ourselves of the bad records? If not, what is the recommended course of action?

      Adaryl "Bob" Wakefield, MBA
      Principal
      Mass Street Analytics
      913.938.6685
      www.linkedin.com/in/bobwakefieldmba

      From: Shahab Yunus 
      Sent: Sunday, July 20, 2014 4:20 PM
      To: user@hadoop.apache.org 
      Subject: Re: Data cleansing in modern data architecture

      I am assuming you meant the batch jobs that are/were used in old world for data cleansing. 

      As far as I understand, there is no hard and fast rule for it; it depends on the functional and system requirements of the use case.

      It is also dependent on the technology being used and how it manages 'deletion'.

      E.g. in HBase or Cassandra, you can write batch jobs which clean or correct or remove unwanted or incorrect data, and then the underlying stores usually have a concept of compaction, which not only defragments data files but also, at this point, removes from disk all the entries marked as deleted.

      But there are considerations to be aware of, given that compaction is a heavy process and in some cases (e.g. Cassandra) there can be problems when there is too much data to be removed. Not only that, in some cases, marked-to-be-deleted data, until it is deleted/compacted, can slow down normal operations of the data store as well.

      One can also leverage, in HBase's case, the versioning mechanism: the aforementioned batch job can simply overwrite the same row key and the previous version would no longer be the latest. If the max-versions parameter is configured as 1 then no previous version would be maintained (physically it would be, and would be removed at compaction time, but would not be query-able).

      In the end, basically, cleansing can be done after or before loading, but given the append-only and no-hard-delete design approach of most NoSQL stores, I would say it is easier to do cleaning before data is loaded into the NoSQL store. Of course, it bears repeating that it depends on the use case.

      Having said that, on a side note and a bit off-topic, it reminds me of the Lambda Architecture, which combines batch and real-time computation for big data using various technologies. It uses the idea of constant periodic refreshes to reload the data, and within this periodic refresh the expectation is that any invalid older data would be corrected and overwritten by the new refresh load. Thus, basically, the 'batch part' of the LA takes care of data cleansing by reloading everything. But the LA is mostly for those systems which are OK with eventually consistent behavior and might not be suitable for some systems.

      Regards,
      Shahab



      On Sun, Jul 20, 2014 at 2:36 PM, Adaryl "Bob" Wakefield, MBA <ad...@hotmail.com> wrote:

        In the old world, data cleaning used to be a large part of the data warehouse load. Now that we’re working in a schemaless environment, I’m not sure where data cleansing is supposed to take place. NoSQL sounds fun because theoretically you just drop everything in, but transactional systems that generate the data are still full of bugs and create junk data.

        My question is, where does data cleaning/master data management/CDI belong in a modern data architecture? Before it hits Hadoop? After?

        B.





    -- 
    It's just about how deep your longing is!





  -- 
  It's just about how deep your longing is!



Re: Data cleansing in modern data architecture

Posted by Bertrand Dechoux <de...@gmail.com>.
Well, keeping bad data has its use too. I assume you know about temporal
database.

Back to your use case, if you only need to remove a few records from HDFS
files, the easiest might be during the reading. It is a view, of course,
but it doesn't mean you need to write it back to HDFS. All your data
analysis can be represented as a dataflow. Wether you need to save the data
at a given point of the flow is your call. The tradeoff is of course
between the time gained for later analysis versus the time required to
write the state durably (to HDFS).

Bertrand Dechoux


On Sun, Aug 10, 2014 at 4:32 PM, Sriram Ramachandrasekaran <
sri.rams85@gmail.com> wrote:

> Ok. If you think, the noise levels in the data is going to be so less,
> doing the view creation is probably costly and meaningless.
> HDFS is append-only. So, there's no point writing the transactions as HDFS
> files and trying to perform analytics on top of it directly.
>
> Instead, you could go with HBase, with rowkeys being the keys with which
> you identify and resolve these transactional errors.
> i.e., if you had a faulty transaction data being logged, it would have and
> transaction ID, along with associated data.
> In case, you found out that a particular invoice is offending, you could
> just remove it from HBase using the transaction ID (rowkey).
>
> But, if you want to use the same table for running different reports, it
> might not work out.
> Because, most of the HBase operations depend on the rowkey and in this
> table containing transaction ID as the rowkey
> there's no way your reports would leverage it. So, you need to decide
> which is a costly operation, removing noise or running _more_ adhoc
> reports and decide to use a different table(which is a view again) for
> reports, etc.
>
>
>
>
> On Sun, Aug 10, 2014 at 2:02 PM, Adaryl "Bob" Wakefield, MBA <
> adaryl.wakefield@hotmail.com> wrote:
>
>>   It’s a lot of theory right now so let me give you the full background
>> and see if we can more refine the answer.
>>
>> I’ve had a lot of clients with data warehouses that just weren’t
>> functional for various reasons. I’m researching Hadoop to try and figure
>> out a way to totally eliminate traditional data warehouses. I know all the
>> arguments for keeping them around and I’m not impressed with any of them.
>> I’ve noticed for a while that traditional data storage methods just aren’t
>> up to the task for the things we’re asking data to do these days.
>>
>> I’ve got MOST of it figured out. I know how to store and deliver
>> analytics using all the various tools within the Apache project (and some
>> NOT in the Apache project). What I haven’t figured out is how to do data
>> cleansing or master data management both of which are hard to do if you
>> can’t change anything.
>>
>> So let’s say there is a transactional system. It’s a web application that
>> is the businesses main source of revenue. All the activity of the user on
>> the website is easily structured (so basically we’re not dealing with
>> un-structured data). The nature of the data is financial.
>>
>> The pipeline is fairly straight forward. The data is extracted from the
>> transactional system and placed into a Hadoop environment. From there, it’s
>> exposed by Hive so non technical business analyst with SQL skills can  do
>> what they need to do. Pretty typical right?
>>
>> The problem is the web app is not perfect and occasionally produces junk
>> data. Nothing obvious. It may be a few days before the error is noticed. An
>> example would be phantom invoices. Those invoices get in Hadoop. A few days
>> later an analyst notices that the invoice figures for some period are
>> inflated.
>>
>> Once we identify the offending records there is NO reason for them to
>> remain in the system; it’s meaningless junk data. Those records are of zero
>> value. I encounter this scenario in the real world quite often. In the old
>> world, we would just blow away the offending records. Just write a view to
>> skip over a couple of records or exclude a few dozen doesn’t make much
>> sense. It’s better to just blow these records away, I’m just not certain
>> what the best way to accomplish that is in the new world.
>>
>> Adaryl "Bob" Wakefield, MBA
>> Principal
>> Mass Street Analytics
>> 913.938.6685
>> www.linkedin.com/in/bobwakefieldmba
>> Twitter: @BobLovesData
>>
>>  *From:* Sriram Ramachandrasekaran <sr...@gmail.com>
>> *Sent:* Saturday, August 09, 2014 11:55 PM
>> *To:* user@hadoop.apache.org
>> *Subject:* Re: Data cleansing in modern data architecture
>>
>>  While, I may not have enough context to your entire processing
>> pipeline, here are my thoughts.
>> 1. It's always useful to have raw data, irrespective of if it was right
>> or wrong. The way to look at it is, it's the source of truth at timestamp t.
>> 2. Note that, You only know that the data at timestamp t for an id X was
>> wrong because, subsequent info about X seem to conflict with the one at t
>> or some manual debugging finds it out.
>>
>> All systems that does reporting/analytics is better off by not meddling
>> with the raw data. There should be processed or computed views of this
>> data, that massages it, gets rids of noisy data, merges duplicate entries,
>> etc and then finally produces an output that's suitable for your
>> reports/analytics. So, your idea to write transaction logs to HDFS is
>> fine(unless, you are twisting your systems to get it that way), but, you
>> just need to introduce one more layer of indirection, which has the
>> business logic to handle noise/errors like this.
>>
>> For your specific case, you could've a transaction processor up job which
>> produces a view, that takes care of squashing transactions based on
>> id(something that makes sense in your system) and then handles the business
>> logic of how to handle the bugs/discrepancies in them. Your views could be
>> loaded into a nice columnar store for faster query retrieval(if you have
>> pointed queries - based on a key), else, a different store would be needed.
>> Yes, this has the overhead of running the view creation job, but, I think,
>> the ability to go back to raw data and investigate what happened there is
>> worth it.
>>
>> Your approach of structuring it and storing it in HBase is also fine as
>> long as you keep the concerns separate(if your write/read workloads are
>> poles apart).
>>
>> Hope this helps.
>>
>>
>>
>>
>>
>> On Sun, Aug 10, 2014 at 9:36 AM, Adaryl "Bob" Wakefield, MBA <
>> adaryl.wakefield@hotmail.com> wrote:
>>
>>>   Or...as an alternative, since HBASE uses HDFS to store it’s data, can
>>> we get around the no editing file rule by dropping structured data into
>>> HBASE? That way, we have data in HDFS that can be deleted. Any real problem
>>> with that idea?
>>>
>>> Adaryl "Bob" Wakefield, MBA
>>> Principal
>>> Mass Street Analytics
>>> 913.938.6685
>>> www.linkedin.com/in/bobwakefieldmba
>>> Twitter: @BobLovesData
>>>
>>>  *From:* Adaryl "Bob" Wakefield, MBA <ad...@hotmail.com>
>>> *Sent:* Saturday, August 09, 2014 8:55 PM
>>>  *To:* user@hadoop.apache.org
>>> *Subject:* Re: Data cleansing in modern data architecture
>>>
>>>    Answer: No we can’t get rid of bad records. We have to go back and
>>> rebuild the entire file. We can’t edit records but we can get rid of entire
>>> files right? This would suggest that appending data to files isn’t that
>>> great of an idea. It sounds like it would be more appropriate to cut a
>>> hadoop data load up into periodic files (days, months, etc.) that can
>>> easily be rebuilt should errors occur....
>>>
>>> Adaryl "Bob" Wakefield, MBA
>>> Principal
>>> Mass Street Analytics
>>> 913.938.6685
>>> www.linkedin.com/in/bobwakefieldmba
>>> Twitter: @BobLovesData
>>>
>>>  *From:* Adaryl "Bob" Wakefield, MBA <ad...@hotmail.com>
>>> *Sent:* Saturday, August 09, 2014 4:01 AM
>>> *To:* user@hadoop.apache.org
>>> *Subject:* Re: Data cleansing in modern data architecture
>>>
>>>   I’m sorry but I have to revisit this again. Going through the reply
>>> below I realized that I didn’t quite get my question answered. Let me be
>>> more explicit with the scenario.
>>>
>>> There is a bug in the transactional system.
>>> The data gets written to HDFS where it winds up in Hive.
>>> Somebody notices that their report is off/the numbers don’t look right.
>>> We investigate and find the bug in the transactional system.
>>>
>>> Question: Can we then go back into HDFS and rid ourselves of the bad
>>> records? If not, what is the recommended course of action?
>>>
>>> Adaryl "Bob" Wakefield, MBA
>>> Principal
>>> Mass Street Analytics
>>> 913.938.6685
>>> www.linkedin.com/in/bobwakefieldmba
>>>
>>>  *From:* Shahab Yunus <sh...@gmail.com>
>>> *Sent:* Sunday, July 20, 2014 4:20 PM
>>> *To:* user@hadoop.apache.org
>>> *Subject:* Re: Data cleansing in modern data architecture
>>>
>>>  I am assuming you meant the batch jobs that are/were used in old world
>>> for data cleansing.
>>>
>>> As far as I understand there is no hard and fast rule for it and it
>>> depends functional and system requirements of the usecase.
>>>
>>> It is also dependent on the technology being used and how it manages
>>> 'deletion'.
>>>
>>> E.g. in HBase or Cassandra, you can write batch jobs which clean or
>>> correct or remove unwanted or incorrect data and than the underlying stores
>>> usually have a concept of compaction which not only defragments data files
>>> but also at this point removes from disk all the entries marked as deleted.
>>>
>>> But there are considerations to be aware of given that compaction is a
>>> heavy process and in some cases (e.g. Cassandra) there can be problems when
>>> there are too much data to be removed. Not only that, in some cases,
>>> marked-to-be-deleted data, until it is deleted/compacted can slow down
>>> normal operations of the data store as well.
>>>
>>> One can also leverage in HBase's case the versioning mechanism and the
>>> afore-mentioned batch job can simply overwrite the same row key and the
>>> previous version would no longer be the latest. If max-version parameter is
>>> configured as 1 then no previous version would be maintained (physically it
>>> would be and would be removed at compaction time but would not be
>>> query-able.)
>>>
>>> In the end, basically cleansing can be done after or before loading but
>>> given the append-only and no hard-delete design approaches of most nosql
>>> stores, I would say it would be easier to do cleaning before data is loaded
>>> in the nosql store. Of course, it bears repeating that it depends on the
>>> use case.
>>>
>>> Having said that, on a side-note and a bit off-topic, it reminds me of
>>> the Lamda Architecture that combines batch and real-time computation for
>>> big data using various technologies and it uses the idea of constant
>>> periodic refreshes to reload the data and within this periodic refresh, the
>>> expectations are that any invalid older data would be corrected and
>>> overwritten by the new refresh load. Those basically the 'batch part' of
>>> the LA takes care of data cleansing by reloading everything. But LA is
>>> mostly for thouse systems which are ok with eventually consistent behavior
>>> and might not be suitable for some systems.
>>>
>>> Regards,
>>> Shahab
>>>
>>>
>>> On Sun, Jul 20, 2014 at 2:36 PM, Adaryl "Bob" Wakefield, MBA <
>>> adaryl.wakefield@hotmail.com> wrote:
>>>
>>>>   In the old world, data cleaning used to be a large part of the data
>>>> warehouse load. Now that we’re working in a schemaless environment, I’m not
>>>> sure where data cleansing is supposed to take place. NoSQL sounds fun
>>>> because theoretically you just drop everything in but transactional systems
>>>> that generate the data are still full of bugs and create junk data.
>>>>
>>>> My question is, where does data cleaning/master data management/CDI
>>>> belong in a modern data architecture? Before it hit hits Hadoop? After?
>>>>
>>>> B.
>>>>
>>>
>>>
>>
>>
>>
>> --
>> It's just about how deep your longing is!
>>
>
>
>
> --
> It's just about how deep your longing is!
>

Re: Data cleansing in modern data architecture

Posted by Bertrand Dechoux <de...@gmail.com>.
Well, keeping bad data has its use too. I assume you know about temporal
database.

Back to your use case, if you only need to remove a few records from HDFS
files, the easiest might be during the reading. It is a view, of course,
but it doesn't mean you need to write it back to HDFS. All your data
analysis can be represented as a dataflow. Wether you need to save the data
at a given point of the flow is your call. The tradeoff is of course
between the time gained for later analysis versus the time required to
write the state durably (to HDFS).

Bertrand Dechoux


On Sun, Aug 10, 2014 at 4:32 PM, Sriram Ramachandrasekaran <
sri.rams85@gmail.com> wrote:

> Ok. If you think, the noise levels in the data is going to be so less,
> doing the view creation is probably costly and meaningless.
> HDFS is append-only. So, there's no point writing the transactions as HDFS
> files and trying to perform analytics on top of it directly.
>
> Instead, you could go with HBase, with rowkeys being the keys with which
> you identify and resolve these transactional errors.
> i.e., if you had a faulty transaction data being logged, it would have and
> transaction ID, along with associated data.
> In case, you found out that a particular invoice is offending, you could
> just remove it from HBase using the transaction ID (rowkey).
>
> But, if you want to use the same table for running different reports, it
> might not work out.
> Because, most of the HBase operations depend on the rowkey and in this
> table containing transaction ID as the rowkey
> there's no way your reports would leverage it. So, you need to decide
> which is a costly operation, removing noise or running _more_ adhoc
> reports and decide to use a different table(which is a view again) for
> reports, etc.
>
>
>
>
> On Sun, Aug 10, 2014 at 2:02 PM, Adaryl "Bob" Wakefield, MBA <
> adaryl.wakefield@hotmail.com> wrote:
>
>>   It’s a lot of theory right now so let me give you the full background
>> and see if we can more refine the answer.
>>
>> I’ve had a lot of clients with data warehouses that just weren’t
>> functional for various reasons. I’m researching Hadoop to try and figure
>> out a way to totally eliminate traditional data warehouses. I know all the
>> arguments for keeping them around and I’m not impressed with any of them.
>> I’ve noticed for a while that traditional data storage methods just aren’t
>> up to the task for the things we’re asking data to do these days.
>>
>> I’ve got MOST of it figured out. I know how to store and deliver
>> analytics using all the various tools within the Apache project (and some
>> NOT in the Apache project). What I haven’t figured out is how to do data
>> cleansing or master data management both of which are hard to do if you
>> can’t change anything.
>>
>> So let’s say there is a transactional system. It’s a web application that
>> is the businesses main source of revenue. All the activity of the user on
>> the website is easily structured (so basically we’re not dealing with
>> un-structured data). The nature of the data is financial.
>>
>> The pipeline is fairly straight forward. The data is extracted from the
>> transactional system and placed into a Hadoop environment. From there, it’s
>> exposed by Hive so non technical business analyst with SQL skills can  do
>> what they need to do. Pretty typical right?
>>
>> The problem is the web app is not perfect and occasionally produces junk
>> data. Nothing obvious. It may be a few days before the error is noticed. An
>> example would be phantom invoices. Those invoices get in Hadoop. A few days
>> later an analyst notices that the invoice figures for some period are
>> inflated.
>>
>> Once we identify the offending records there is NO reason for them to
>> remain in the system; it’s meaningless junk data. Those records are of zero
>> value. I encounter this scenario in the real world quite often. In the old
>> world, we would just blow away the offending records. Just write a view to
>> skip over a couple of records or exclude a few dozen doesn’t make much
>> sense. It’s better to just blow these records away, I’m just not certain
>> what the best way to accomplish that is in the new world.
>>
>> Adaryl "Bob" Wakefield, MBA
>> Principal
>> Mass Street Analytics
>> 913.938.6685
>> www.linkedin.com/in/bobwakefieldmba
>> Twitter: @BobLovesData
>>
>>  *From:* Sriram Ramachandrasekaran <sr...@gmail.com>
>> *Sent:* Saturday, August 09, 2014 11:55 PM
>> *To:* user@hadoop.apache.org
>> *Subject:* Re: Data cleansing in modern data architecture
>>
>>  While, I may not have enough context to your entire processing
>> pipeline, here are my thoughts.
>> 1. It's always useful to have raw data, irrespective of if it was right
>> or wrong. The way to look at it is, it's the source of truth at timestamp t.
>> 2. Note that, You only know that the data at timestamp t for an id X was
>> wrong because, subsequent info about X seem to conflict with the one at t
>> or some manual debugging finds it out.
>>
>> All systems that does reporting/analytics is better off by not meddling
>> with the raw data. There should be processed or computed views of this
>> data, that massages it, gets rids of noisy data, merges duplicate entries,
>> etc and then finally produces an output that's suitable for your
>> reports/analytics. So, your idea to write transaction logs to HDFS is
>> fine(unless, you are twisting your systems to get it that way), but, you
>> just need to introduce one more layer of indirection, which has the
>> business logic to handle noise/errors like this.
>>
>> For your specific case, you could've a transaction processor up job which
>> produces a view, that takes care of squashing transactions based on
>> id(something that makes sense in your system) and then handles the business
>> logic of how to handle the bugs/discrepancies in them. Your views could be
>> loaded into a nice columnar store for faster query retrieval(if you have
>> pointed queries - based on a key), else, a different store would be needed.
>> Yes, this has the overhead of running the view creation job, but, I think,
>> the ability to go back to raw data and investigate what happened there is
>> worth it.
>>
>> Your approach of structuring it and storing it in HBase is also fine as
>> long as you keep the concerns separate(if your write/read workloads are
>> poles apart).
>>
>> Hope this helps.
>>
>>
>>
>>
>>
>> On Sun, Aug 10, 2014 at 9:36 AM, Adaryl "Bob" Wakefield, MBA <
>> adaryl.wakefield@hotmail.com> wrote:
>>
>>>   Or...as an alternative, since HBASE uses HDFS to store it’s data, can
>>> we get around the no editing file rule by dropping structured data into
>>> HBASE? That way, we have data in HDFS that can be deleted. Any real problem
>>> with that idea?
>>>
>>> Adaryl "Bob" Wakefield, MBA
>>> Principal
>>> Mass Street Analytics
>>> 913.938.6685
>>> www.linkedin.com/in/bobwakefieldmba
>>> Twitter: @BobLovesData
>>>
>>>  *From:* Adaryl "Bob" Wakefield, MBA <ad...@hotmail.com>
>>> *Sent:* Saturday, August 09, 2014 8:55 PM
>>>  *To:* user@hadoop.apache.org
>>> *Subject:* Re: Data cleansing in modern data architecture
>>>
>>>    Answer: No we can’t get rid of bad records. We have to go back and
>>> rebuild the entire file. We can’t edit records but we can get rid of entire
>>> files right? This would suggest that appending data to files isn’t that
>>> great of an idea. It sounds like it would be more appropriate to cut a
>>> hadoop data load up into periodic files (days, months, etc.) that can
>>> easily be rebuilt should errors occur....
>>>
>>> Adaryl "Bob" Wakefield, MBA
>>> Principal
>>> Mass Street Analytics
>>> 913.938.6685
>>> www.linkedin.com/in/bobwakefieldmba
>>> Twitter: @BobLovesData
>>>
>>>  *From:* Adaryl "Bob" Wakefield, MBA <ad...@hotmail.com>
>>> *Sent:* Saturday, August 09, 2014 4:01 AM
>>> *To:* user@hadoop.apache.org
>>> *Subject:* Re: Data cleansing in modern data architecture
>>>
>>>   I’m sorry but I have to revisit this again. Going through the reply
>>> below I realized that I didn’t quite get my question answered. Let me be
>>> more explicit with the scenario.
>>>
>>> There is a bug in the transactional system.
>>> The data gets written to HDFS where it winds up in Hive.
>>> Somebody notices that their report is off/the numbers don’t look right.
>>> We investigate and find the bug in the transactional system.
>>>
>>> Question: Can we then go back into HDFS and rid ourselves of the bad
>>> records? If not, what is the recommended course of action?
>>>
>>> Adaryl "Bob" Wakefield, MBA
>>> Principal
>>> Mass Street Analytics
>>> 913.938.6685
>>> www.linkedin.com/in/bobwakefieldmba
>>>
>>>  *From:* Shahab Yunus <sh...@gmail.com>
>>> *Sent:* Sunday, July 20, 2014 4:20 PM
>>> *To:* user@hadoop.apache.org
>>> *Subject:* Re: Data cleansing in modern data architecture
>>>
>>>  I am assuming you meant the batch jobs that are/were used in old world
>>> for data cleansing.
>>>
>>> As far as I understand there is no hard and fast rule for it and it
>>> depends functional and system requirements of the usecase.
>>>
>>> It is also dependent on the technology being used and how it manages
>>> 'deletion'.
>>>
>>> E.g. in HBase or Cassandra, you can write batch jobs which clean or
>>> correct or remove unwanted or incorrect data and than the underlying stores
>>> usually have a concept of compaction which not only defragments data files
>>> but also at this point removes from disk all the entries marked as deleted.
>>>
>>> But there are considerations to be aware of given that compaction is a
>>> heavy process and in some cases (e.g. Cassandra) there can be problems when
>>> there are too much data to be removed. Not only that, in some cases,
>>> marked-to-be-deleted data, until it is deleted/compacted can slow down
>>> normal operations of the data store as well.
>>>
>>> One can also leverage in HBase's case the versioning mechanism and the
>>> afore-mentioned batch job can simply overwrite the same row key and the
>>> previous version would no longer be the latest. If max-version parameter is
>>> configured as 1 then no previous version would be maintained (physically it
>>> would be and would be removed at compaction time but would not be
>>> query-able.)
>>>
>>> In the end, basically cleansing can be done after or before loading but
>>> given the append-only and no hard-delete design approaches of most nosql
>>> stores, I would say it would be easier to do cleaning before data is loaded
>>> in the nosql store. Of course, it bears repeating that it depends on the
>>> use case.
>>>
>>> Having said that, on a side-note and a bit off-topic, it reminds me of
>>> the Lamda Architecture that combines batch and real-time computation for
>>> big data using various technologies and it uses the idea of constant
>>> periodic refreshes to reload the data and within this periodic refresh, the
>>> expectations are that any invalid older data would be corrected and
>>> overwritten by the new refresh load. Those basically the 'batch part' of
>>> the LA takes care of data cleansing by reloading everything. But LA is
>>> mostly for thouse systems which are ok with eventually consistent behavior
>>> and might not be suitable for some systems.
>>>
>>> Regards,
>>> Shahab
>>>
>>>
>>> On Sun, Jul 20, 2014 at 2:36 PM, Adaryl "Bob" Wakefield, MBA <
>>> adaryl.wakefield@hotmail.com> wrote:
>>>
>>>>   In the old world, data cleaning used to be a large part of the data
>>>> warehouse load. Now that we’re working in a schemaless environment, I’m not
>>>> sure where data cleansing is supposed to take place. NoSQL sounds fun
>>>> because theoretically you just drop everything in but transactional systems
>>>> that generate the data are still full of bugs and create junk data.
>>>>
>>>> My question is, where does data cleaning/master data management/CDI
>>>> belong in a modern data architecture? Before it hits Hadoop? After?
>>>>
>>>> B.
>>>>
>>>
>>>
>>
>>
>>
>> --
>> It's just about how deep your longing is!
>>
>
>
>
> --
> It's just about how deep your longing is!
>

Re: Data cleansing in modern data architecture

Posted by Bertrand Dechoux <de...@gmail.com>.
Well, keeping bad data has its use too. I assume you know about temporal
databases.

Back to your use case, if you only need to remove a few records from HDFS
files, the easiest might be to do it at read time. It is a view, of course,
but it doesn't mean you need to write it back to HDFS. All your data
analysis can be represented as a dataflow. Whether you need to save the data
at a given point of the flow is your call. The tradeoff is of course
between the time gained for later analysis versus the time required to
write the state durably (to HDFS).
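
As a rough sketch of that read-time filtering, assuming a raw_invoices Hive
table and a couple of known-bad invoice ids (all names here are hypothetical),
a view created over HiveServer2 via JDBC could look like this; the raw files in
HDFS stay untouched, the view simply skips the offending records:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class CreateCleanInvoiceView {
    public static void main(String[] args) throws Exception {
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        // Host, database, table and column names are assumptions for the example.
        Connection conn = DriverManager.getConnection(
                "jdbc:hive2://hiveserver:10000/default", "analyst", "");
        Statement stmt = conn.createStatement();
        // Analysts query clean_invoices; the bad records are filtered out at read time.
        stmt.execute(
            "CREATE VIEW IF NOT EXISTS clean_invoices AS "
          + "SELECT * FROM raw_invoices "
          + "WHERE invoice_id NOT IN ('INV-2014-08-0042', 'INV-2014-08-0043')");
        stmt.close();
        conn.close();
    }
}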

Bertrand Dechoux


On Sun, Aug 10, 2014 at 4:32 PM, Sriram Ramachandrasekaran <
sri.rams85@gmail.com> wrote:

> Ok. If you think the noise levels in the data are going to be that low,
> doing the view creation is probably costly and meaningless.
> HDFS is append-only. So, there's no point writing the transactions as HDFS
> files and trying to perform analytics on top of it directly.
>
> Instead, you could go with HBase, with rowkeys being the keys with which
> you identify and resolve these transactional errors.
> i.e., if you had faulty transaction data being logged, it would have a
> transaction ID, along with associated data.
> In case, you found out that a particular invoice is offending, you could
> just remove it from HBase using the transaction ID (rowkey).
>
> But, if you want to use the same table for running different reports, it
> might not work out.
> Because most of the HBase operations depend on the rowkey, and in this
> table, with the transaction ID as the rowkey,
> there's no way your reports would leverage it. So, you need to decide
> which is the costlier operation, removing noise or running _more_ ad hoc
> reports, and decide to use a different table (which is a view again) for
> reports, etc.
>
>
>
>
> On Sun, Aug 10, 2014 at 2:02 PM, Adaryl "Bob" Wakefield, MBA <
> adaryl.wakefield@hotmail.com> wrote:
>
>>   It’s a lot of theory right now so let me give you the full background
>> and see if we can refine the answer further.
>>
>> I’ve had a lot of clients with data warehouses that just weren’t
>> functional for various reasons. I’m researching Hadoop to try and figure
>> out a way to totally eliminate traditional data warehouses. I know all the
>> arguments for keeping them around and I’m not impressed with any of them.
>> I’ve noticed for a while that traditional data storage methods just aren’t
>> up to the task for the things we’re asking data to do these days.
>>
>> I’ve got MOST of it figured out. I know how to store and deliver
>> analytics using all the various tools within the Apache project (and some
>> NOT in the Apache project). What I haven’t figured out is how to do data
>> cleansing or master data management both of which are hard to do if you
>> can’t change anything.
>>
>> So let’s say there is a transactional system. It’s a web application that
>> is the business’s main source of revenue. All the activity of the user on
>> the website is easily structured (so basically we’re not dealing with
>> un-structured data). The nature of the data is financial.
>>
>> The pipeline is fairly straightforward. The data is extracted from the
>> transactional system and placed into a Hadoop environment. From there, it’s
>> exposed by Hive so non-technical business analysts with SQL skills can do
>> what they need to do. Pretty typical, right?
>>
>> The problem is the web app is not perfect and occasionally produces junk
>> data. Nothing obvious. It may be a few days before the error is noticed. An
>> example would be phantom invoices. Those invoices get in Hadoop. A few days
>> later an analyst notices that the invoice figures for some period are
>> inflated.
>>
>> Once we identify the offending records there is NO reason for them to
>> remain in the system; it’s meaningless junk data. Those records are of zero
>> value. I encounter this scenario in the real world quite often. In the old
>> world, we would just blow away the offending records. Just writing a view to
>> skip over a couple of records or exclude a few dozen doesn’t make much
>> sense. It’s better to just blow these records away; I’m just not certain
>> what the best way to accomplish that is in the new world.
>>
>> Adaryl "Bob" Wakefield, MBA
>> Principal
>> Mass Street Analytics
>> 913.938.6685
>> www.linkedin.com/in/bobwakefieldmba
>> Twitter: @BobLovesData
>>
>>  *From:* Sriram Ramachandrasekaran <sr...@gmail.com>
>> *Sent:* Saturday, August 09, 2014 11:55 PM
>> *To:* user@hadoop.apache.org
>> *Subject:* Re: Data cleansing in modern data architecture
>>
>>  While I may not have enough context on your entire processing
>> pipeline, here are my thoughts.
>> 1. It's always useful to have raw data, irrespective of whether it was right
>> or wrong. The way to look at it is, it's the source of truth at timestamp t.
>> 2. Note that you only know that the data at timestamp t for an id X was
>> wrong because subsequent info about X seems to conflict with the one at t
>> or some manual debugging finds it out.
>>
>> All systems that do reporting/analytics are better off not meddling
>> with the raw data. There should be processed or computed views of this
>> data that massage it, get rid of noisy data, merge duplicate entries,
>> etc., and then finally produce an output that's suitable for your
>> reports/analytics. So, your idea to write transaction logs to HDFS is
>> fine (unless you are twisting your systems to get it that way), but you
>> just need to introduce one more layer of indirection, which has the
>> business logic to handle noise/errors like this.
>>
>> For your specific case, you could have a transaction-processing job which
>> produces a view that takes care of squashing transactions based on
>> id (something that makes sense in your system) and then handles the business
>> logic of dealing with the bugs/discrepancies in them. Your views could be
>> loaded into a nice columnar store for faster query retrieval (if you have
>> pointed queries - based on a key); else, a different store would be needed.
>> Yes, this has the overhead of running the view creation job, but, I think,
>> the ability to go back to raw data and investigate what happened there is
>> worth it.
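
The squashing itself can be quite small; a plain-Java sketch of the core idea
(the record fields and the bad-id set are assumptions, and in practice this
logic would sit inside the batch job that builds the view):

import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class TransactionSquasher {
    public static class Record {
        final String txnId;
        final long timestamp;
        final String payload;
        Record(String txnId, long timestamp, String payload) {
            this.txnId = txnId;
            this.timestamp = timestamp;
            this.payload = payload;
        }
    }

    // Keep only the latest record per transaction id and drop ids known to be bad.
    public static Map<String, Record> squash(List<Record> rawLog, Set<String> badIds) {
        Map<String, Record> latestById = new HashMap<String, Record>();
        for (Record r : rawLog) {
            if (badIds.contains(r.txnId)) {
                continue; // business rule: skip known-bad transactions
            }
            Record seen = latestById.get(r.txnId);
            if (seen == null || r.timestamp > seen.timestamp) {
                latestById.put(r.txnId, r); // most recent version of each id wins
            }
        }
        return latestById;
    }
}
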
>>
>> Your approach of structuring it and storing it in HBase is also fine as
>> long as you keep the concerns separate (if your write/read workloads are
>> poles apart).
>>
>> Hope this helps.
>>
>>
>>
>>
>>
>> On Sun, Aug 10, 2014 at 9:36 AM, Adaryl "Bob" Wakefield, MBA <
>> adaryl.wakefield@hotmail.com> wrote:
>>
>>>   Or...as an alternative, since HBASE uses HDFS to store its data, can
>>> we get around the no-editing-files rule by dropping structured data into
>>> HBASE? That way, we have data in HDFS that can be deleted. Any real problem
>>> with that idea?
>>>
>>> Adaryl "Bob" Wakefield, MBA
>>> Principal
>>> Mass Street Analytics
>>> 913.938.6685
>>> www.linkedin.com/in/bobwakefieldmba
>>> Twitter: @BobLovesData
>>>
>>>  *From:* Adaryl "Bob" Wakefield, MBA <ad...@hotmail.com>
>>> *Sent:* Saturday, August 09, 2014 8:55 PM
>>>  *To:* user@hadoop.apache.org
>>> *Subject:* Re: Data cleansing in modern data architecture
>>>
>>>    Answer: No, we can’t get rid of bad records. We have to go back and
>>> rebuild the entire file. We can’t edit records but we can get rid of entire
>>> files, right? This would suggest that appending data to files isn’t that
>>> great of an idea. It sounds like it would be more appropriate to cut a
>>> Hadoop data load up into periodic files (days, months, etc.) that can
>>> easily be rebuilt should errors occur....
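
For what it's worth, a tiny sketch of that periodic-file idea with the HDFS Java
API, assuming one directory per load date (the path is hypothetical); the day's
load job can then be re-run to regenerate the directory without the bad records:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class RebuildDailyLoad {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        // Records inside a file can't be edited, but a whole day's directory can be
        // dropped (recursively) and rebuilt by re-running that day's load.
        fs.delete(new Path("/data/invoices/load_date=2014-08-01"), true);
        fs.close();
    }
}
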
>>>
>>> Adaryl "Bob" Wakefield, MBA
>>> Principal
>>> Mass Street Analytics
>>> 913.938.6685
>>> www.linkedin.com/in/bobwakefieldmba
>>> Twitter: @BobLovesData
>>>
>>>  *From:* Adaryl "Bob" Wakefield, MBA <ad...@hotmail.com>
>>> *Sent:* Saturday, August 09, 2014 4:01 AM
>>> *To:* user@hadoop.apache.org
>>> *Subject:* Re: Data cleansing in modern data architecture
>>>
>>>   I’m sorry but I have to revisit this again. Going through the reply
>>> below I realized that I didn’t quite get my question answered. Let me be
>>> more explicit with the scenario.
>>>
>>> There is a bug in the transactional system.
>>> The data gets written to HDFS where it winds up in Hive.
>>> Somebody notices that their report is off/the numbers don’t look right.
>>> We investigate and find the bug in the transactional system.
>>>
>>> Question: Can we then go back into HDFS and rid ourselves of the bad
>>> records? If not, what is the recommended course of action?
>>>
>>> Adaryl "Bob" Wakefield, MBA
>>> Principal
>>> Mass Street Analytics
>>> 913.938.6685
>>> www.linkedin.com/in/bobwakefieldmba
>>>
>>>  *From:* Shahab Yunus <sh...@gmail.com>
>>> *Sent:* Sunday, July 20, 2014 4:20 PM
>>> *To:* user@hadoop.apache.org
>>> *Subject:* Re: Data cleansing in modern data architecture
>>>
>>>  I am assuming you meant the batch jobs that are/were used in the old world
>>> for data cleansing.
>>>
>>> As far as I understand there is no hard and fast rule for it and it
>>> depends on the functional and system requirements of the use case.
>>>
>>> It is also dependent on the technology being used and how it manages
>>> 'deletion'.
>>>
>>> E.g. in HBase or Cassandra, you can write batch jobs which clean or
>>> correct or remove unwanted or incorrect data and then the underlying stores
>>> usually have a concept of compaction which not only defragments data files
>>> but also at this point removes from disk all the entries marked as deleted.
>>>
>>> But there are considerations to be aware of given that compaction is a
>>> heavy process and in some cases (e.g. Cassandra) there can be problems when
>>> there is too much data to be removed. Not only that, in some cases,
>>> marked-to-be-deleted data, until it is deleted/compacted can slow down
>>> normal operations of the data store as well.
>>>
>>> One can also leverage in HBase's case the versioning mechanism and the
>>> afore-mentioned batch job can simply overwrite the same row key and the
>>> previous version would no longer be the latest. If max-version parameter is
>>> configured as 1 then no previous version would be maintained (physically it
>>> would be and would be removed at compaction time but would not be
>>> query-able.)
>>>
>>> In the end, basically cleansing can be done after or before loading but
>>> given the append-only and no hard-delete design approaches of most nosql
>>> stores, I would say it would be easier to do cleaning before data is loaded
>>> in the nosql store. Of course, it bears repeating that it depends on the
>>> use case.
>>>
>>> Having said that, on a side-note and a bit off-topic, it reminds me of
>>> the Lambda Architecture that combines batch and real-time computation for
>>> big data using various technologies and it uses the idea of constant
>>> periodic refreshes to reload the data and within this periodic refresh, the
>>> expectations are that any invalid older data would be corrected and
>>> overwritten by the new refresh load. Thus basically the 'batch part' of
>>> the LA takes care of data cleansing by reloading everything. But LA is
>>> mostly for those systems which are ok with eventually consistent behavior
>>> and might not be suitable for some systems.
>>>
>>> Regards,
>>> Shahab
>>>
>>>
>>> On Sun, Jul 20, 2014 at 2:36 PM, Adaryl "Bob" Wakefield, MBA <
>>> adaryl.wakefield@hotmail.com> wrote:
>>>
>>>>   In the old world, data cleaning used to be a large part of the data
>>>> warehouse load. Now that we’re working in a schemaless environment, I’m not
>>>> sure where data cleansing is supposed to take place. NoSQL sounds fun
>>>> because theoretically you just drop everything in but transactional systems
>>>> that generate the data are still full of bugs and create junk data.
>>>>
>>>> My question is, where does data cleaning/master data management/CDI
>>>> belong in a modern data architecture? Before it hits Hadoop? After?
>>>>
>>>> B.
>>>>
>>>
>>>
>>
>>
>>
>> --
>> It's just about how deep your longing is!
>>
>
>
>
> --
> It's just about how deep your longing is!
>

Re: Data cleansing in modern data architecture

Posted by Sriram Ramachandrasekaran <sr...@gmail.com>.
Ok. If you think the noise levels in the data are going to be that low,
doing the view creation is probably costly and meaningless.
HDFS is append-only. So, there's no point writing the transactions as HDFS
files and trying to perform analytics on top of it directly.

Instead, you could go with HBase, with rowkeys being the keys with which
you identify and resolve these transactional errors.
i.e., if you had faulty transaction data being logged, it would have a
transaction ID, along with associated data.
In case, you found out that a particular invoice is offending, you could
just remove it from HBase using the transaction ID (rowkey).

But, if you want to use the same table for running different reports, it
might not work out.
Because most of the HBase operations depend on the rowkey, and in this
table, with the transaction ID as the rowkey,
there's no way your reports would leverage it. So, you need to decide which
is the costlier operation, removing noise or running _more_ ad hoc
reports, and decide to use a different table (which is a view again) for
reports, etc.
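
A minimal sketch of that removal with the older HBase Java client (the table
name and transaction ID below are made up):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Delete;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.util.Bytes;

public class RemoveFaultyTransaction {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "transactions"); // rowkey = transaction ID
        try {
            // Tombstones the whole row; it physically disappears at the next major compaction.
            table.delete(new Delete(Bytes.toBytes("TXN-000123456")));
        } finally {
            table.close();
        }
    }
}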




On Sun, Aug 10, 2014 at 2:02 PM, Adaryl "Bob" Wakefield, MBA <
adaryl.wakefield@hotmail.com> wrote:

>   It’s a lot of theory right now so let me give you the full background
> and see if we can refine the answer further.
>
> I’ve had a lot of clients with data warehouses that just weren’t
> functional for various reasons. I’m researching Hadoop to try and figure
> out a way to totally eliminate traditional data warehouses. I know all the
> arguments for keeping them around and I’m not impressed with any of them.
> I’ve noticed for a while that traditional data storage methods just aren’t
> up to the task for the things we’re asking data to do these days.
>
> I’ve got MOST of it figured out. I know how to store and deliver analytics
> using all the various tools within the Apache project (and some NOT in the
> Apache project). What I haven’t figured out is how to do data cleansing or
> master data management both of which are hard to do if you can’t change
> anything.
>
> So let’s say there is a transactional system. It’s a web application that
> is the business’s main source of revenue. All the activity of the user on
> the website is easily structured (so basically we’re not dealing with
> un-structured data). The nature of the data is financial.
>
> The pipeline is fairly straightforward. The data is extracted from the
> transactional system and placed into a Hadoop environment. From there, it’s
> exposed by Hive so non-technical business analysts with SQL skills can do
> what they need to do. Pretty typical, right?
>
> The problem is the web app is not perfect and occasionally produces junk
> data. Nothing obvious. It may be a few days before the error is noticed. An
> example would be phantom invoices. Those invoices get in Hadoop. A few days
> later an analyst notices that the invoice figures for some period are
> inflated.
>
> Once we identify the offending records there is NO reason for them to
> remain in the system; it’s meaningless junk data. Those records are of zero
> value. I encounter this scenario in the real world quite often. In the old
> world, we would just blow away the offending records. Just writing a view to
> skip over a couple of records or exclude a few dozen doesn’t make much
> sense. It’s better to just blow these records away; I’m just not certain
> what the best way to accomplish that is in the new world.
>
> Adaryl "Bob" Wakefield, MBA
> Principal
> Mass Street Analytics
> 913.938.6685
> www.linkedin.com/in/bobwakefieldmba
> Twitter: @BobLovesData
>
>  *From:* Sriram Ramachandrasekaran <sr...@gmail.com>
> *Sent:* Saturday, August 09, 2014 11:55 PM
> *To:* user@hadoop.apache.org
> *Subject:* Re: Data cleansing in modern data architecture
>
>  While I may not have enough context on your entire processing pipeline,
> here are my thoughts.
> 1. It's always useful to have raw data, irrespective of whether it was right or
> wrong. The way to look at it is, it's the source of truth at timestamp t.
> 2. Note that you only know that the data at timestamp t for an id X was
> wrong because subsequent info about X seems to conflict with the one at t
> or some manual debugging finds it out.
>
> All systems that do reporting/analytics are better off not meddling
> with the raw data. There should be processed or computed views of this
> data that massage it, get rid of noisy data, merge duplicate entries,
> etc., and then finally produce an output that's suitable for your
> reports/analytics. So, your idea to write transaction logs to HDFS is
> fine (unless you are twisting your systems to get it that way), but you
> just need to introduce one more layer of indirection, which has the
> business logic to handle noise/errors like this.
>
> For your specific case, you could have a transaction-processing job which
> produces a view that takes care of squashing transactions based on
> id (something that makes sense in your system) and then handles the business
> logic of dealing with the bugs/discrepancies in them. Your views could be
> loaded into a nice columnar store for faster query retrieval (if you have
> pointed queries - based on a key); else, a different store would be needed.
> Yes, this has the overhead of running the view creation job, but, I think,
> the ability to go back to raw data and investigate what happened there is
> worth it.
>
> Your approach of structuring it and storing it in HBase is also fine as
> long as you keep the concerns separate (if your write/read workloads are
> poles apart).
>
> Hope this helps.
>
>
>
>
>
> On Sun, Aug 10, 2014 at 9:36 AM, Adaryl "Bob" Wakefield, MBA <
> adaryl.wakefield@hotmail.com> wrote:
>
>>   Or...as an alternative, since HBASE uses HDFS to store its data, can
>> we get around the no-editing-files rule by dropping structured data into
>> HBASE? That way, we have data in HDFS that can be deleted. Any real problem
>> with that idea?
>>
>> Adaryl "Bob" Wakefield, MBA
>> Principal
>> Mass Street Analytics
>> 913.938.6685
>> www.linkedin.com/in/bobwakefieldmba
>> Twitter: @BobLovesData
>>
>>  *From:* Adaryl "Bob" Wakefield, MBA <ad...@hotmail.com>
>> *Sent:* Saturday, August 09, 2014 8:55 PM
>>  *To:* user@hadoop.apache.org
>> *Subject:* Re: Data cleansing in modern data architecture
>>
>>    Answer: No, we can’t get rid of bad records. We have to go back and
>> rebuild the entire file. We can’t edit records but we can get rid of entire
>> files, right? This would suggest that appending data to files isn’t that
>> great of an idea. It sounds like it would be more appropriate to cut a
>> Hadoop data load up into periodic files (days, months, etc.) that can
>> easily be rebuilt should errors occur....
>>
>> Adaryl "Bob" Wakefield, MBA
>> Principal
>> Mass Street Analytics
>> 913.938.6685
>> www.linkedin.com/in/bobwakefieldmba
>> Twitter: @BobLovesData
>>
>>  *From:* Adaryl "Bob" Wakefield, MBA <ad...@hotmail.com>
>> *Sent:* Saturday, August 09, 2014 4:01 AM
>> *To:* user@hadoop.apache.org
>> *Subject:* Re: Data cleansing in modern data architecture
>>
>>   I’m sorry but I have to revisit this again. Going through the reply
>> below I realized that I didn’t quite get my question answered. Let me be
>> more explicit with the scenario.
>>
>> There is a bug in the transactional system.
>> The data gets written to HDFS where it winds up in Hive.
>> Somebody notices that their report is off/the numbers don’t look right.
>> We investigate and find the bug in the transactional system.
>>
>> Question: Can we then go back into HDFS and rid ourselves of the bad
>> records? If not, what is the recommended course of action?
>>
>> Adaryl "Bob" Wakefield, MBA
>> Principal
>> Mass Street Analytics
>> 913.938.6685
>> www.linkedin.com/in/bobwakefieldmba
>>
>>  *From:* Shahab Yunus <sh...@gmail.com>
>> *Sent:* Sunday, July 20, 2014 4:20 PM
>> *To:* user@hadoop.apache.org
>> *Subject:* Re: Data cleansing in modern data architecture
>>
>>  I am assuming you meant the batch jobs that are/were used in the old world
>> for data cleansing.
>>
>> As far as I understand there is no hard and fast rule for it and it
>> depends on the functional and system requirements of the use case.
>>
>> It is also dependent on the technology being used and how it manages
>> 'deletion'.
>>
>> E.g. in HBase or Cassandra, you can write batch jobs which clean or
>> correct or remove unwanted or incorrect data and then the underlying stores
>> usually have a concept of compaction which not only defragments data files
>> but also at this point removes from disk all the entries marked as deleted.
>>
>> But there are considerations to be aware of given that compaction is a
>> heavy process and in some cases (e.g. Cassandra) there can be problems when
>> there is too much data to be removed. Not only that, in some cases,
>> marked-to-be-deleted data, until it is deleted/compacted can slow down
>> normal operations of the data store as well.
>>
>> One can also leverage in HBase's case the versioning mechanism and the
>> afore-mentioned batch job can simply overwrite the same row key and the
>> previous version would no longer be the latest. If max-version parameter is
>> configured as 1 then no previous version would be maintained (physically it
>> would be and would be removed at compaction time but would not be
>> query-able.)
>>
>> In the end, basically cleansing can be done after or before loading but
>> given the append-only and no hard-delete design approaches of most nosql
>> stores, I would say it would be easier to do cleaning before data is loaded
>> in the nosql store. Of course, it bears repeating that it depends on the
>> use case.
>>
>> Having said that, on a side-note and a bit off-topic, it reminds me of
>> the Lambda Architecture that combines batch and real-time computation for
>> big data using various technologies and it uses the idea of constant
>> periodic refreshes to reload the data and within this periodic refresh, the
>> expectations are that any invalid older data would be corrected and
>> overwritten by the new refresh load. Thus basically the 'batch part' of
>> the LA takes care of data cleansing by reloading everything. But LA is
>> mostly for those systems which are ok with eventually consistent behavior
>> and might not be suitable for some systems.
>>
>> Regards,
>> Shahab
>>
>>
>> On Sun, Jul 20, 2014 at 2:36 PM, Adaryl "Bob" Wakefield, MBA <
>> adaryl.wakefield@hotmail.com> wrote:
>>
>>>   In the old world, data cleaning used to be a large part of the data
>>> warehouse load. Now that we’re working in a schemaless environment, I’m not
>>> sure where data cleansing is supposed to take place. NoSQL sounds fun
>>> because theoretically you just drop everything in but transactional systems
>>> that generate the data are still full of bugs and create junk data.
>>>
>>> My question is, where does data cleaning/master data management/CDI
>>> belong in a modern data architecture? Before it hits Hadoop? After?
>>>
>>> B.
>>>
>>
>>
>
>
>
> --
> It's just about how deep your longing is!
>



-- 
It's just about how deep your longing is!

Re: Data cleansing in modern data architecture

Posted by Sriram Ramachandrasekaran <sr...@gmail.com>.
Ok. If you think, the noise levels in the data is going to be so less,
doing the view creation is probably costly and meaningless.
HDFS is append-only. So, there's no point writing the transactions as HDFS
files and trying to perform analytics on top of it directly.

Instead, you could go with HBase, with rowkeys being the keys with which
you identify and resolve these transactional errors.
i.e., if you had a faulty transaction data being logged, it would have and
transaction ID, along with associated data.
In case, you found out that a particular invoice is offending, you could
just remove it from HBase using the transaction ID (rowkey).

But, if you want to use the same table for running different reports, it
might not work out.
Because, most of the HBase operations depend on the rowkey and in this
table containing transaction ID as the rowkey
there's no way your reports would leverage it. So, you need to decide which
is a costly operation, removing noise or running _more_ adhoc
reports and decide to use a different table(which is a view again) for
reports, etc.




On Sun, Aug 10, 2014 at 2:02 PM, Adaryl "Bob" Wakefield, MBA <
adaryl.wakefield@hotmail.com> wrote:

>   It’s a lot of theory right now so let me give you the full background
> and see if we can more refine the answer.
>
> I’ve had a lot of clients with data warehouses that just weren’t
> functional for various reasons. I’m researching Hadoop to try and figure
> out a way to totally eliminate traditional data warehouses. I know all the
> arguments for keeping them around and I’m not impressed with any of them.
> I’ve noticed for a while that traditional data storage methods just aren’t
> up to the task for the things we’re asking data to do these days.
>
> I’ve got MOST of it figured out. I know how to store and deliver analytics
> using all the various tools within the Apache project (and some NOT in the
> Apache project). What I haven’t figured out is how to do data cleansing or
> master data management both of which are hard to do if you can’t change
> anything.
>
> So let’s say there is a transactional system. It’s a web application that
> is the businesses main source of revenue. All the activity of the user on
> the website is easily structured (so basically we’re not dealing with
> un-structured data). The nature of the data is financial.
>
> The pipeline is fairly straight forward. The data is extracted from the
> transactional system and placed into a Hadoop environment. From there, it’s
> exposed by Hive so non technical business analyst with SQL skills can  do
> what they need to do. Pretty typical right?
>
> The problem is the web app is not perfect and occasionally produces junk
> data. Nothing obvious. It may be a few days before the error is noticed. An
> example would be phantom invoices. Those invoices get in Hadoop. A few days
> later an analyst notices that the invoice figures for some period are
> inflated.
>
> Once we identify the offending records there is NO reason for them to
> remain in the system; it’s meaningless junk data. Those records are of zero
> value. I encounter this scenario in the real world quite often. In the old
> world, we would just blow away the offending records. Just write a view to
> skip over a couple of records or exclude a few dozen doesn’t make much
> sense. It’s better to just blow these records away, I’m just not certain
> what the best way to accomplish that is in the new world.
>
> Adaryl "Bob" Wakefield, MBA
> Principal
> Mass Street Analytics
> 913.938.6685
> www.linkedin.com/in/bobwakefieldmba
> Twitter: @BobLovesData
>
>  *From:* Sriram Ramachandrasekaran <sr...@gmail.com>
> *Sent:* Saturday, August 09, 2014 11:55 PM
> *To:* user@hadoop.apache.org
> *Subject:* Re: Data cleansing in modern data architecture
>
>  While, I may not have enough context to your entire processing pipeline,
> here are my thoughts.
> 1. It's always useful to have raw data, irrespective of if it was right or
> wrong. The way to look at it is, it's the source of truth at timestamp t.
> 2. Note that, You only know that the data at timestamp t for an id X was
> wrong because, subsequent info about X seem to conflict with the one at t
> or some manual debugging finds it out.
>
> All systems that does reporting/analytics is better off by not meddling
> with the raw data. There should be processed or computed views of this
> data, that massages it, gets rids of noisy data, merges duplicate entries,
> etc and then finally produces an output that's suitable for your
> reports/analytics. So, your idea to write transaction logs to HDFS is
> fine(unless, you are twisting your systems to get it that way), but, you
> just need to introduce one more layer of indirection, which has the
> business logic to handle noise/errors like this.
>
> For your specific case, you could've a transaction processor up job which
> produces a view, that takes care of squashing transactions based on
> id(something that makes sense in your system) and then handles the business
> logic of how to handle the bugs/discrepancies in them. Your views could be
> loaded into a nice columnar store for faster query retrieval(if you have
> pointed queries - based on a key), else, a different store would be needed.
> Yes, this has the overhead of running the view creation job, but, I think,
> the ability to go back to raw data and investigate what happened there is
> worth it.
>
> Your approach of structuring it and storing it in HBase is also fine as
> long as you keep the concerns separate(if your write/read workloads are
> poles apart).
>
> Hope this helps.
>
>
>
>
>
> On Sun, Aug 10, 2014 at 9:36 AM, Adaryl "Bob" Wakefield, MBA <
> adaryl.wakefield@hotmail.com> wrote:
>
>>   Or...as an alternative, since HBASE uses HDFS to store it’s data, can
>> we get around the no editing file rule by dropping structured data into
>> HBASE? That way, we have data in HDFS that can be deleted. Any real problem
>> with that idea?
>>
>> Adaryl "Bob" Wakefield, MBA
>> Principal
>> Mass Street Analytics
>> 913.938.6685
>> www.linkedin.com/in/bobwakefieldmba
>> Twitter: @BobLovesData
>>
>>  *From:* Adaryl "Bob" Wakefield, MBA <ad...@hotmail.com>
>> *Sent:* Saturday, August 09, 2014 8:55 PM
>>  *To:* user@hadoop.apache.org
>> *Subject:* Re: Data cleansing in modern data architecture
>>
>>    Answer: No we can’t get rid of bad records. We have to go back and
>> rebuild the entire file. We can’t edit records but we can get rid of entire
>> files right? This would suggest that appending data to files isn’t that
>> great of an idea. It sounds like it would be more appropriate to cut a
>> hadoop data load up into periodic files (days, months, etc.) that can
>> easily be rebuilt should errors occur....
>>
>> Adaryl "Bob" Wakefield, MBA
>> Principal
>> Mass Street Analytics
>> 913.938.6685
>> www.linkedin.com/in/bobwakefieldmba
>> Twitter: @BobLovesData
>>
>>  *From:* Adaryl "Bob" Wakefield, MBA <ad...@hotmail.com>
>> *Sent:* Saturday, August 09, 2014 4:01 AM
>> *To:* user@hadoop.apache.org
>> *Subject:* Re: Data cleansing in modern data architecture
>>
>>   I’m sorry but I have to revisit this again. Going through the reply
>> below I realized that I didn’t quite get my question answered. Let me be
>> more explicit with the scenario.
>>
>> There is a bug in the transactional system.
>> The data gets written to HDFS where it winds up in Hive.
>> Somebody notices that their report is off/the numbers don’t look right.
>> We investigate and find the bug in the transactional system.
>>
>> Question: Can we then go back into HDFS and rid ourselves of the bad
>> records? If not, what is the recommended course of action?
>>
>> Adaryl "Bob" Wakefield, MBA
>> Principal
>> Mass Street Analytics
>> 913.938.6685
>> www.linkedin.com/in/bobwakefieldmba
>>
>>  *From:* Shahab Yunus <sh...@gmail.com>
>> *Sent:* Sunday, July 20, 2014 4:20 PM
>> *To:* user@hadoop.apache.org
>> *Subject:* Re: Data cleansing in modern data architecture
>>
>>  I am assuming you meant the batch jobs that are/were used in old world
>> for data cleansing.
>>
>> As far as I understand there is no hard and fast rule for it and it
>> depends functional and system requirements of the usecase.
>>
>> It is also dependent on the technology being used and how it manages
>> 'deletion'.
>>
>> E.g. in HBase or Cassandra, you can write batch jobs which clean or
>> correct or remove unwanted or incorrect data and than the underlying stores
>> usually have a concept of compaction which not only defragments data files
>> but also at this point removes from disk all the entries marked as deleted.
>>
>> But there are considerations to be aware of given that compaction is a
>> heavy process and in some cases (e.g. Cassandra) there can be problems when
>> there are too much data to be removed. Not only that, in some cases,
>> marked-to-be-deleted data, until it is deleted/compacted can slow down
>> normal operations of the data store as well.
>>
>> One can also leverage in HBase's case the versioning mechanism and the
>> afore-mentioned batch job can simply overwrite the same row key and the
>> previous version would no longer be the latest. If max-version parameter is
>> configured as 1 then no previous version would be maintained (physically it
>> would be and would be removed at compaction time but would not be
>> query-able.)
>>
>> In the end, basically cleansing can be done after or before loading but
>> given the append-only and no hard-delete design approaches of most nosql
>> stores, I would say it would be easier to do cleaning before data is loaded
>> in the nosql store. Of course, it bears repeating that it depends on the
>> use case.
>>
>> Having said that, on a side-note and a bit off-topic, it reminds me of
>> the Lamda Architecture that combines batch and real-time computation for
>> big data using various technologies and it uses the idea of constant
>> periodic refreshes to reload the data and within this periodic refresh, the
>> expectations are that any invalid older data would be corrected and
>> overwritten by the new refresh load. Those basically the 'batch part' of
>> the LA takes care of data cleansing by reloading everything. But LA is
>> mostly for thouse systems which are ok with eventually consistent behavior
>> and might not be suitable for some systems.
>>
>> Regards,
>> Shahab
>>
>>
>> On Sun, Jul 20, 2014 at 2:36 PM, Adaryl "Bob" Wakefield, MBA <
>> adaryl.wakefield@hotmail.com> wrote:
>>
>>>   In the old world, data cleaning used to be a large part of the data
>>> warehouse load. Now that we’re working in a schemaless environment, I’m not
>>> sure where data cleansing is supposed to take place. NoSQL sounds fun
>>> because theoretically you just drop everything in but transactional systems
>>> that generate the data are still full of bugs and create junk data.
>>>
>>> My question is, where does data cleaning/master data management/CDI
>>> belong in a modern data architecture? Before it hit hits Hadoop? After?
>>>
>>> B.
>>>
>>
>>
>
>
>
> --
> It's just about how deep your longing is!
>



-- 
It's just about how deep your longing is!

Re: Data cleansing in modern data architecture

Posted by Sriram Ramachandrasekaran <sr...@gmail.com>.
Ok. If you think, the noise levels in the data is going to be so less,
doing the view creation is probably costly and meaningless.
HDFS is append-only. So, there's no point writing the transactions as HDFS
files and trying to perform analytics on top of it directly.

Instead, you could go with HBase, with rowkeys being the keys with which
you identify and resolve these transactional errors.
i.e., if you had a faulty transaction data being logged, it would have and
transaction ID, along with associated data.
In case, you found out that a particular invoice is offending, you could
just remove it from HBase using the transaction ID (rowkey).

But, if you want to use the same table for running different reports, it
might not work out.
Because, most of the HBase operations depend on the rowkey and in this
table containing transaction ID as the rowkey
there's no way your reports would leverage it. So, you need to decide which
is a costly operation, removing noise or running _more_ adhoc
reports and decide to use a different table(which is a view again) for
reports, etc.




On Sun, Aug 10, 2014 at 2:02 PM, Adaryl "Bob" Wakefield, MBA <
adaryl.wakefield@hotmail.com> wrote:

>   It’s a lot of theory right now so let me give you the full background
> and see if we can more refine the answer.
>
> I’ve had a lot of clients with data warehouses that just weren’t
> functional for various reasons. I’m researching Hadoop to try and figure
> out a way to totally eliminate traditional data warehouses. I know all the
> arguments for keeping them around and I’m not impressed with any of them.
> I’ve noticed for a while that traditional data storage methods just aren’t
> up to the task for the things we’re asking data to do these days.
>
> I’ve got MOST of it figured out. I know how to store and deliver analytics
> using all the various tools within the Apache project (and some NOT in the
> Apache project). What I haven’t figured out is how to do data cleansing or
> master data management both of which are hard to do if you can’t change
> anything.
>
> So let’s say there is a transactional system. It’s a web application that
> is the businesses main source of revenue. All the activity of the user on
> the website is easily structured (so basically we’re not dealing with
> un-structured data). The nature of the data is financial.
>
> The pipeline is fairly straight forward. The data is extracted from the
> transactional system and placed into a Hadoop environment. From there, it’s
> exposed by Hive so non technical business analyst with SQL skills can  do
> what they need to do. Pretty typical right?
>
> The problem is the web app is not perfect and occasionally produces junk
> data. Nothing obvious. It may be a few days before the error is noticed. An
> example would be phantom invoices. Those invoices get in Hadoop. A few days
> later an analyst notices that the invoice figures for some period are
> inflated.
>
> Once we identify the offending records there is NO reason for them to
> remain in the system; it’s meaningless junk data. Those records are of zero
> value. I encounter this scenario in the real world quite often. In the old
> world, we would just blow away the offending records. Just write a view to
> skip over a couple of records or exclude a few dozen doesn’t make much
> sense. It’s better to just blow these records away, I’m just not certain
> what the best way to accomplish that is in the new world.
>
> Adaryl "Bob" Wakefield, MBA
> Principal
> Mass Street Analytics
> 913.938.6685
> www.linkedin.com/in/bobwakefieldmba
> Twitter: @BobLovesData
>
>  *From:* Sriram Ramachandrasekaran <sr...@gmail.com>
> *Sent:* Saturday, August 09, 2014 11:55 PM
> *To:* user@hadoop.apache.org
> *Subject:* Re: Data cleansing in modern data architecture
>
>  While, I may not have enough context to your entire processing pipeline,
> here are my thoughts.
> 1. It's always useful to have raw data, irrespective of if it was right or
> wrong. The way to look at it is, it's the source of truth at timestamp t.
> 2. Note that, You only know that the data at timestamp t for an id X was
> wrong because, subsequent info about X seem to conflict with the one at t
> or some manual debugging finds it out.
>
> All systems that does reporting/analytics is better off by not meddling
> with the raw data. There should be processed or computed views of this
> data, that massages it, gets rids of noisy data, merges duplicate entries,
> etc and then finally produces an output that's suitable for your
> reports/analytics. So, your idea to write transaction logs to HDFS is
> fine(unless, you are twisting your systems to get it that way), but, you
> just need to introduce one more layer of indirection, which has the
> business logic to handle noise/errors like this.
>
> For your specific case, you could've a transaction processor up job which
> produces a view, that takes care of squashing transactions based on
> id(something that makes sense in your system) and then handles the business
> logic of how to handle the bugs/discrepancies in them. Your views could be
> loaded into a nice columnar store for faster query retrieval(if you have
> pointed queries - based on a key), else, a different store would be needed.
> Yes, this has the overhead of running the view creation job, but, I think,
> the ability to go back to raw data and investigate what happened there is
> worth it.
>
> Your approach of structuring it and storing it in HBase is also fine as
> long as you keep the concerns separate(if your write/read workloads are
> poles apart).
>
> Hope this helps.
>
>
>
>
>
> On Sun, Aug 10, 2014 at 9:36 AM, Adaryl "Bob" Wakefield, MBA <
> adaryl.wakefield@hotmail.com> wrote:
>
>>   Or...as an alternative, since HBASE uses HDFS to store it’s data, can
>> we get around the no editing file rule by dropping structured data into
>> HBASE? That way, we have data in HDFS that can be deleted. Any real problem
>> with that idea?
>>
>> Adaryl "Bob" Wakefield, MBA
>> Principal
>> Mass Street Analytics
>> 913.938.6685
>> www.linkedin.com/in/bobwakefieldmba
>> Twitter: @BobLovesData
>>
>>  *From:* Adaryl "Bob" Wakefield, MBA <ad...@hotmail.com>
>> *Sent:* Saturday, August 09, 2014 8:55 PM
>>  *To:* user@hadoop.apache.org
>> *Subject:* Re: Data cleansing in modern data architecture
>>
>>    Answer: No we can’t get rid of bad records. We have to go back and
>> rebuild the entire file. We can’t edit records but we can get rid of entire
>> files right? This would suggest that appending data to files isn’t that
>> great of an idea. It sounds like it would be more appropriate to cut a
>> hadoop data load up into periodic files (days, months, etc.) that can
>> easily be rebuilt should errors occur....
>>
>> Adaryl "Bob" Wakefield, MBA
>> Principal
>> Mass Street Analytics
>> 913.938.6685
>> www.linkedin.com/in/bobwakefieldmba
>> Twitter: @BobLovesData
>>
>>  *From:* Adaryl "Bob" Wakefield, MBA <ad...@hotmail.com>
>> *Sent:* Saturday, August 09, 2014 4:01 AM
>> *To:* user@hadoop.apache.org
>> *Subject:* Re: Data cleansing in modern data architecture
>>
>>   I’m sorry but I have to revisit this again. Going through the reply
>> below I realized that I didn’t quite get my question answered. Let me be
>> more explicit with the scenario.
>>
>> There is a bug in the transactional system.
>> The data gets written to HDFS where it winds up in Hive.
>> Somebody notices that their report is off/the numbers don’t look right.
>> We investigate and find the bug in the transactional system.
>>
>> Question: Can we then go back into HDFS and rid ourselves of the bad
>> records? If not, what is the recommended course of action?
>>
>> Adaryl "Bob" Wakefield, MBA
>> Principal
>> Mass Street Analytics
>> 913.938.6685
>> www.linkedin.com/in/bobwakefieldmba
>>
>>  *From:* Shahab Yunus <sh...@gmail.com>
>> *Sent:* Sunday, July 20, 2014 4:20 PM
>> *To:* user@hadoop.apache.org
>> *Subject:* Re: Data cleansing in modern data architecture
>>
>>  I am assuming you meant the batch jobs that are/were used in old world
>> for data cleansing.
>>
>> As far as I understand there is no hard and fast rule for it and it
>> depends functional and system requirements of the usecase.
>>
>> It is also dependent on the technology being used and how it manages
>> 'deletion'.
>>
>> E.g. in HBase or Cassandra, you can write batch jobs which clean or
>> correct or remove unwanted or incorrect data and than the underlying stores
>> usually have a concept of compaction which not only defragments data files
>> but also at this point removes from disk all the entries marked as deleted.
>>
>> But there are considerations to be aware of given that compaction is a
>> heavy process and in some cases (e.g. Cassandra) there can be problems when
>> there are too much data to be removed. Not only that, in some cases,
>> marked-to-be-deleted data, until it is deleted/compacted can slow down
>> normal operations of the data store as well.
>>
>> One can also leverage in HBase's case the versioning mechanism and the
>> afore-mentioned batch job can simply overwrite the same row key and the
>> previous version would no longer be the latest. If max-version parameter is
>> configured as 1 then no previous version would be maintained (physically it
>> would be and would be removed at compaction time but would not be
>> query-able.)
>>
>> In the end, basically cleansing can be done after or before loading but
>> given the append-only and no hard-delete design approaches of most nosql
>> stores, I would say it would be easier to do cleaning before data is loaded
>> in the nosql store. Of course, it bears repeating that it depends on the
>> use case.
>>
>> Having said that, on a side-note and a bit off-topic, it reminds me of
>> the Lamda Architecture that combines batch and real-time computation for
>> big data using various technologies and it uses the idea of constant
>> periodic refreshes to reload the data and within this periodic refresh, the
>> expectations are that any invalid older data would be corrected and
>> overwritten by the new refresh load. Those basically the 'batch part' of
>> the LA takes care of data cleansing by reloading everything. But LA is
>> mostly for thouse systems which are ok with eventually consistent behavior
>> and might not be suitable for some systems.
>>
>> Regards,
>> Shahab
>>
>>
>> On Sun, Jul 20, 2014 at 2:36 PM, Adaryl "Bob" Wakefield, MBA <
>> adaryl.wakefield@hotmail.com> wrote:
>>
>>>   In the old world, data cleaning used to be a large part of the data
>>> warehouse load. Now that we’re working in a schemaless environment, I’m not
>>> sure where data cleansing is supposed to take place. NoSQL sounds fun
>>> because theoretically you just drop everything in but transactional systems
>>> that generate the data are still full of bugs and create junk data.
>>>
>>> My question is, where does data cleaning/master data management/CDI
>>> belong in a modern data architecture? Before it hit hits Hadoop? After?
>>>
>>> B.
>>>
>>
>>
>
>
>
> --
> It's just about how deep your longing is!
>



-- 
It's just about how deep your longing is!

Re: Data cleansing in modern data architecture

Posted by Sriram Ramachandrasekaran <sr...@gmail.com>.
Ok. If you think, the noise levels in the data is going to be so less,
doing the view creation is probably costly and meaningless.
HDFS is append-only. So, there's no point writing the transactions as HDFS
files and trying to perform analytics on top of it directly.

Instead, you could go with HBase, with rowkeys being the keys with which
you identify and resolve these transactional errors.
i.e., if you had a faulty transaction data being logged, it would have and
transaction ID, along with associated data.
In case, you found out that a particular invoice is offending, you could
just remove it from HBase using the transaction ID (rowkey).

But, if you want to use the same table for running different reports, it
might not work out.
Because, most of the HBase operations depend on the rowkey and in this
table containing transaction ID as the rowkey
there's no way your reports would leverage it. So, you need to decide which
is a costly operation, removing noise or running _more_ adhoc
reports and decide to use a different table(which is a view again) for
reports, etc.




On Sun, Aug 10, 2014 at 2:02 PM, Adaryl "Bob" Wakefield, MBA <
adaryl.wakefield@hotmail.com> wrote:

>   It’s a lot of theory right now so let me give you the full background
> and see if we can more refine the answer.
>
> I’ve had a lot of clients with data warehouses that just weren’t
> functional for various reasons. I’m researching Hadoop to try and figure
> out a way to totally eliminate traditional data warehouses. I know all the
> arguments for keeping them around and I’m not impressed with any of them.
> I’ve noticed for a while that traditional data storage methods just aren’t
> up to the task for the things we’re asking data to do these days.
>
> I’ve got MOST of it figured out. I know how to store and deliver analytics
> using all the various tools within the Apache project (and some NOT in the
> Apache project). What I haven’t figured out is how to do data cleansing or
> master data management both of which are hard to do if you can’t change
> anything.
>
> So let’s say there is a transactional system. It’s a web application that
> is the businesses main source of revenue. All the activity of the user on
> the website is easily structured (so basically we’re not dealing with
> un-structured data). The nature of the data is financial.
>
> The pipeline is fairly straight forward. The data is extracted from the
> transactional system and placed into a Hadoop environment. From there, it’s
> exposed by Hive so non technical business analyst with SQL skills can  do
> what they need to do. Pretty typical right?
>
> The problem is the web app is not perfect and occasionally produces junk
> data. Nothing obvious. It may be a few days before the error is noticed. An
> example would be phantom invoices. Those invoices get in Hadoop. A few days
> later an analyst notices that the invoice figures for some period are
> inflated.
>
> Once we identify the offending records there is NO reason for them to
> remain in the system; it’s meaningless junk data. Those records are of zero
> value. I encounter this scenario in the real world quite often. In the old
> world, we would just blow away the offending records. Just write a view to
> skip over a couple of records or exclude a few dozen doesn’t make much
> sense. It’s better to just blow these records away, I’m just not certain
> what the best way to accomplish that is in the new world.
>
> Adaryl "Bob" Wakefield, MBA
> Principal
> Mass Street Analytics
> 913.938.6685
> www.linkedin.com/in/bobwakefieldmba
> Twitter: @BobLovesData
>
>  *From:* Sriram Ramachandrasekaran <sr...@gmail.com>
> *Sent:* Saturday, August 09, 2014 11:55 PM
> *To:* user@hadoop.apache.org
> *Subject:* Re: Data cleansing in modern data architecture
>
>  While, I may not have enough context to your entire processing pipeline,
> here are my thoughts.
> 1. It's always useful to have raw data, irrespective of if it was right or
> wrong. The way to look at it is, it's the source of truth at timestamp t.
> 2. Note that, You only know that the data at timestamp t for an id X was
> wrong because, subsequent info about X seem to conflict with the one at t
> or some manual debugging finds it out.
>
> All systems that does reporting/analytics is better off by not meddling
> with the raw data. There should be processed or computed views of this
> data, that massages it, gets rids of noisy data, merges duplicate entries,
> etc and then finally produces an output that's suitable for your
> reports/analytics. So, your idea to write transaction logs to HDFS is
> fine(unless, you are twisting your systems to get it that way), but, you
> just need to introduce one more layer of indirection, which has the
> business logic to handle noise/errors like this.
>
> For your specific case, you could've a transaction processor up job which
> produces a view, that takes care of squashing transactions based on
> id(something that makes sense in your system) and then handles the business
> logic of how to handle the bugs/discrepancies in them. Your views could be
> loaded into a nice columnar store for faster query retrieval(if you have
> pointed queries - based on a key), else, a different store would be needed.
> Yes, this has the overhead of running the view creation job, but, I think,
> the ability to go back to raw data and investigate what happened there is
> worth it.
>
> Your approach of structuring it and storing it in HBase is also fine as
> long as you keep the concerns separate(if your write/read workloads are
> poles apart).
>
> Hope this helps.
>
>
>
>
>
> On Sun, Aug 10, 2014 at 9:36 AM, Adaryl "Bob" Wakefield, MBA <
> adaryl.wakefield@hotmail.com> wrote:
>
>>   Or...as an alternative, since HBASE uses HDFS to store it’s data, can
>> we get around the no editing file rule by dropping structured data into
>> HBASE? That way, we have data in HDFS that can be deleted. Any real problem
>> with that idea?
>>
>> Adaryl "Bob" Wakefield, MBA
>> Principal
>> Mass Street Analytics
>> 913.938.6685
>> www.linkedin.com/in/bobwakefieldmba
>> Twitter: @BobLovesData
>>
>>  *From:* Adaryl "Bob" Wakefield, MBA <ad...@hotmail.com>
>> *Sent:* Saturday, August 09, 2014 8:55 PM
>>  *To:* user@hadoop.apache.org
>> *Subject:* Re: Data cleansing in modern data architecture
>>
>>    Answer: No we can’t get rid of bad records. We have to go back and
>> rebuild the entire file. We can’t edit records but we can get rid of entire
>> files right? This would suggest that appending data to files isn’t that
>> great of an idea. It sounds like it would be more appropriate to cut a
>> hadoop data load up into periodic files (days, months, etc.) that can
>> easily be rebuilt should errors occur....
>>
>> Adaryl "Bob" Wakefield, MBA
>> Principal
>> Mass Street Analytics
>> 913.938.6685
>> www.linkedin.com/in/bobwakefieldmba
>> Twitter: @BobLovesData
>>
>>  *From:* Adaryl "Bob" Wakefield, MBA <ad...@hotmail.com>
>> *Sent:* Saturday, August 09, 2014 4:01 AM
>> *To:* user@hadoop.apache.org
>> *Subject:* Re: Data cleansing in modern data architecture
>>
>>   I’m sorry but I have to revisit this again. Going through the reply
>> below I realized that I didn’t quite get my question answered. Let me be
>> more explicit with the scenario.
>>
>> There is a bug in the transactional system.
>> The data gets written to HDFS where it winds up in Hive.
>> Somebody notices that their report is off/the numbers don’t look right.
>> We investigate and find the bug in the transactional system.
>>
>> Question: Can we then go back into HDFS and rid ourselves of the bad
>> records? If not, what is the recommended course of action?
>>
>> Adaryl "Bob" Wakefield, MBA
>> Principal
>> Mass Street Analytics
>> 913.938.6685
>> www.linkedin.com/in/bobwakefieldmba
>>
>>  *From:* Shahab Yunus <sh...@gmail.com>
>> *Sent:* Sunday, July 20, 2014 4:20 PM
>> *To:* user@hadoop.apache.org
>> *Subject:* Re: Data cleansing in modern data architecture
>>
>>  I am assuming you meant the batch jobs that are/were used in old world
>> for data cleansing.
>>
>> As far as I understand there is no hard and fast rule for it and it
>> depends functional and system requirements of the usecase.
>>
>> It is also dependent on the technology being used and how it manages
>> 'deletion'.
>>
>> E.g. in HBase or Cassandra, you can write batch jobs which clean or
>> correct or remove unwanted or incorrect data and than the underlying stores
>> usually have a concept of compaction which not only defragments data files
>> but also at this point removes from disk all the entries marked as deleted.
>>
>> But there are considerations to be aware of given that compaction is a
>> heavy process and in some cases (e.g. Cassandra) there can be problems when
>> there are too much data to be removed. Not only that, in some cases,
>> marked-to-be-deleted data, until it is deleted/compacted can slow down
>> normal operations of the data store as well.
>>
>> One can also leverage in HBase's case the versioning mechanism and the
>> afore-mentioned batch job can simply overwrite the same row key and the
>> previous version would no longer be the latest. If max-version parameter is
>> configured as 1 then no previous version would be maintained (physically it
>> would be and would be removed at compaction time but would not be
>> query-able.)
>>
>> In the end, basically cleansing can be done after or before loading but
>> given the append-only and no hard-delete design approaches of most nosql
>> stores, I would say it would be easier to do cleaning before data is loaded
>> in the nosql store. Of course, it bears repeating that it depends on the
>> use case.
>>
>> Having said that, on a side-note and a bit off-topic, it reminds me of
>> the Lamda Architecture that combines batch and real-time computation for
>> big data using various technologies and it uses the idea of constant
>> periodic refreshes to reload the data and within this periodic refresh, the
>> expectations are that any invalid older data would be corrected and
>> overwritten by the new refresh load. Those basically the 'batch part' of
>> the LA takes care of data cleansing by reloading everything. But LA is
>> mostly for thouse systems which are ok with eventually consistent behavior
>> and might not be suitable for some systems.
>>
>> Regards,
>> Shahab
>>
>>
>> On Sun, Jul 20, 2014 at 2:36 PM, Adaryl "Bob" Wakefield, MBA <
>> adaryl.wakefield@hotmail.com> wrote:
>>
>>>   In the old world, data cleaning used to be a large part of the data
>>> warehouse load. Now that we’re working in a schemaless environment, I’m not
>>> sure where data cleansing is supposed to take place. NoSQL sounds fun
>>> because theoretically you just drop everything in but transactional systems
>>> that generate the data are still full of bugs and create junk data.
>>>
>>> My question is, where does data cleaning/master data management/CDI
>>> belong in a modern data architecture? Before it hits Hadoop? After?
>>>
>>> B.
>>>
>>
>>
>
>
>
> --
> It's just about how deep your longing is!
>



-- 
It's just about how deep your longing is!

Re: Data cleansing in modern data architecture

Posted by "Adaryl \"Bob\" Wakefield, MBA" <ad...@hotmail.com>.
It’s a lot of theory right now, so let me give you the full background and see if we can refine the answer a bit more.

I’ve had a lot of clients with data warehouses that just weren’t functional for various reasons. I’m researching Hadoop to try to figure out a way to totally eliminate traditional data warehouses. I know all the arguments for keeping them around and I’m not impressed with any of them. I’ve noticed for a while that traditional data storage methods just aren’t up to the task for the things we’re asking data to do these days.

I’ve got MOST of it figured out. I know how to store and deliver analytics using all the various tools within the Apache project (and some NOT in the Apache project). What I haven’t figured out is how to do data cleansing or master data management, both of which are hard to do if you can’t change anything.

So let’s say there is a transactional system. It’s a web application that is the business’s main source of revenue. All the activity of the user on the website is easily structured (so basically we’re not dealing with unstructured data). The nature of the data is financial.

The pipeline is fairly straightforward. The data is extracted from the transactional system and placed into a Hadoop environment. From there, it’s exposed by Hive so non-technical business analysts with SQL skills can do what they need to do. Pretty typical, right?

The problem is the web app is not perfect and occasionally produces junk data. Nothing obvious. It may be a few days before the error is noticed. An example would be phantom invoices. Those invoices get into Hadoop. A few days later an analyst notices that the invoice figures for some period are inflated.

Once we identify the offending records there is NO reason for them to remain in the system; it’s meaningless junk data. Those records are of zero value. I encounter this scenario in the real world quite often. In the old world, we would just blow away the offending records. Just writing a view to skip over a couple of records, or to exclude a few dozen, doesn’t make much sense. It’s better to just blow these records away; I’m just not certain what the best way to accomplish that is in the new world.
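
One way that could work, assuming the invoices land as day-partitioned text files under a directory that Hive points at, is to rebuild only the affected day: filter the known-bad invoice ids into a temporary directory and then swap it into place. The paths, record layout, and invoice ids in the rough sketch below are made up for illustration; it only uses the plain Hadoop FileSystem API.

    // A sketch only: assumes newline-delimited text invoices under a per-day
    // directory, with the invoice id as the first tab-separated field.
    // Paths, record layout and ids are illustrative, not from a real system.
    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.io.OutputStreamWriter;
    import java.io.PrintWriter;
    import java.util.Arrays;
    import java.util.HashSet;
    import java.util.Set;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class RebuildPartition {
      public static void main(String[] args) throws Exception {
        Set<String> badInvoiceIds =
            new HashSet<String>(Arrays.asList("INV-1001", "INV-1002"));
        Path oldDir = new Path("/data/invoices/dt=2014-08-01");
        Path tmpDir = new Path("/data/invoices/.rebuild_dt=2014-08-01");

        FileSystem fs = FileSystem.get(new Configuration());
        fs.mkdirs(tmpDir);

        // Copy every file of the affected day, dropping the phantom invoices.
        for (FileStatus stat : fs.listStatus(oldDir)) {
          BufferedReader in = new BufferedReader(
              new InputStreamReader(fs.open(stat.getPath())));
          PrintWriter out = new PrintWriter(new OutputStreamWriter(
              fs.create(new Path(tmpDir, stat.getPath().getName()))));
          String line;
          while ((line = in.readLine()) != null) {
            String invoiceId = line.split("\t", 2)[0];
            if (!badInvoiceIds.contains(invoiceId)) {
              out.println(line);
            }
          }
          out.close();
          in.close();
        }

        // Swap the rebuilt day into place (not atomic); Hive keeps reading the
        // same partition location once the rename completes.
        fs.delete(oldDir, true);
        fs.rename(tmpDir, oldDir);
      }
    }

The same effect can be had from Hive itself by overwriting the affected partition with an INSERT OVERWRITE that selects everything except the bad ids; either way, the unit of repair is a whole partition rather than an individual record.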

Adaryl "Bob" Wakefield, MBA
Principal
Mass Street Analytics
913.938.6685
www.linkedin.com/in/bobwakefieldmba
Twitter: @BobLovesData

From: Sriram Ramachandrasekaran 
Sent: Saturday, August 09, 2014 11:55 PM
To: user@hadoop.apache.org 
Subject: Re: Data cleansing in modern data architecture

While I may not have enough context on your entire processing pipeline, here are my thoughts.

1. It's always useful to have raw data, irrespective of whether it was right or wrong. The way to look at it is, it's the source of truth at timestamp t.
2. Note that you only know the data at timestamp t for an id X was wrong because subsequent info about X seems to conflict with the one at t, or some manual debugging finds it out.

Any system that does reporting/analytics is better off not meddling with the raw data. There should be processed or computed views of this data that massage it, get rid of noisy data, merge duplicate entries, etc., and finally produce an output that's suitable for your reports/analytics. So your idea to write transaction logs to HDFS is fine (unless you are twisting your systems to get it that way), but you just need to introduce one more layer of indirection, which has the business logic to handle noise/errors like this.

For your specific case, you could have a transaction processor job which produces a view that takes care of squashing transactions based on id (something that makes sense in your system) and then handles the business logic of how to deal with the bugs/discrepancies in them. Your views could be loaded into a nice columnar store for faster query retrieval (if you have pointed queries, based on a key); else, a different store would be needed. Yes, this has the overhead of running the view-creation job, but I think the ability to go back to raw data and investigate what happened there is worth it.

Your approach of structuring it and storing it in HBase is also fine as long as you keep the concerns separate (if your write/read workloads are poles apart).

Hope this helps.






On Sun, Aug 10, 2014 at 9:36 AM, Adaryl "Bob" Wakefield, MBA <ad...@hotmail.com> wrote:

  Or...as an alternative, since HBASE uses HDFS to store its data, can we get around the no editing file rule by dropping structured data into HBASE? That way, we have data in HDFS that can be deleted. Any real problem with that idea?

  Adaryl "Bob" Wakefield, MBA
  Principal
  Mass Street Analytics
  913.938.6685
  www.linkedin.com/in/bobwakefieldmba
  Twitter: @BobLovesData

  From: Adaryl "Bob" Wakefield, MBA 
  Sent: Saturday, August 09, 2014 8:55 PM
  To: user@hadoop.apache.org 
  Subject: Re: Data cleansing in modern data architecture

  Answer: No, we can’t get rid of bad records. We have to go back and rebuild the entire file. We can’t edit records, but we can get rid of entire files, right? This would suggest that appending data to files isn’t that great of an idea. It sounds like it would be more appropriate to cut a Hadoop data load up into periodic files (days, months, etc.) that can easily be rebuilt should errors occur....

  Adaryl "Bob" Wakefield, MBA
  Principal
  Mass Street Analytics
  913.938.6685
  www.linkedin.com/in/bobwakefieldmba
  Twitter: @BobLovesData

  From: Adaryl "Bob" Wakefield, MBA 
  Sent: Saturday, August 09, 2014 4:01 AM
  To: user@hadoop.apache.org 
  Subject: Re: Data cleansing in modern data architecture

  I’m sorry but I have to revisit this again. Going through the reply below I realized that I didn’t quite get my question answered. Let me be more explicit with the scenario.

  There is a bug in the transactional system.
  The data gets written to HDFS where it winds up in Hive.
  Somebody notices that their report is off/the numbers don’t look right.
  We investigate and find the bug in the transactional system.

  Question: Can we then go back into HDFS and rid ourselves of the bad records? If not, what is the recommended course of action?

  Adaryl "Bob" Wakefield, MBA
  Principal
  Mass Street Analytics
  913.938.6685
  www.linkedin.com/in/bobwakefieldmba

  From: Shahab Yunus 
  Sent: Sunday, July 20, 2014 4:20 PM
  To: user@hadoop.apache.org 
  Subject: Re: Data cleansing in modern data architecture

  I am assuming you meant the batch jobs that are/were used in the old world for data cleansing. 

  As far as I understand there is no hard and fast rule for it; it depends on the functional and system requirements of the use case. 

  It is also dependent on the technology being used and how it manages 'deletion'.

  E.g. in HBase or Cassandra, you can write batch jobs which clean or correct or remove unwanted or incorrect data, and then the underlying stores usually have a concept of compaction which not only defragments data files but also at this point removes from disk all the entries marked as deleted.

  But there are considerations to be aware of given that compaction is a heavy process and in some cases (e.g. Cassandra) there can be problems when there is too much data to be removed. Not only that, in some cases, marked-to-be-deleted data, until it is deleted/compacted, can slow down normal operations of the data store as well.

  In HBase's case one can also leverage the versioning mechanism: the afore-mentioned batch job can simply overwrite the same row key, and the previous version would no longer be the latest. If the max-versions parameter is configured as 1 then no previous version would be maintained (physically it would be, and would be removed at compaction time, but would not be query-able).

  In the end, basically cleansing can be done after or before loading but given the append-only and no hard-delete design approaches of most nosql stores, I would say it would be easier to do cleaning before data is loaded in the nosql store. Of course, it bears repeating that it depends on the use case.

  Having said that, on a side-note and a bit off-topic, it reminds me of the Lambda Architecture, which combines batch and real-time computation for big data using various technologies and uses the idea of constant periodic refreshes to reload the data; within this periodic refresh, the expectation is that any invalid older data would be corrected and overwritten by the new refresh load. Thus basically the 'batch part' of the LA takes care of data cleansing by reloading everything. But LA is mostly for those systems which are ok with eventually consistent behavior and might not be suitable for some systems.

  Regards,
  Shahab



  On Sun, Jul 20, 2014 at 2:36 PM, Adaryl "Bob" Wakefield, MBA <ad...@hotmail.com> wrote:

    In the old world, data cleaning used to be a large part of the data warehouse load. Now that we’re working in a schemaless environment, I’m not sure where data cleansing is supposed to take place. NoSQL sounds fun because theoretically you just drop everything in but transactional systems that generate the data are still full of bugs and create junk data. 

    My question is, where does data cleaning/master data management/CDI belong in a modern data architecture? Before it hits Hadoop? After?

    B.





-- 
It's just about how deep your longing is!

Re: Data cleansing in modern data architecture

Posted by Sriram Ramachandrasekaran <sr...@gmail.com>.
While I may not have enough context on your entire processing pipeline,
here are my thoughts.
1. It's always useful to have raw data, irrespective of whether it was right
or wrong. The way to look at it is, it's the source of truth at timestamp t.
2. Note that you only know the data at timestamp t for an id X was wrong
because subsequent info about X seems to conflict with the one at t, or some
manual debugging finds it out.

Any system that does reporting/analytics is better off not meddling with the
raw data. There should be processed or computed views of this data that
massage it, get rid of noisy data, merge duplicate entries, etc., and finally
produce an output that's suitable for your reports/analytics. So your idea to
write transaction logs to HDFS is fine (unless you are twisting your systems
to get it that way), but you just need to introduce one more layer of
indirection, which has the business logic to handle noise/errors like this.

For your specific case, you could have a transaction processor job which
produces a view that takes care of squashing transactions based on id
(something that makes sense in your system) and then handles the business
logic of how to deal with the bugs/discrepancies in them. Your views could be
loaded into a nice columnar store for faster query retrieval (if you have
pointed queries, based on a key); else, a different store would be needed.
Yes, this has the overhead of running the view-creation job, but I think the
ability to go back to raw data and investigate what happened there is
worth it.
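
As a rough sketch of that view-creation job (the tab-separated record layout
of id, timestamp, amount, status and the paths are invented here purely for
illustration), a plain MapReduce pass that keys raw records by transaction id
and keeps only the latest one that isn't flagged bad could look like this:

    // Sketch of the "computed view" idea: group raw records by id and keep the
    // latest valid one. Record format and the PHANTOM flag are made-up examples.
    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class TransactionViewJob {

      public static class IdMapper extends Mapper<LongWritable, Text, Text, Text> {
        protected void map(LongWritable offset, Text record, Context ctx)
            throws IOException, InterruptedException {
          // Key every raw record by its transaction id (first field).
          ctx.write(new Text(record.toString().split("\t")[0]), record);
        }
      }

      public static class LatestValidReducer extends Reducer<Text, Text, Text, Text> {
        protected void reduce(Text id, Iterable<Text> records, Context ctx)
            throws IOException, InterruptedException {
          String best = null;
          long bestTs = Long.MIN_VALUE;
          for (Text t : records) {
            String[] f = t.toString().split("\t");
            long ts = Long.parseLong(f[1]);
            // Business rule lives here: skip records flagged as phantom invoices.
            if (!"PHANTOM".equals(f[3]) && ts > bestTs) {
              bestTs = ts;
              best = t.toString();
            }
          }
          if (best != null) {
            ctx.write(id, new Text(best));
          }
        }
      }

      public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "transaction-view");
        job.setJarByClass(TransactionViewJob.class);
        job.setMapperClass(IdMapper.class);
        job.setReducerClass(LatestValidReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        FileInputFormat.addInputPath(job, new Path("/data/raw/transactions"));
        FileOutputFormat.setOutputPath(job, new Path("/data/views/transactions_clean"));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }

The raw directory stays append-only and untouched; only the derived view gets
rewritten when the business rules change.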

Your approach of structuring it and storing it in HBase is also fine as long
as you keep the concerns separate (if your write/read workloads are poles
apart).
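
If you do go the HBase route, correcting or dropping a bad invoice becomes a
plain client call, and with MAX_VERSIONS set to 1 on the column family the old
cell is no longer query-able and gets physically removed at compaction. A
minimal sketch with the classic HTable client (the table name, column family
and row keys are invented for illustration):

    // Sketch of fixing up bad rows in HBase; names are illustrative only.
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.Delete;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.util.Bytes;

    public class FixInvoicesInHBase {
      public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "invoices");

        // Correct a record in place by overwriting the same row key.
        Put put = new Put(Bytes.toBytes("INV-1001"));
        put.add(Bytes.toBytes("d"), Bytes.toBytes("amount"), Bytes.toBytes("129.99"));
        table.put(put);

        // Or drop a phantom invoice entirely; the tombstone is cleaned up at
        // the next major compaction.
        Delete delete = new Delete(Bytes.toBytes("INV-1002"));
        table.delete(delete);

        table.close();
      }
    }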

Hope this helps.





On Sun, Aug 10, 2014 at 9:36 AM, Adaryl "Bob" Wakefield, MBA <
adaryl.wakefield@hotmail.com> wrote:

>   Or...as an alternative, since HBASE uses HDFS to store its data, can
> we get around the no editing file rule by dropping structured data into
> HBASE? That way, we have data in HDFS that can be deleted. Any real problem
> with that idea?
>
> Adaryl "Bob" Wakefield, MBA
> Principal
> Mass Street Analytics
> 913.938.6685
> www.linkedin.com/in/bobwakefieldmba
> Twitter: @BobLovesData
>
>  *From:* Adaryl "Bob" Wakefield, MBA <ad...@hotmail.com>
> *Sent:* Saturday, August 09, 2014 8:55 PM
> *To:* user@hadoop.apache.org
> *Subject:* Re: Data cleansing in modern data architecture
>
>   Answer: No we can’t get rid of bad records. We have to go back and
> rebuild the entire file. We can’t edit records but we can get rid of entire
> files right? This would suggest that appending data to files isn’t that
> great of an idea. It sounds like it would be more appropriate to cut a
> Hadoop data load up into periodic files (days, months, etc.) that can
> easily be rebuilt should errors occur....
>
> Adaryl "Bob" Wakefield, MBA
> Principal
> Mass Street Analytics
> 913.938.6685
> www.linkedin.com/in/bobwakefieldmba
> Twitter: @BobLovesData
>
>  *From:* Adaryl "Bob" Wakefield, MBA <ad...@hotmail.com>
> *Sent:* Saturday, August 09, 2014 4:01 AM
> *To:* user@hadoop.apache.org
> *Subject:* Re: Data cleansing in modern data architecture
>
>   I’m sorry but I have to revisit this again. Going through the reply
> below I realized that I didn’t quite get my question answered. Let me be
> more explicit with the scenario.
>
> There is a bug in the transactional system.
> The data gets written to HDFS where it winds up in Hive.
> Somebody notices that their report is off/the numbers don’t look right.
> We investigate and find the bug in the transactional system.
>
> Question: Can we then go back into HDFS and rid ourselves of the bad
> records? If not, what is the recommended course of action?
>
> Adaryl "Bob" Wakefield, MBA
> Principal
> Mass Street Analytics
> 913.938.6685
> www.linkedin.com/in/bobwakefieldmba
>
>  *From:* Shahab Yunus <sh...@gmail.com>
> *Sent:* Sunday, July 20, 2014 4:20 PM
> *To:* user@hadoop.apache.org
> *Subject:* Re: Data cleansing in modern data architecture
>
>  I am assuming you meant the batch jobs that are/were used in the old world
> for data cleansing.
>
> As far as I understand there is no hard and fast rule for it; it depends on
> the functional and system requirements of the use case.
>
> It is also dependent on the technology being used and how it manages
> 'deletion'.
>
> E.g. in HBase or Cassandra, you can write batch jobs which clean or
> correct or remove unwanted or incorrect data, and then the underlying stores
> usually have a concept of compaction which not only defragments data files
> but also at this point removes from disk all the entries marked as deleted.
>
> But there are considerations to be aware of given that compaction is a
> heavy process and in some cases (e.g. Cassandra) there can be problems when
> there is too much data to be removed. Not only that, in some cases,
> marked-to-be-deleted data, until it is deleted/compacted can slow down
> normal operations of the data store as well.
>
> In HBase's case one can also leverage the versioning mechanism: the
> afore-mentioned batch job can simply overwrite the same row key, and the
> previous version would no longer be the latest. If the max-versions
> parameter is configured as 1 then no previous version would be maintained
> (physically it would be, and would be removed at compaction time, but would
> not be query-able).
>
> In the end, basically cleansing can be done after or before loading but
> given the append-only and no hard-delete design approaches of most nosql
> stores, I would say it would be easier to do cleaning before data is loaded
> in the nosql store. Of course, it bears repeating that it depends on the
> use case.
>
> Having said that, on a side-note and a bit off-topic, it reminds me of the
> Lambda Architecture, which combines batch and real-time computation for big
> data using various technologies and uses the idea of constant periodic
> refreshes to reload the data; within this periodic refresh, the expectation
> is that any invalid older data would be corrected and overwritten by the new
> refresh load. Thus basically the 'batch part' of the LA takes care of data
> cleansing by reloading everything. But LA is mostly for those systems which
> are ok with eventually consistent behavior and might not be suitable for
> some systems.
>
> Regards,
> Shahab
>
>
> On Sun, Jul 20, 2014 at 2:36 PM, Adaryl "Bob" Wakefield, MBA <
> adaryl.wakefield@hotmail.com> wrote:
>
>>   In the old world, data cleaning used to be a large part of the data
>> warehouse load. Now that we’re working in a schemaless environment, I’m not
>> sure where data cleansing is supposed to take place. NoSQL sounds fun
>> because theoretically you just drop everything in but transactional systems
>> that generate the data are still full of bugs and create junk data.
>>
>> My question is, where does data cleaning/master data management/CDI
>> belong in a modern data architecture? Before it hits Hadoop? After?
>>
>> B.
>>
>
>



-- 
It's just about how deep your longing is!

Re: Data cleansing in modern data architecture

Posted by Sriram Ramachandrasekaran <sr...@gmail.com>.
While, I may not have enough context to your entire processing pipeline,
here are my thoughts.
1. It's always useful to have raw data, irrespective of if it was right or
wrong. The way to look at it is, it's the source of truth at timestamp t.
2. Note that, You only know that the data at timestamp t for an id X was
wrong because, subsequent info about X seem to conflict with the one at t
or some manual debugging finds it out.

All systems that does reporting/analytics is better off by not meddling
with the raw data. There should be processed or computed views of this
data, that massages it, gets rids of noisy data, merges duplicate entries,
etc and then finally produces an output that's suitable for your
reports/analytics. So, your idea to write transaction logs to HDFS is
fine(unless, you are twisting your systems to get it that way), but, you
just need to introduce one more layer of indirection, which has the
business logic to handle noise/errors like this.

For your specific case, you could've a transaction processor up job which
produces a view, that takes care of squashing transactions based on
id(something that makes sense in your system) and then handles the business
logic of how to handle the bugs/discrepancies in them. Your views could be
loaded into a nice columnar store for faster query retrieval(if you have
pointed queries - based on a key), else, a different store would be needed.
Yes, this has the overhead of running the view creation job, but, I think,
the ability to go back to raw data and investigate what happened there is
worth it.

Your approach of structuring it and storing it in HBase is also fine as
long as you keep the concerns separate(if your write/read workloads are
poles apart).

Hope this helps.





On Sun, Aug 10, 2014 at 9:36 AM, Adaryl "Bob" Wakefield, MBA <
adaryl.wakefield@hotmail.com> wrote:

>   Or...as an alternative, since HBASE uses HDFS to store it’s data, can
> we get around the no editing file rule by dropping structured data into
> HBASE? That way, we have data in HDFS that can be deleted. Any real problem
> with that idea?
>
> Adaryl "Bob" Wakefield, MBA
> Principal
> Mass Street Analytics
> 913.938.6685
> www.linkedin.com/in/bobwakefieldmba
> Twitter: @BobLovesData
>
>  *From:* Adaryl "Bob" Wakefield, MBA <ad...@hotmail.com>
> *Sent:* Saturday, August 09, 2014 8:55 PM
> *To:* user@hadoop.apache.org
> *Subject:* Re: Data cleansing in modern data architecture
>
>   Answer: No we can’t get rid of bad records. We have to go back and
> rebuild the entire file. We can’t edit records but we can get rid of entire
> files right? This would suggest that appending data to files isn’t that
> great of an idea. It sounds like it would be more appropriate to cut a
> hadoop data load up into periodic files (days, months, etc.) that can
> easily be rebuilt should errors occur....
>
> Adaryl "Bob" Wakefield, MBA
> Principal
> Mass Street Analytics
> 913.938.6685
> www.linkedin.com/in/bobwakefieldmba
> Twitter: @BobLovesData
>
>  *From:* Adaryl "Bob" Wakefield, MBA <ad...@hotmail.com>
> *Sent:* Saturday, August 09, 2014 4:01 AM
> *To:* user@hadoop.apache.org
> *Subject:* Re: Data cleansing in modern data architecture
>
>   I’m sorry but I have to revisit this again. Going through the reply
> below I realized that I didn’t quite get my question answered. Let me be
> more explicit with the scenario.
>
> There is a bug in the transactional system.
> The data gets written to HDFS where it winds up in Hive.
> Somebody notices that their report is off/the numbers don’t look right.
> We investigate and find the bug in the transactional system.
>
> Question: Can we then go back into HDFS and rid ourselves of the bad
> records? If not, what is the recommended course of action?
>
> Adaryl "Bob" Wakefield, MBA
> Principal
> Mass Street Analytics
> 913.938.6685
> www.linkedin.com/in/bobwakefieldmba
>
>  *From:* Shahab Yunus <sh...@gmail.com>
> *Sent:* Sunday, July 20, 2014 4:20 PM
> *To:* user@hadoop.apache.org
> *Subject:* Re: Data cleansing in modern data architecture
>
>  I am assuming you meant the batch jobs that are/were used in old world
> for data cleansing.
>
> As far as I understand there is no hard and fast rule for it and it
> depends functional and system requirements of the usecase.
>
> It is also dependent on the technology being used and how it manages
> 'deletion'.
>
> E.g. in HBase or Cassandra, you can write batch jobs which clean or
> correct or remove unwanted or incorrect data and than the underlying stores
> usually have a concept of compaction which not only defragments data files
> but also at this point removes from disk all the entries marked as deleted.
>
> But there are considerations to be aware of given that compaction is a
> heavy process and in some cases (e.g. Cassandra) there can be problems when
> there are too much data to be removed. Not only that, in some cases,
> marked-to-be-deleted data, until it is deleted/compacted can slow down
> normal operations of the data store as well.
>
> One can also leverage in HBase's case the versioning mechanism and the
> afore-mentioned batch job can simply overwrite the same row key and the
> previous version would no longer be the latest. If max-version parameter is
> configured as 1 then no previous version would be maintained (physically it
> would be and would be removed at compaction time but would not be
> query-able.)
>
> In the end, basically cleansing can be done after or before loading but
> given the append-only and no hard-delete design approaches of most nosql
> stores, I would say it would be easier to do cleaning before data is loaded
> in the nosql store. Of course, it bears repeating that it depends on the
> use case.
>
> Having said that, on a side-note and a bit off-topic, it reminds me of the
> Lamda Architecture that combines batch and real-time computation for big
> data using various technologies and it uses the idea of constant periodic
> refreshes to reload the data and within this periodic refresh, the
> expectations are that any invalid older data would be corrected and
> overwritten by the new refresh load. Those basically the 'batch part' of
> the LA takes care of data cleansing by reloading everything. But LA is
> mostly for thouse systems which are ok with eventually consistent behavior
> and might not be suitable for some systems.
>
> Regards,
> Shahab
>
>
> On Sun, Jul 20, 2014 at 2:36 PM, Adaryl "Bob" Wakefield, MBA <
> adaryl.wakefield@hotmail.com> wrote:
>
>>   In the old world, data cleaning used to be a large part of the data
>> warehouse load. Now that we’re working in a schemaless environment, I’m not
>> sure where data cleansing is supposed to take place. NoSQL sounds fun
>> because theoretically you just drop everything in but transactional systems
>> that generate the data are still full of bugs and create junk data.
>>
>> My question is, where does data cleaning/master data management/CDI
>> belong in a modern data architecture? Before it hit hits Hadoop? After?
>>
>> B.
>>
>
>



-- 
It's just about how deep your longing is!

Re: Data cleansing in modern data architecture

Posted by Sriram Ramachandrasekaran <sr...@gmail.com>.
While, I may not have enough context to your entire processing pipeline,
here are my thoughts.
1. It's always useful to have raw data, irrespective of if it was right or
wrong. The way to look at it is, it's the source of truth at timestamp t.
2. Note that, You only know that the data at timestamp t for an id X was
wrong because, subsequent info about X seem to conflict with the one at t
or some manual debugging finds it out.

All systems that does reporting/analytics is better off by not meddling
with the raw data. There should be processed or computed views of this
data, that massages it, gets rids of noisy data, merges duplicate entries,
etc and then finally produces an output that's suitable for your
reports/analytics. So, your idea to write transaction logs to HDFS is
fine(unless, you are twisting your systems to get it that way), but, you
just need to introduce one more layer of indirection, which has the
business logic to handle noise/errors like this.

For your specific case, you could've a transaction processor up job which
produces a view, that takes care of squashing transactions based on
id(something that makes sense in your system) and then handles the business
logic of how to handle the bugs/discrepancies in them. Your views could be
loaded into a nice columnar store for faster query retrieval(if you have
pointed queries - based on a key), else, a different store would be needed.
Yes, this has the overhead of running the view creation job, but, I think,
the ability to go back to raw data and investigate what happened there is
worth it.

Your approach of structuring it and storing it in HBase is also fine as
long as you keep the concerns separate(if your write/read workloads are
poles apart).

Hope this helps.





On Sun, Aug 10, 2014 at 9:36 AM, Adaryl "Bob" Wakefield, MBA <
adaryl.wakefield@hotmail.com> wrote:

>   Or...as an alternative, since HBASE uses HDFS to store it’s data, can
> we get around the no editing file rule by dropping structured data into
> HBASE? That way, we have data in HDFS that can be deleted. Any real problem
> with that idea?
>
> Adaryl "Bob" Wakefield, MBA
> Principal
> Mass Street Analytics
> 913.938.6685
> www.linkedin.com/in/bobwakefieldmba
> Twitter: @BobLovesData
>
>  *From:* Adaryl "Bob" Wakefield, MBA <ad...@hotmail.com>
> *Sent:* Saturday, August 09, 2014 8:55 PM
> *To:* user@hadoop.apache.org
> *Subject:* Re: Data cleansing in modern data architecture
>
>   Answer: No we can’t get rid of bad records. We have to go back and
> rebuild the entire file. We can’t edit records but we can get rid of entire
> files right? This would suggest that appending data to files isn’t that
> great of an idea. It sounds like it would be more appropriate to cut a
> hadoop data load up into periodic files (days, months, etc.) that can
> easily be rebuilt should errors occur....
>
> Adaryl "Bob" Wakefield, MBA
> Principal
> Mass Street Analytics
> 913.938.6685
> www.linkedin.com/in/bobwakefieldmba
> Twitter: @BobLovesData
>
>  *From:* Adaryl "Bob" Wakefield, MBA <ad...@hotmail.com>
> *Sent:* Saturday, August 09, 2014 4:01 AM
> *To:* user@hadoop.apache.org
> *Subject:* Re: Data cleansing in modern data architecture
>
>   I’m sorry but I have to revisit this again. Going through the reply
> below I realized that I didn’t quite get my question answered. Let me be
> more explicit with the scenario.
>
> There is a bug in the transactional system.
> The data gets written to HDFS where it winds up in Hive.
> Somebody notices that their report is off/the numbers don’t look right.
> We investigate and find the bug in the transactional system.
>
> Question: Can we then go back into HDFS and rid ourselves of the bad
> records? If not, what is the recommended course of action?
>
> Adaryl "Bob" Wakefield, MBA
> Principal
> Mass Street Analytics
> 913.938.6685
> www.linkedin.com/in/bobwakefieldmba
>
>  *From:* Shahab Yunus <sh...@gmail.com>
> *Sent:* Sunday, July 20, 2014 4:20 PM
> *To:* user@hadoop.apache.org
> *Subject:* Re: Data cleansing in modern data architecture
>
>  I am assuming you meant the batch jobs that are/were used in the old world
> for data cleansing.
>
> As far as I understand there is no hard and fast rule for it and it
> depends on the functional and system requirements of the use case.
>
> It is also dependent on the technology being used and how it manages
> 'deletion'.
>
> E.g. in HBase or Cassandra, you can write batch jobs which clean or
> correct or remove unwanted or incorrect data and then the underlying stores
> usually have a concept of compaction which not only defragments data files
> but also at this point removes from disk all the entries marked as deleted.
>
> But there are considerations to be aware of given that compaction is a
> heavy process and in some cases (e.g. Cassandra) there can be problems when
> there is too much data to be removed. Not only that, in some cases,
> marked-to-be-deleted data, until it is deleted/compacted can slow down
> normal operations of the data store as well.
>
> One can also leverage in HBase's case the versioning mechanism and the
> afore-mentioned batch job can simply overwrite the same row key and the
> previous version would no longer be the latest. If the max-versions parameter is
> configured as 1 then no previous version would be maintained (physically it
> would be and would be removed at compaction time but would not be
> query-able.)
>
> In the end, basically cleansing can be done after or before loading but
> given the append-only and no hard-delete design approaches of most nosql
> stores, I would say it would be easier to do cleaning before data is loaded
> in the nosql store. Of course, it bears repeating that it depends on the
> use case.
>
> Having said that, on a side-note and a bit off-topic, it reminds me of the
> Lambda Architecture that combines batch and real-time computation for big
> data using various technologies and it uses the idea of constant periodic
> refreshes to reload the data and within this periodic refresh, the
> expectations are that any invalid older data would be corrected and
> overwritten by the new refresh load. Thus basically the 'batch part' of
> the LA takes care of data cleansing by reloading everything. But LA is
> mostly for those systems which are ok with eventually consistent behavior
> and might not be suitable for some systems.
>
> Regards,
> Shahab
>
>
> On Sun, Jul 20, 2014 at 2:36 PM, Adaryl "Bob" Wakefield, MBA <
> adaryl.wakefield@hotmail.com> wrote:
>
>>   In the old world, data cleaning used to be a large part of the data
>> warehouse load. Now that we’re working in a schemaless environment, I’m not
>> sure where data cleansing is supposed to take place. NoSQL sounds fun
>> because theoretically you just drop everything in but transactional systems
>> that generate the data are still full of bugs and create junk data.
>>
>> My question is, where does data cleaning/master data management/CDI
>> belong in a modern data architecture? Before it hits Hadoop? After?
>>
>> B.
>>
>
>



-- 
It's just about how deep your longing is!

Re: Data cleansing in modern data architecture

Posted by "Adaryl \"Bob\" Wakefield, MBA" <ad...@hotmail.com>.
Or...as an alternative, since HBASE uses HDFS to store its data, can we get around the rule against editing files by dropping structured data into HBASE? That way, we have data in HDFS that can be deleted. Any real problem with that idea?
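
(Purely as an illustration of that idea: if the records landed in an HBase table, the bad rows could then be deleted directly. The table name, row keys and host below are made up, and happybase is just one convenient Python client over the Thrift gateway.)

import happybase

# Connect via the HBase Thrift server (placeholder host name).
connection = happybase.Connection('hbase-thrift-host')
table = connection.table('transactions')

# Row keys that the investigation traced back to the buggy release.
bad_row_keys = [b'txn-000123', b'txn-000124']

for key in bad_row_keys:
    table.delete(key)  # writes a tombstone; compaction reclaims the space later

connection.close()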

Adaryl "Bob" Wakefield, MBA
Principal
Mass Street Analytics
913.938.6685
www.linkedin.com/in/bobwakefieldmba
Twitter: @BobLovesData

From: Adaryl "Bob" Wakefield, MBA 
Sent: Saturday, August 09, 2014 8:55 PM
To: user@hadoop.apache.org 
Subject: Re: Data cleansing in modern data architecture

Answer: No we can’t get rid of bad records. We have to go back and rebuild the entire file. We can’t edit records but we can get rid of entire files right? This would suggest that appending data to files isn’t that great of an idea. It sounds like it would be more appropriate to cut a hadoop data load up into periodic files (days, months, etc.) that can easily be rebuilt should errors occur....

Adaryl "Bob" Wakefield, MBA
Principal
Mass Street Analytics
913.938.6685
www.linkedin.com/in/bobwakefieldmba
Twitter: @BobLovesData

From: Adaryl "Bob" Wakefield, MBA 
Sent: Saturday, August 09, 2014 4:01 AM
To: user@hadoop.apache.org 
Subject: Re: Data cleansing in modern data architecture

I’m sorry but I have to revisit this again. Going through the reply below I realized that I didn’t quite get my question answered. Let me be more explicit with the scenario.

There is a bug in the transactional system.
The data gets written to HDFS where it winds up in Hive.
Somebody notices that their report is off/the numbers don’t look right.
We investigate and find the bug in the transactional system.

Question: Can we then go back into HDFS and rid ourselves of the bad records? If not, what is the recommended course of action?

Adaryl "Bob" Wakefield, MBA
Principal
Mass Street Analytics
913.938.6685
www.linkedin.com/in/bobwakefieldmba

From: Shahab Yunus 
Sent: Sunday, July 20, 2014 4:20 PM
To: user@hadoop.apache.org 
Subject: Re: Data cleansing in modern data architecture

I am assuming you meant the batch jobs that are/were used in the old world for data cleansing.

As far as I understand there is no hard and fast rule for it and it depends on the functional and system requirements of the use case.

It is also dependent on the technology being used and how it manages 'deletion'.

E.g. in HBase or Cassandra, you can write batch jobs which clean or correct or remove unwanted or incorrect data and then the underlying stores usually have a concept of compaction which not only defragments data files but also at this point removes from disk all the entries marked as deleted.

But there are considerations to be aware of given that compaction is a heavy process and in some cases (e.g. Cassandra) there can be problems when there is too much data to be removed. Not only that, in some cases, marked-to-be-deleted data, until it is deleted/compacted can slow down normal operations of the data store as well.

One can also leverage in HBase's case the versioning mechanism and the afore-mentioned batch job can simply overwrite the same row key and the previous version would no longer be the latest. If the max-versions parameter is configured as 1 then no previous version would be maintained (physically it would be and would be removed at compaction time but would not be query-able.)

In the end, basically cleansing can be done after or before loading but given the append-only and no hard-delete design approaches of most nosql stores, I would say it would be easier to do cleaning before data is loaded in the nosql store. Of course, it bears repeating that it depends on the use case.

Having said that, on a side-note and a bit off-topic, it reminds me of the Lambda Architecture that combines batch and real-time computation for big data using various technologies and it uses the idea of constant periodic refreshes to reload the data and within this periodic refresh, the expectations are that any invalid older data would be corrected and overwritten by the new refresh load. Thus basically the 'batch part' of the LA takes care of data cleansing by reloading everything. But LA is mostly for those systems which are ok with eventually consistent behavior and might not be suitable for some systems.

Regards,
Shahab



On Sun, Jul 20, 2014 at 2:36 PM, Adaryl "Bob" Wakefield, MBA <ad...@hotmail.com> wrote:

  In the old world, data cleaning used to be a large part of the data warehouse load. Now that we’re working in a schemaless environment, I’m not sure where data cleansing is supposed to take place. NoSQL sounds fun because theoretically you just drop everything in but transactional systems that generate the data are still full of bugs and create junk data. 

  My question is, where does data cleaning/master data management/CDI belong in a modern data architecture? Before it hits Hadoop? After?

  B.
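
(To make the versioning point quoted above concrete, here is a hypothetical sketch, again with the happybase client and made-up names: a column family that keeps a single version, so a corrective batch job can simply re-put the same row keys and the bad values stop being visible to queries.)

import happybase

connection = happybase.Connection('hbase-thrift-host')

# Column family 'd' is configured to retain only one version per cell.
connection.create_table('invoices_clean', {'d': dict(max_versions=1)})
table = connection.table('invoices_clean')

# Re-putting a row makes the corrected value the only one returned;
# the old cell is physically removed at the next major compaction.
table.put(b'txn-000123', {b'd:amount': b'19.99', b'd:status': b'corrected'})

connection.close()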

Re: Data cleansing in modern data architecture

Posted by "Adaryl \"Bob\" Wakefield, MBA" <ad...@hotmail.com>.
Answer: No we can’t get rid of bad records. We have to go back and rebuild the entire file. We can’t edit records, but we can get rid of entire files, right? This would suggest that appending data to files isn’t that great of an idea. It sounds like it would be more appropriate to cut a Hadoop data load up into periodic files (days, months, etc.) that can easily be rebuilt should errors occur....
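
That is essentially the usual pattern: partition the load by a time unit and rebuild only the slice that went bad. A rough sketch (the paths, layout and date are placeholders) that rewrites a single day's Parquet partition with Spark:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("rebuild-bad-day").getOrCreate()

# Re-derive just the day that the transactional-system bug corrupted.
fixed_day = spark.read.json("hdfs:///staging/transactions/2014-08-01/")

# Overwrite only that day's partition directory; other days stay untouched.
(fixed_day.write
          .mode("overwrite")
          .parquet("hdfs:///warehouse/transactions/load_date=2014-08-01/"))

A Hive table partitioned on load_date can keep pointing at the same location, so only the bad slice ever has to be recomputed.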

Adaryl "Bob" Wakefield, MBA
Principal
Mass Street Analytics
913.938.6685
www.linkedin.com/in/bobwakefieldmba
Twitter: @BobLovesData

From: Adaryl "Bob" Wakefield, MBA 
Sent: Saturday, August 09, 2014 4:01 AM
To: user@hadoop.apache.org 
Subject: Re: Data cleansing in modern data architecture

I’m sorry but I have to revisit this again. Going through the reply below I realized that I didn’t quite get my question answered. Let me be more explicit with the scenario.

There is a bug in the transactional system.
The data gets written to HDFS where it winds up in Hive.
Somebody notices that their report is off/the numbers don’t look right.
We investigate and find the bug in the transactional system.

Question: Can we then go back into HDFS and rid ourselves of the bad records? If not, what is the recommended course of action?

Adaryl "Bob" Wakefield, MBA
Principal
Mass Street Analytics
913.938.6685
www.linkedin.com/in/bobwakefieldmba

From: Shahab Yunus 
Sent: Sunday, July 20, 2014 4:20 PM
To: user@hadoop.apache.org 
Subject: Re: Data cleansing in modern data architecture

I am assuming you meant the batch jobs that are/were used in the old world for data cleansing.

As far as I understand there is no hard and fast rule for it and it depends on the functional and system requirements of the use case.

It is also dependent on the technology being used and how it manages 'deletion'.

E.g. in HBase or Cassandra, you can write batch jobs which clean or correct or remove unwanted or incorrect data and then the underlying stores usually have a concept of compaction which not only defragments data files but also at this point removes from disk all the entries marked as deleted.

But there are considerations to be aware of given that compaction is a heavy process and in some cases (e.g. Cassandra) there can be problems when there is too much data to be removed. Not only that, in some cases, marked-to-be-deleted data, until it is deleted/compacted can slow down normal operations of the data store as well.

One can also leverage in HBase's case the versioning mechanism and the afore-mentioned batch job can simply overwrite the same row key and the previous version would no longer be the latest. If the max-versions parameter is configured as 1 then no previous version would be maintained (physically it would be and would be removed at compaction time but would not be query-able.)

In the end, basically cleansing can be done after or before loading but given the append-only and no hard-delete design approaches of most nosql stores, I would say it would be easier to do cleaning before data is loaded in the nosql store. Of course, it bears repeating that it depends on the use case.

Having said that, on a side-note and a bit off-topic, it reminds me of the Lambda Architecture that combines batch and real-time computation for big data using various technologies and it uses the idea of constant periodic refreshes to reload the data and within this periodic refresh, the expectations are that any invalid older data would be corrected and overwritten by the new refresh load. Thus basically the 'batch part' of the LA takes care of data cleansing by reloading everything. But LA is mostly for those systems which are ok with eventually consistent behavior and might not be suitable for some systems.

Regards,
Shahab



On Sun, Jul 20, 2014 at 2:36 PM, Adaryl "Bob" Wakefield, MBA <ad...@hotmail.com> wrote:

  In the old world, data cleaning used to be a large part of the data warehouse load. Now that we’re working in a schemaless environment, I’m not sure where data cleansing is supposed to take place. NoSQL sounds fun because theoretically you just drop everything in but transactional systems that generate the data are still full of bugs and create junk data. 

  My question is, where does data cleaning/master data management/CDI belong in a modern data architecture? Before it hits Hadoop? After?

  B.

Re: Data cleansing in modern data architecture

Posted by "Adaryl \"Bob\" Wakefield, MBA" <ad...@hotmail.com>.
I’m sorry but I have to revisit this again. Going through the reply below I realized that I didn’t quite get my question answered. Let me be more explicit with the scenario.

There is a bug in the transactional system.
The data gets written to HDFS where it winds up in Hive.
Somebody notices that their report is off/the numbers don’t look right.
We investigate and find the bug in the transactional system.

Question: Can we then go back into HDFS and rid ourselves of the bad records? If not, what is the recommended course of action?

Adaryl "Bob" Wakefield, MBA
Principal
Mass Street Analytics
913.938.6685
www.linkedin.com/in/bobwakefieldmba

From: Shahab Yunus 
Sent: Sunday, July 20, 2014 4:20 PM
To: user@hadoop.apache.org 
Subject: Re: Data cleansing in modern data architecture

I am assuming you meant the batch jobs that are/were used in old world for data cleansing. 

As far as I understand there is no hard and fast rule for it and it depends functional and system requirements of the usecase. 

It is also dependent on the technology being used and how it manages 'deletion'.

E.g. in HBase or Cassandra, you can write batch jobs which clean or correct or remove unwanted or incorrect data and than the underlying stores usually have a concept of compaction which not only defragments data files but also at this point removes from disk all the entries marked as deleted.

But there are considerations to be aware of given that compaction is a heavy process and in some cases (e.g. Cassandra) there can be problems when there are too much data to be removed. Not only that, in some cases, marked-to-be-deleted data, until it is deleted/compacted can slow down normal operations of the data store as well.

One can also leverage in HBase's case the versioning mechanism and the afore-mentioned batch job can simply overwrite the same row key and the previous version would no longer be the latest. If max-version parameter is configured as 1 then no previous version would be maintained (physically it would be and would be removed at compaction time but would not be query-able.)

In the end, basically cleansing can be done after or before loading but given the append-only and no hard-delete design approaches of most nosql stores, I would say it would be easier to do cleaning before data is loaded in the nosql store. Of course, it bears repeating that it depends on the use case.

Having said that, on a side-note and a bit off-topic, it reminds me of the Lamda Architecture that combines batch and real-time computation for big data using various technologies and it uses the idea of constant periodic refreshes to reload the data and within this periodic refresh, the expectations are that any invalid older data would be corrected and overwritten by the new refresh load. Those basically the 'batch part' of the LA takes care of data cleansing by reloading everything. But LA is mostly for thouse systems which are ok with eventually consistent behavior and might not be suitable for some systems.

Regards,
Shahab



On Sun, Jul 20, 2014 at 2:36 PM, Adaryl "Bob" Wakefield, MBA <ad...@hotmail.com> wrote:

  In the old world, data cleaning used to be a large part of the data warehouse load. Now that we’re working in a schemaless environment, I’m not sure where data cleansing is supposed to take place. NoSQL sounds fun because theoretically you just drop everything in but transactional systems that generate the data are still full of bugs and create junk data. 

  My question is, where does data cleaning/master data management/CDI belong in a modern data architecture? Before it hit hits Hadoop? After?

  B.

Re: Data cleansing in modern data architecture

Posted by "Adaryl \"Bob\" Wakefield, MBA" <ad...@hotmail.com>.
I’m sorry but I have to revisit this again. Going through the reply below I realized that I didn’t quite get my question answered. Let me be more explicit with the scenario.

There is a bug in the transactional system.
The data gets written to HDFS where it winds up in Hive.
Somebody notices that their report is off/the numbers don’t look right.
We investigate and find the bug in the transactional system.

Question: Can we then go back into HDFS and rid ourselves of the bad records? If not, what is the recommended course of action?

Adaryl "Bob" Wakefield, MBA
Principal
Mass Street Analytics
913.938.6685
www.linkedin.com/in/bobwakefieldmba

From: Shahab Yunus 
Sent: Sunday, July 20, 2014 4:20 PM
To: user@hadoop.apache.org 
Subject: Re: Data cleansing in modern data architecture

I am assuming you meant the batch jobs that are/were used in old world for data cleansing. 

As far as I understand there is no hard and fast rule for it and it depends functional and system requirements of the usecase. 

It is also dependent on the technology being used and how it manages 'deletion'.

E.g. in HBase or Cassandra, you can write batch jobs which clean or correct or remove unwanted or incorrect data and than the underlying stores usually have a concept of compaction which not only defragments data files but also at this point removes from disk all the entries marked as deleted.

But there are considerations to be aware of, given that compaction is a heavy process and in some cases (e.g. Cassandra) there can be problems when there is too much data to be removed. Not only that, in some cases marked-to-be-deleted data, until it is deleted/compacted, can slow down normal operations of the data store as well.

In HBase's case one can also leverage the versioning mechanism: the aforementioned batch job can simply overwrite the same row key so that the previous version is no longer the latest. If the max-versions parameter is configured as 1 then no previous version is maintained (physically it would still exist until it is removed at compaction time, but it would not be query-able).

In the end, cleansing can basically be done before or after loading, but given the append-only and no-hard-delete design approach of most NoSQL stores, I would say it is easier to do the cleaning before the data is loaded into the NoSQL store. Of course, it bears repeating that it depends on the use case.

Having said that, on a side note and a bit off-topic, this reminds me of the Lambda Architecture, which combines batch and real-time computation for big data using various technologies and relies on constant periodic refreshes to reload the data; within each periodic refresh, the expectation is that any invalid older data will be corrected and overwritten by the new refresh load. Thus the 'batch part' of the LA basically takes care of data cleansing by reloading everything. But the LA is mostly for those systems which are OK with eventually consistent behavior and might not be suitable for some systems.

Regards,
Shahab



On Sun, Jul 20, 2014 at 2:36 PM, Adaryl "Bob" Wakefield, MBA <ad...@hotmail.com> wrote:

  In the old world, data cleaning used to be a large part of the data warehouse load. Now that we’re working in a schemaless environment, I’m not sure where data cleansing is supposed to take place. NoSQL sounds fun because theoretically you just drop everything in, but transactional systems that generate the data are still full of bugs and create junk data. 

  My question is, where does data cleaning/master data management/CDI belong in a modern data architecture? Before it hits Hadoop? After?

  B.

Re: Data cleansing in modern data architecture

Posted by Shahab Yunus <sh...@gmail.com>.
I am assuming you meant the batch jobs that are/were used in the old world for
data cleansing.

As far as I understand there is no hard and fast rule for it and it depends on
the functional and system requirements of the use case.

It is also dependent on the technology being used and how it manages
'deletion'.

E.g. in HBase or Cassandra, you can write batch jobs which clean or correct
or remove unwanted or incorrect data, and then the underlying stores usually
have a concept of compaction which not only defragments data files but also
at this point removes from disk all the entries marked as deleted.

But there are considerations to be aware of given that compaction is a
heavy process and in some cases (e.g. Cassandra) there can be problems when
there is too much data to be removed. Not only that, in some cases,
marked-to-be-deleted data, until it is deleted/compacted can slow down
normal operations of the data store as well.

In HBase's case one can also leverage the versioning mechanism, and the
aforementioned batch job can simply overwrite the same row key so that the
previous version is no longer the latest. If the max-versions parameter is
configured as 1 then no previous version would be maintained (physically it
would still exist until it is removed at compaction time, but it would not be
query-able.)

In the end, basically cleansing can be done after or before loading but
given the append-only and no hard-delete design approaches of most nosql
stores, I would say it would be easier to do cleaning before data is loaded
in the nosql store. Of course, it bears repeating that it depends on the
use case.

Having said that, on a side note and a bit off-topic, it reminds me of the
Lambda Architecture, which combines batch and real-time computation for big
data using various technologies and uses the idea of constant periodic
refreshes to reload the data; within this periodic refresh, the
expectation is that any invalid older data would be corrected and
overwritten by the new refresh load. Thus basically the 'batch part' of
the LA takes care of data cleansing by reloading everything. But the LA is
mostly for those systems which are OK with eventually consistent behavior
and might not be suitable for some systems.

Regards,
Shahab


On Sun, Jul 20, 2014 at 2:36 PM, Adaryl "Bob" Wakefield, MBA <
adaryl.wakefield@hotmail.com> wrote:

>   In the old world, data cleaning used to be a large part of the data
> warehouse load. Now that we’re working in a schemaless environment, I’m not
> sure where data cleansing is supposed to take place. NoSQL sounds fun
> because theoretically you just drop everything in but transactional systems
> that generate the data are still full of bugs and create junk data.
>
> My question is, where does data cleaning/master data management/CDI belong
> in a modern data architecture? Before it hits Hadoop? After?
>
> B.
>

Data cleansing in modern data architecture

Posted by "Adaryl \"Bob\" Wakefield, MBA" <ad...@hotmail.com>.
In the old world, data cleaning used to be a large part of the data warehouse load. Now that we’re working in a schemaless environment, I’m not sure where data cleansing is supposed to take place. NoSQL sounds fun because theoretically you just drop everything in, but transactional systems that generate the data are still full of bugs and create junk data. 

My question is, where does data cleaning/master data management/CDI belong in a modern data architecture? Before it hits Hadoop? After?

B.

Re: Merging small files

Posted by "Adaryl \"Bob\" Wakefield, MBA" <ad...@hotmail.com>.
That’s an interesting use case for Storm. Usually people talk about Storm in terms of processing things like Twitter streams or events like web logs. I’ve never seen it used for processing files, especially EDI files, which usually come in as groups of transactions instead of atomic events like a single line item in an invoice.

Adaryl "Bob" Wakefield, MBA
Principal
Mass Street Analytics
913.938.6685
www.linkedin.com/in/bobwakefieldmba

From: Mark Kerzner 
Sent: Sunday, July 20, 2014 2:08 PM
To: Hadoop User 
Subject: Re: Merging small files

Bob, 

you don't have to wait for batch. Here is my project (under development) where I am using Storm for continuous file processing, https://github.com/markkerzner/3VEed

Mark



On Sun, Jul 20, 2014 at 1:31 PM, Adaryl "Bob" Wakefield, MBA <ad...@hotmail.com> wrote:

  Yeah, I’m sorry, I’m not talking about processing the files in Oracle. I mean collect/store invoices in Oracle, then flush them in a batch to Hadoop. This is not real time, right? So you take your EDI, CSV and XML from their sources. Store them in Oracle. Once you have a decent size, flush them to Hadoop in one big file, process them, then store the results of the processing in Oracle.

  Source file –> Oracle –> Hadoop –> Oracle
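
  For example, once a batch has been exported from the staging database as one large file, pushing it into HDFS is a single FileSystem call (a sketch with hypothetical paths):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class PushBatchToHdfs {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        // One merged batch file exported from the Oracle staging area (hypothetical paths).
        fs.copyFromLocalFile(new Path("/staging/export/invoices-batch-0001.csv"),
                             new Path("/data/incoming/invoices-batch-0001.csv"));
      }
    }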

  Adaryl "Bob" Wakefield, MBA
  Principal
  Mass Street Analytics
  913.938.6685
  www.linkedin.com/in/bobwakefieldmba

  From: Shashidhar Rao 
  Sent: Sunday, July 20, 2014 12:47 PM
  To: user@hadoop.apache.org 
  Subject: Re: Merging small files

  Spring Batch is used to process the files, which come in EDI, CSV & XML format, and store them into Oracle after processing, but this is for a very small division. Imagine invoices generated roughly by 5 million customers every week from all stores plus from online purchases. The time to process such massive data would not be acceptable, even though Oracle would be a good choice as Adaryl Bob has suggested. Each invoice is not even 10 KB and we have no choice but to use Hadoop, but we need further processing of the input files just to make Hadoop happy.




  On Sun, Jul 20, 2014 at 10:07 PM, Adaryl "Bob" Wakefield, MBA <ad...@hotmail.com> wrote:

    “Even if we kept the discussion to the mailing list's technical Hadoop usage focus, any company/organization looking to use a distro is going to have to consider the costs, support, platform, partner ecosystem, market share, company strategy, etc.”

    Yeah good point.

    Adaryl "Bob" Wakefield, MBA
    Principal
    Mass Street Analytics
    913.938.6685
    www.linkedin.com/in/bobwakefieldmba

    From: Shahab Yunus 
    Sent: Sunday, July 20, 2014 11:32 AM
    To: user@hadoop.apache.org 
    Subject: Re: Merging small files

    As for why it isn't appropriate to discuss too many vendor-specific topics on a vendor-neutral Apache mailing list, check out this thread: 
    http://mail-archives.apache.org/mod_mbox/hadoop-mapreduce-user/201309.mbox/%3CCAJ1NbZcoCw1RSNCf3H-ikjKK4uqxQXT7avsJ-6NahQ_e4dXYGA@mail.gmail.com%3E


    You can always discuss vendor-specific issues in their respective mailing lists.

    As for merging files: yes, one can use HBase, but then you have to keep in mind that you are adding the overhead of development and maintenance of another store (i.e. HBase). If your use case can be satisfied with HDFS alone then why not keep it simple? And given the knowledge of the requirements that the OP provided, I think the SequenceFile format should work, as I suggested initially. Of course, if things get too complicated from a requirements perspective then one might try out HBase.
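
    For example, a minimal sketch, assuming Hadoop 2.x and hypothetical paths, that packs a directory of small invoice files into one SequenceFile, keyed by the original file name with the raw bytes as the value:

    import java.io.ByteArrayOutputStream;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.BytesWritable;
    import org.apache.hadoop.io.IOUtils;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.Text;

    public class PackInvoices {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path in = new Path("/incoming/invoices");                       // hypothetical directory of ~4 KB files
        Path out = new Path("/data/invoices/invoices-2014-07-20.seq");  // hypothetical packed output
        SequenceFile.Writer writer = SequenceFile.createWriter(conf,
            SequenceFile.Writer.file(out),
            SequenceFile.Writer.keyClass(Text.class),
            SequenceFile.Writer.valueClass(BytesWritable.class));
        try {
          for (FileStatus stat : fs.listStatus(in)) {
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            IOUtils.copyBytes(fs.open(stat.getPath()), buf, conf, true);  // read one small file fully
            // original file name as the key, raw invoice bytes as the value
            writer.append(new Text(stat.getPath().getName()), new BytesWritable(buf.toByteArray()));
          }
        } finally {
          writer.close();
        }
      }
    }

    Downstream MapReduce jobs can then read the single packed file with SequenceFileInputFormat instead of opening millions of tiny files.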

    Regards,
    Shahab



    On Sun, Jul 20, 2014 at 12:24 PM, Adaryl "Bob" Wakefield, MBA <ad...@hotmail.com> wrote:

      It isn’t? I don’t wanna hijack the thread or anything, but it seems to me that MapR is an implementation of Hadoop and this is a great place to discuss its merits vis-à-vis the Hortonworks or Cloudera offering. 

      A little bit more on topic: Every single thing I read or watch about Hadoop says that many small files are a bad idea and that you should merge them into larger files. I’ll take this a step further. If your invoice data is so small, perhaps Hadoop isn’t the proper solution to whatever it is you are trying to do and a more traditional RDBMS approach would be more appropriate. Someone suggested HBase and I was going to suggest maybe one of the other NoSQL databases; however, I remember that Eddie Satterly of Splunk says that financial data is the ONE use case where a traditional approach is more appropriate. You can watch his talk here:

      https://www.youtube.com/watch?v=-N9i-YXoQBE&index=77&list=WL

      Adaryl "Bob" Wakefield, MBA
      Principal
      Mass Street Analytics
      913.938.6685
      www.linkedin.com/in/bobwakefieldmba

      From: Kilaru, Sambaiah 
      Sent: Sunday, July 20, 2014 3:47 AM
      To: user@hadoop.apache.org 
      Subject: Re: Merging small files

      This is not place to discuss merits or demerits of MapR, Small files screw up very badly with Mapr.
      Small files go into one container (to fill up 256MB or what ever container size) and with locality most
      Of the mappers go to three datanodes.

      You should be looking into sequence file format.

      Thanks,
      Sam

      From: "M. C. Srivas" <mc...@gmail.com>
      Reply-To: "user@hadoop.apache.org" <us...@hadoop.apache.org>
      Date: Sunday, July 20, 2014 at 8:01 AM
      To: "user@hadoop.apache.org" <us...@hadoop.apache.org>
      Subject: Re: Merging small files


      You should look at MapR .... a few 100's of billions of small files is absolutely no problem. (disc: I work for MapR)



      On Sat, Jul 19, 2014 at 10:29 AM, Shashidhar Rao <ra...@gmail.com> wrote:

        Hi ,


        Has anybody worked in retail use case. If my production Hadoop cluster block size is 256 MB but generally if we have to process retail invoice data , each invoice data is merely let's say 4 KB . Do we merge the invoice data to make one large file say 1 GB . What is the best practice in this scenario



        Regards

        Shashi





Re: Merging small files

Posted by Mark Kerzner <ma...@shmsoft.com>.
Bob,

you don't have to wait for batch. Here is my project (under development)
where I am using Storm for continuous file processing,
https://github.com/markkerzner/3VEed

Mark


On Sun, Jul 20, 2014 at 1:31 PM, Adaryl "Bob" Wakefield, MBA <
adaryl.wakefield@hotmail.com> wrote:

>   Yeah, I’m sorry, I’m not talking about processing the files in Oracle. I
> mean collect/store invoices in Oracle, then flush them in a batch to Hadoop.
> This is not real time, right? So you take your EDI, CSV and XML from their
> sources. Store them in Oracle. Once you have a decent size, flush them to
> Hadoop in one big file, process them, then store the results of the
> processing in Oracle.
>
> Source file –> Oracle –> Hadoop –> Oracle
>
> Adaryl "Bob" Wakefield, MBA
> Principal
> Mass Street Analytics
> 913.938.6685
> www.linkedin.com/in/bobwakefieldmba
>
>  *From:* Shashidhar Rao <ra...@gmail.com>
> *Sent:* Sunday, July 20, 2014 12:47 PM
> *To:* user@hadoop.apache.org
> *Subject:* Re: Merging small files
>
>  Spring Batch is used to process the files which come in EDI, CSV & XML
> format and store it into Oracle after processing, but this is for a very
> small division. Imagine invoices generated  roughly  by 5 million customers
> every week from  all stores plus from online purchases. Time to process
> such massive data would not be acceptable even though Oracle would be a
> good choice as Adaryl Bob has suggested. Each invoice is not even 10 k and
> we have no choice but to use Hadoop, but need further processing of input
> files just to make hadoop happy .
>
>
> On Sun, Jul 20, 2014 at 10:07 PM, Adaryl "Bob" Wakefield, MBA <
> adaryl.wakefield@hotmail.com> wrote:
>
>>   “Even if we kept the discussion to the mailing list's technical Hadoop
>> usage focus, any company/organization looking to use a distro is going to
>> have to consider the costs, support, platform, partner ecosystem, market
>> share, company strategy, etc.”
>>
>> Yeah good point.
>>
>> Adaryl "Bob" Wakefield, MBA
>> Principal
>> Mass Street Analytics
>> 913.938.6685
>> www.linkedin.com/in/bobwakefieldmba
>>
>>  *From:* Shahab Yunus <sh...@gmail.com>
>> *Sent:* Sunday, July 20, 2014 11:32 AM
>>  *To:* user@hadoop.apache.org
>> *Subject:* Re: Merging small files
>>
>>   Why it isn't appropriate to discuss too much vendor specific topics on
>> a vendor-neutral apache mailing list? Checkout this thread:
>>
>> http://mail-archives.apache.org/mod_mbox/hadoop-mapreduce-user/201309.mbox/%3CCAJ1NbZcoCw1RSNCf3H-ikjKK4uqxQXT7avsJ-6NahQ_e4dXYGA@mail.gmail.com%3E
>>
>> You can always discuss vendor specific issues in their respective mailing
>> lists.
>>
>> As for merging files, Yes one can use HBase but then you have to keep in
>> mind that you are adding the overhead of development and maintenance of
>> another store (i.e. HBase). If your use case could be satisfied with HDFS
>> alone then why not keep it simple? And given the knowledge of the
>> requirements that the OP provided, I think Sequence File format should work
>> as I suggested initially. Of course, if things get too complicated from
>> requirements perspective then one might try out HBase.
>>
>> Regards,
>> Shahab
>>
>>
>> On Sun, Jul 20, 2014 at 12:24 PM, Adaryl "Bob" Wakefield, MBA <
>> adaryl.wakefield@hotmail.com> wrote:
>>
>>>   It isn’t? I don’t wanna hijack the thread or anything but it seems to
>>> me that MapR is an implementation of Hadoop and this is a great place to
>>> discuss its merits vis-à-vis the Hortonworks or Cloudera offering.
>>>
>>> A little bit more on topic: Every single thing I read or watch about
>>> Hadoop says that many small files is a bad idea and that you should merge
>>> them into larger files. I’ll take this a step further. If your invoice data
>>> is so small, perhaps Hadoop isn’t the proper solution to whatever it is you
>>> are trying to do and a more traditional RDBMS approach would be more
>>> appropriate. Someone suggested HBase and I was going to suggest maybe one
>>> of the other NoSQL databases, however, I remember that Eddie Satterly of
>>> Splunk says that financial data is the ONE use case where a traditional
>>> approach is more appropriate. You can watch his talk here:
>>>
>>> https://www.youtube.com/watch?v=-N9i-YXoQBE&index=77&list=WL
>>>
>>> Adaryl "Bob" Wakefield, MBA
>>> Principal
>>> Mass Street Analytics
>>> 913.938.6685
>>> www.linkedin.com/in/bobwakefieldmba
>>>
>>>  *From:* Kilaru, Sambaiah <Sa...@intuit.com>
>>> *Sent:* Sunday, July 20, 2014 3:47 AM
>>> *To:* user@hadoop.apache.org
>>> *Subject:* Re: Merging small files
>>>
>>>  This is not place to discuss merits or demerits of MapR, Small files
>>> screw up very badly with Mapr.
>>> Small files go into one container (to fill up 256MB or what ever
>>> container size) and with locality most
>>> Of the mappers go to three datanodes.
>>>
>>> You should be looking into sequence file format.
>>>
>>> Thanks,
>>> Sam
>>>
>>> From: "M. C. Srivas" <mc...@gmail.com>
>>> Reply-To: "user@hadoop.apache.org" <us...@hadoop.apache.org>
>>> Date: Sunday, July 20, 2014 at 8:01 AM
>>> To: "user@hadoop.apache.org" <us...@hadoop.apache.org>
>>> Subject: Re: Merging small files
>>>
>>>  You should look at MapR .... a few 100's of billions of small files is
>>> absolutely no problem. (disc: I work for MapR)
>>>
>>>
>>> On Sat, Jul 19, 2014 at 10:29 AM, Shashidhar Rao <
>>> raoshashidhar123@gmail.com> wrote:
>>>
>>>>   Hi ,
>>>>
>>>> Has anybody worked in retail use case. If my production Hadoop cluster
>>>> block size is 256 MB but generally if we have to process retail invoice
>>>> data , each invoice data is merely let's say 4 KB . Do we merge the invoice
>>>> data to make one large file say 1 GB . What is the best practice in this
>>>> scenario
>>>>
>>>>
>>>> Regards
>>>> Shashi
>>>>
>>>
>>>
>>
>>
>
>

>>> Reply-To: "user@hadoop.apache.org" <us...@hadoop.apache.org>
>>> Date: Sunday, July 20, 2014 at 8:01 AM
>>> To: "user@hadoop.apache.org" <us...@hadoop.apache.org>
>>> Subject: Re: Merging small files
>>>
>>>  You should look at MapR .... a few 100's of billions of small files is
>>> absolutely no problem. (disc: I work for MapR)
>>>
>>>
>>> On Sat, Jul 19, 2014 at 10:29 AM, Shashidhar Rao <
>>> raoshashidhar123@gmail.com> wrote:
>>>
>>>>   Hi ,
>>>>
>>>> Has anybody worked in retail use case. If my production Hadoop cluster
>>>> block size is 256 MB but generally if we have to process retail invoice
>>>> data , each invoice data is merely let's say 4 KB . Do we merge the invoice
>>>> data to make one large file say 1 GB . What is the best practice in this
>>>> scenario
>>>>
>>>>
>>>> Regards
>>>> Shashi
>>>>
>>>
>>>
>>
>>
>
>

Re: Merging small files

Posted by "Adaryl \"Bob\" Wakefield, MBA" <ad...@hotmail.com>.
Yeah, I’m sorry, I’m not talking about processing the files in Oracle. I mean collect/store the invoices in Oracle, then flush them in a batch to Hadoop. This is not real time, right? So you take your EDI, CSV and XML from their sources. Store them in Oracle. Once you have a decent size, flush them to Hadoop in one big file, process them, then store the results of the processing in Oracle.

Source file –> Oracle –> Hadoop –> Oracle

Adaryl "Bob" Wakefield, MBA
Principal
Mass Street Analytics
913.938.6685
www.linkedin.com/in/bobwakefieldmba

From: Shashidhar Rao 
Sent: Sunday, July 20, 2014 12:47 PM
To: user@hadoop.apache.org 
Subject: Re: Merging small files

Spring batch is used to process the files which come in EDI ,CSV & XML format and store it into Oracle after processing, but this is for a very small division. Imagine invoices generated  roughly  by 5 million customers every week from  all stores plus from online purchases. Time to process such massive data would be not acceptable even though Oracle would be a good choice as Adaryl Bob has suggested. Each invoice is not even 10 k and we have no choice but to use Hadoop, but need further processing of input files just to make hadoop happy .




On Sun, Jul 20, 2014 at 10:07 PM, Adaryl "Bob" Wakefield, MBA <ad...@hotmail.com> wrote:

  “Even if we kept the discussion to the mailing list's technical Hadoop usage focus, any company/organization looking to use a distro is going to have to consider the costs, support, platform, partner ecosystem, market share, company strategy, etc.”

  Yeah good point.

  Adaryl "Bob" Wakefield, MBA
  Principal
  Mass Street Analytics
  913.938.6685
  www.linkedin.com/in/bobwakefieldmba

  From: Shahab Yunus 
  Sent: Sunday, July 20, 2014 11:32 AM
  To: user@hadoop.apache.org 
  Subject: Re: Merging small files

  Why it isn't appropriate to discuss too much vendor specific topics on a vendor-neutral apache mailing list? Checkout this thread: 
  http://mail-archives.apache.org/mod_mbox/hadoop-mapreduce-user/201309.mbox/%3CCAJ1NbZcoCw1RSNCf3H-ikjKK4uqxQXT7avsJ-6NahQ_e4dXYGA@mail.gmail.com%3E


  You can always discuss vendor specific issues in their respective mailing lists.

  As for merging files, Yes one can use HBase but then you have to keep in mind that you are adding overhead of development and maintenance of a another store (i.e. HBase). If your use case could be satisfied with HDFS alone then why not keep it simple? And given the knowledge of the requirements that the OP provided, I think Sequence File format should work as I suggested initially. Of course, if things get too complicated from requirements perspective then one might try out HBase.

  Regards,
  Shahab



  On Sun, Jul 20, 2014 at 12:24 PM, Adaryl "Bob" Wakefield, MBA <ad...@hotmail.com> wrote:

    It isn’t? I don’t wanna hijack the thread or anything but it seems to me that MapR is an implementation of Hadoop and this is a great place to discuss it’s merits vis a vis the Hortonworks or Cloudera offering. 

    A little bit more on topic: Every single thing I read or watch about Hadoop says that many small files is a bad idea and that you should merge them into larger files. I’ll take this a step further. If your invoice data is so small, perhaps Hadoop isn’t the proper solution to whatever it is you are trying to do and a more traditional RDBMS approach would be more appropriate. Someone suggested HBase and I was going to suggest maybe one of the other NoSQL databases, however, I remember that Eddie Satterly of Splunk says that financial data is the ONE use case where a traditional approach is more appropriate. You can watch his talk here:

    https://www.youtube.com/watch?v=-N9i-YXoQBE&index=77&list=WL

    Adaryl "Bob" Wakefield, MBA
    Principal
    Mass Street Analytics
    913.938.6685
    www.linkedin.com/in/bobwakefieldmba

    From: Kilaru, Sambaiah 
    Sent: Sunday, July 20, 2014 3:47 AM
    To: user@hadoop.apache.org 
    Subject: Re: Merging small files

    This is not place to discuss merits or demerits of MapR, Small files screw up very badly with Mapr.
    Small files go into one container (to fill up 256MB or what ever container size) and with locality most
    Of the mappers go to three datanodes.

    You should be looking into sequence file format.

    Thanks,
    Sam

    From: "M. C. Srivas" <mc...@gmail.com>
    Reply-To: "user@hadoop.apache.org" <us...@hadoop.apache.org>
    Date: Sunday, July 20, 2014 at 8:01 AM
    To: "user@hadoop.apache.org" <us...@hadoop.apache.org>
    Subject: Re: Merging small files


    You should look at MapR .... a few 100's of billions of small files is absolutely no problem. (disc: I work for MapR)



    On Sat, Jul 19, 2014 at 10:29 AM, Shashidhar Rao <ra...@gmail.com> wrote:

      Hi ,


      Has anybody worked in retail use case. If my production Hadoop cluster block size is 256 MB but generally if we have to process retail invoice data , each invoice data is merely let's say 4 KB . Do we merge the invoice data to make one large file say 1 GB . What is the best practice in this scenario



      Regards

      Shashi
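
A quick illustration of the "flush them to Hadoop in one big file" step Bob
describes above. Nothing below comes from the thread itself: the paths are
made up, it assumes the staged invoice files already sit in an HDFS staging
directory, and it uses FileUtil.copyMerge, which is available in Hadoop
1.x/2.x clients (later releases dropped it, as far as I recall). It simply
concatenates every file in the staging directory into one large HDFS file:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.FileUtil;
import org.apache.hadoop.fs.Path;

public class FlushInvoicesToHadoop {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    // Directory holding the small invoice files staged for this batch
    Path staging = new Path("/staging/invoices/2014-07-20");
    // Single large file the whole batch is merged into
    Path merged = new Path("/warehouse/invoices/invoices-2014-07-20.dat");

    // Concatenate every file under 'staging' into 'merged'.
    // 'true' deletes the staged files once the merge succeeds;
    // "\n" is written between the individual files.
    FileUtil.copyMerge(fs, staging, fs, merged, true, conf, "\n");
  }
}

The catch is that copyMerge produces a plain byte concatenation, so the
per-invoice boundaries are lost unless the format is self-delimiting; that
is where the SequenceFile suggestion elsewhere in the thread is the better
fit.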




Re: Merging small files

Posted by Shashidhar Rao <ra...@gmail.com>.
Spring Batch is used to process the files, which come in EDI, CSV & XML
format, and store them in Oracle after processing, but this is for a very
small division. Imagine invoices generated roughly by 5 million customers
every week from all stores plus online purchases. The time to process such
massive data would not be acceptable, even though Oracle would be a good
choice as Adaryl Bob has suggested. Each invoice is not even 10 KB, and we
have no choice but to use Hadoop, but we need further processing of the
input files just to make Hadoop happy.


On Sun, Jul 20, 2014 at 10:07 PM, Adaryl "Bob" Wakefield, MBA <
adaryl.wakefield@hotmail.com> wrote:

>   “Even if we kept the discussion to the mailing list's technical Hadoop
> usage focus, any company/organization looking to use a distro is going to
> have to consider the costs, support, platform, partner ecosystem, market
> share, company strategy, etc.”
>
> Yeah good point.
>
> Adaryl "Bob" Wakefield, MBA
> Principal
> Mass Street Analytics
> 913.938.6685
> www.linkedin.com/in/bobwakefieldmba
>
>  *From:* Shahab Yunus <sh...@gmail.com>
> *Sent:* Sunday, July 20, 2014 11:32 AM
> *To:* user@hadoop.apache.org
> *Subject:* Re: Merging small files
>
>  Why it isn't appropriate to discuss too much vendor specific topics on a
> vendor-neutral apache mailing list? Checkout this thread:
>
> http://mail-archives.apache.org/mod_mbox/hadoop-mapreduce-user/201309.mbox/%3CCAJ1NbZcoCw1RSNCf3H-ikjKK4uqxQXT7avsJ-6NahQ_e4dXYGA@mail.gmail.com%3E
>
> You can always discuss vendor specific issues in their respective mailing
> lists.
>
> As for merging files, Yes one can use HBase but then you have to keep in
> mind that you are adding overhead of development and maintenance of a
> another store (i.e. HBase). If your use case could be satisfied with HDFS
> alone then why not keep it simple? And given the knowledge of the
> requirements that the OP provided, I think Sequence File format should work
> as I suggested initially. Of course, if things get too complicated from
> requirements perspective then one might try out HBase.
>
> Regards,
> Shahab
>
>
> On Sun, Jul 20, 2014 at 12:24 PM, Adaryl "Bob" Wakefield, MBA <
> adaryl.wakefield@hotmail.com> wrote:
>
>>   It isn’t? I don’t wanna hijack the thread or anything but it seems to
>> me that MapR is an implementation of Hadoop and this is a great place to
>> discuss it’s merits vis a vis the Hortonworks or Cloudera offering.
>>
>> A little bit more on topic: Every single thing I read or watch about
>> Hadoop says that many small files is a bad idea and that you should merge
>> them into larger files. I’ll take this a step further. If your invoice data
>> is so small, perhaps Hadoop isn’t the proper solution to whatever it is you
>> are trying to do and a more traditional RDBMS approach would be more
>> appropriate. Someone suggested HBase and I was going to suggest maybe one
>> of the other NoSQL databases, however, I remember that Eddie Satterly of
>> Splunk says that financial data is the ONE use case where a traditional
>> approach is more appropriate. You can watch his talk here:
>>
>> https://www.youtube.com/watch?v=-N9i-YXoQBE&index=77&list=WL
>>
>> Adaryl "Bob" Wakefield, MBA
>> Principal
>> Mass Street Analytics
>> 913.938.6685
>> www.linkedin.com/in/bobwakefieldmba
>>
>>  *From:* Kilaru, Sambaiah <Sa...@intuit.com>
>> *Sent:* Sunday, July 20, 2014 3:47 AM
>> *To:* user@hadoop.apache.org
>> *Subject:* Re: Merging small files
>>
>>  This is not place to discuss merits or demerits of MapR, Small files
>> screw up very badly with Mapr.
>> Small files go into one container (to fill up 256MB or what ever
>> container size) and with locality most
>> Of the mappers go to three datanodes.
>>
>> You should be looking into sequence file format.
>>
>> Thanks,
>> Sam
>>
>> From: "M. C. Srivas" <mc...@gmail.com>
>> Reply-To: "user@hadoop.apache.org" <us...@hadoop.apache.org>
>> Date: Sunday, July 20, 2014 at 8:01 AM
>> To: "user@hadoop.apache.org" <us...@hadoop.apache.org>
>> Subject: Re: Merging small files
>>
>>  You should look at MapR .... a few 100's of billions of small files is
>> absolutely no problem. (disc: I work for MapR)
>>
>>
>> On Sat, Jul 19, 2014 at 10:29 AM, Shashidhar Rao <
>> raoshashidhar123@gmail.com> wrote:
>>
>>>   Hi ,
>>>
>>> Has anybody worked in retail use case. If my production Hadoop cluster
>>> block size is 256 MB but generally if we have to process retail invoice
>>> data , each invoice data is merely let's say 4 KB . Do we merge the invoice
>>> data to make one large file say 1 GB . What is the best practice in this
>>> scenario
>>>
>>>
>>> Regards
>>> Shashi
>>>
>>
>>
>
>
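
Sam and Shahab's SequenceFile suggestion is probably the simplest form of
the "further processing of input files" mentioned above: pack each small
invoice into one key/value pair of a large, splittable SequenceFile. The
sketch below is illustrative only, not code posted in the thread; it
assumes a Hadoop 2.x client, a local staging directory holding one batch of
invoices, the file name as key, and the raw EDI/CSV/XML bytes as value:

import java.io.File;
import java.nio.file.Files;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.SequenceFile.CompressionType;
import org.apache.hadoop.io.Text;

public class PackInvoices {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Path out = new Path("/warehouse/invoices/invoices-2014-07-20.seq");

    SequenceFile.Writer writer = SequenceFile.createWriter(conf,
        SequenceFile.Writer.file(out),
        SequenceFile.Writer.keyClass(Text.class),
        SequenceFile.Writer.valueClass(BytesWritable.class),
        SequenceFile.Writer.compression(CompressionType.BLOCK));
    try {
      // One key/value pair per small invoice file:
      // key = invoice file name, value = raw invoice bytes (EDI/CSV/XML).
      for (File invoice : new File("/data/staging/invoices").listFiles()) {
        byte[] bytes = Files.readAllBytes(invoice.toPath());
        writer.append(new Text(invoice.getName()), new BytesWritable(bytes));
      }
    } finally {
      IOUtils.closeStream(writer);
    }
  }
}

A downstream MapReduce job can then read the packed file with
SequenceFileInputFormat and get a handful of well-sized splits instead of
one tiny map task per 4 KB invoice.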

Data cleansing in modern data architecture

Posted by "Adaryl \"Bob\" Wakefield, MBA" <ad...@hotmail.com>.
In the old world, data cleaning used to be a large part of the data warehouse load. Now that we’re working in a schemaless environment, I’m not sure where data cleansing is supposed to take place. NoSQL sounds fun because theoretically you just drop everything in, but the transactional systems that generate the data are still full of bugs and create junk data.

My question is, where does data cleaning/master data management/CDI belong in a modern data architecture? Before it hits Hadoop? After?

B.

Re: Merging small files

Posted by "Adaryl \"Bob\" Wakefield, MBA" <ad...@hotmail.com>.
“Even if we kept the discussion to the mailing list's technical Hadoop usage focus, any company/organization looking to use a distro is going to have to consider the costs, support, platform, partner ecosystem, market share, company strategy, etc.”

Yeah good point.

Adaryl "Bob" Wakefield, MBA
Principal
Mass Street Analytics
913.938.6685
www.linkedin.com/in/bobwakefieldmba

From: Shahab Yunus 
Sent: Sunday, July 20, 2014 11:32 AM
To: user@hadoop.apache.org 
Subject: Re: Merging small files

Why it isn't appropriate to discuss too much vendor specific topics on a vendor-neutral apache mailing list? Checkout this thread: 
http://mail-archives.apache.org/mod_mbox/hadoop-mapreduce-user/201309.mbox/%3CCAJ1NbZcoCw1RSNCf3H-ikjKK4uqxQXT7avsJ-6NahQ_e4dXYGA@mail.gmail.com%3E


You can always discuss vendor specific issues in their respective mailing lists.

As for merging files, Yes one can use HBase but then you have to keep in mind that you are adding overhead of development and maintenance of a another store (i.e. HBase). If your use case could be satisfied with HDFS alone then why not keep it simple? And given the knowledge of the requirements that the OP provided, I think Sequence File format should work as I suggested initially. Of course, if things get too complicated from requirements perspective then one might try out HBase.

Regards,
Shahab



On Sun, Jul 20, 2014 at 12:24 PM, Adaryl "Bob" Wakefield, MBA <ad...@hotmail.com> wrote:

  It isn’t? I don’t wanna hijack the thread or anything but it seems to me that MapR is an implementation of Hadoop and this is a great place to discuss it’s merits vis a vis the Hortonworks or Cloudera offering. 

  A little bit more on topic: Every single thing I read or watch about Hadoop says that many small files is a bad idea and that you should merge them into larger files. I’ll take this a step further. If your invoice data is so small, perhaps Hadoop isn’t the proper solution to whatever it is you are trying to do and a more traditional RDBMS approach would be more appropriate. Someone suggested HBase and I was going to suggest maybe one of the other NoSQL databases, however, I remember that Eddie Satterly of Splunk says that financial data is the ONE use case where a traditional approach is more appropriate. You can watch his talk here:

  https://www.youtube.com/watch?v=-N9i-YXoQBE&index=77&list=WL

  Adaryl "Bob" Wakefield, MBA
  Principal
  Mass Street Analytics
  913.938.6685
  www.linkedin.com/in/bobwakefieldmba

  From: Kilaru, Sambaiah 
  Sent: Sunday, July 20, 2014 3:47 AM
  To: user@hadoop.apache.org 
  Subject: Re: Merging small files

  This is not place to discuss merits or demerits of MapR, Small files screw up very badly with Mapr.
  Small files go into one container (to fill up 256MB or what ever container size) and with locality most
  Of the mappers go to three datanodes.

  You should be looking into sequence file format.

  Thanks,
  Sam

  From: "M. C. Srivas" <mc...@gmail.com>
  Reply-To: "user@hadoop.apache.org" <us...@hadoop.apache.org>
  Date: Sunday, July 20, 2014 at 8:01 AM
  To: "user@hadoop.apache.org" <us...@hadoop.apache.org>
  Subject: Re: Merging small files


  You should look at MapR .... a few 100's of billions of small files is absolutely no problem. (disc: I work for MapR)



  On Sat, Jul 19, 2014 at 10:29 AM, Shashidhar Rao <ra...@gmail.com> wrote:

    Hi ,


    Has anybody worked in retail use case. If my production Hadoop cluster block size is 256 MB but generally if we have to process retail invoice data , each invoice data is merely let's say 4 KB . Do we merge the invoice data to make one large file say 1 GB . What is the best practice in this scenario



    Regards

    Shashi
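
For completeness, the HBase alternative weighed above would store each 4 KB
invoice as a single cell keyed by invoice id, sidestepping small HDFS files
entirely. The sketch below is hypothetical: the table name, column family,
and row-key scheme are invented, and it assumes an HBase 1.x-or-later
client API:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class StoreInvoice {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    try (Connection connection = ConnectionFactory.createConnection(conf);
         Table invoices = connection.getTable(TableName.valueOf("invoices"))) {

      // Row key: store id + invoice id keeps one store's invoices together.
      byte[] rowKey = Bytes.toBytes("store0042-inv000123456");
      byte[] invoiceXml = Bytes.toBytes("<invoice>...</invoice>");

      Put put = new Put(rowKey);
      put.addColumn(Bytes.toBytes("d"), Bytes.toBytes("raw"), invoiceXml);
      invoices.put(put);
    }
  }
}

Shahab's trade-off still applies, though: this adds another system to
develop against and operate, so if the invoices only need batch analytics,
a plain SequenceFile on HDFS remains the simpler choice.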



Re: Merging small files

Posted by "Adaryl \"Bob\" Wakefield, MBA" <ad...@hotmail.com>.
“Even if we kept the discussion to the mailing list's technical Hadoop usage focus, any company/organization looking to use a distro is going to have to consider the costs, support, platform, partner ecosystem, market share, company strategy, etc.”

Yeah good point.

Adaryl "Bob" Wakefield, MBA
Principal
Mass Street Analytics
913.938.6685
www.linkedin.com/in/bobwakefieldmba

From: Shahab Yunus 
Sent: Sunday, July 20, 2014 11:32 AM
To: user@hadoop.apache.org 
Subject: Re: Merging small files

Why it isn't appropriate to discuss too much vendor specific topics on a vendor-neutral apache mailing list? Checkout this thread: 
http://mail-archives.apache.org/mod_mbox/hadoop-mapreduce-user/201309.mbox/%3CCAJ1NbZcoCw1RSNCf3H-ikjKK4uqxQXT7avsJ-6NahQ_e4dXYGA@mail.gmail.com%3E


You can always discuss vendor specific issues in their respective mailing lists.

As for merging files, Yes one can use HBase but then you have to keep in mind that you are adding overhead of development and maintenance of a another store (i.e. HBase). If your use case could be satisfied with HDFS alone then why not keep it simple? And given the knowledge of the requirements that the OP provided, I think Sequence File format should work as I suggested initially. Of course, if things get too complicated from requirements perspective then one might try out HBase.

Regards,
Shahab



On Sun, Jul 20, 2014 at 12:24 PM, Adaryl "Bob" Wakefield, MBA <ad...@hotmail.com> wrote:

  It isn’t? I don’t wanna hijack the thread or anything but it seems to me that MapR is an implementation of Hadoop and this is a great place to discuss it’s merits vis a vis the Hortonworks or Cloudera offering. 

  A little bit more on topic: Every single thing I read or watch about Hadoop says that many small files is a bad idea and that you should merge them into larger files. I’ll take this a step further. If your invoice data is so small, perhaps Hadoop isn’t the proper solution to whatever it is you are trying to do and a more traditional RDBMS approach would be more appropriate. Someone suggested HBase and I was going to suggest maybe one of the other NoSQL databases, however, I remember that Eddie Satterly of Splunk says that financial data is the ONE use case where a traditional approach is more appropriate. You can watch his talk here:

  https://www.youtube.com/watch?v=-N9i-YXoQBE&index=77&list=WL

  Adaryl "Bob" Wakefield, MBA
  Principal
  Mass Street Analytics
  913.938.6685
  www.linkedin.com/in/bobwakefieldmba

  From: Kilaru, Sambaiah 
  Sent: Sunday, July 20, 2014 3:47 AM
  To: user@hadoop.apache.org 
  Subject: Re: Merging small files

  This is not place to discuss merits or demerits of MapR, Small files screw up very badly with Mapr.
  Small files go into one container (to fill up 256MB or what ever container size) and with locality most
  Of the mappers go to three datanodes.

  You should be looking into sequence file format.

  Thanks,
  Sam

  From: "M. C. Srivas" <mc...@gmail.com>
  Reply-To: "user@hadoop.apache.org" <us...@hadoop.apache.org>
  Date: Sunday, July 20, 2014 at 8:01 AM
  To: "user@hadoop.apache.org" <us...@hadoop.apache.org>
  Subject: Re: Merging small files


  You should look at MapR .... a few 100's of billions of small files is absolutely no problem. (disc: I work for MapR)



  On Sat, Jul 19, 2014 at 10:29 AM, Shashidhar Rao <ra...@gmail.com> wrote:

    Hi ,


    Has anybody worked in retail use case. If my production Hadoop cluster block size is 256 MB but generally if we have to process retail invoice data , each invoice data is merely let's say 4 KB . Do we merge the invoice data to make one large file say 1 GB . What is the best practice in this scenario



    Regards

    Shashi



Re: Merging small files

Posted by Shahab Yunus <sh...@gmail.com>.
As for why it isn't appropriate to discuss vendor-specific topics too much on a
vendor-neutral Apache mailing list, check out this thread:
http://mail-archives.apache.org/mod_mbox/hadoop-mapreduce-user/201309.mbox/%3CCAJ1NbZcoCw1RSNCf3H-ikjKK4uqxQXT7avsJ-6NahQ_e4dXYGA@mail.gmail.com%3E

You can always discuss vendor specific issues in their respective mailing
lists.

As for merging files: yes, one can use HBase, but then you have to keep in
mind that you are adding the overhead of developing and maintaining
another store (i.e. HBase). If your use case can be satisfied with HDFS
alone, then why not keep it simple? And given the requirements the OP has
provided, I think the SequenceFile format should work, as I suggested
initially. Of course, if things get too complicated from a requirements
perspective, then one might try out HBase.
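
A minimal sketch of that HDFS-only route, assuming the invoices have already
been packed as Text key/value records into a SequenceFile (stock Hadoop 2.x
MapReduce API; the paths and the empty mapper body are placeholders):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
    import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

    public class InvoiceJob {
      // Each record is one invoice: key = invoice id, value = raw invoice text.
      public static class InvoiceMapper extends Mapper<Text, Text, Text, Text> {
        @Override
        protected void map(Text invoiceId, Text invoice, Context ctx)
            throws java.io.IOException, InterruptedException {
          // ... parse the ~4 KB invoice here (placeholder: pass it through) ...
          ctx.write(invoiceId, invoice);
        }
      }

      public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "process-invoices");
        job.setJarByClass(InvoiceJob.class);
        job.setMapperClass(InvoiceMapper.class);
        job.setNumReduceTasks(0);                         // map-only sketch
        job.setInputFormatClass(SequenceFileInputFormat.class);
        job.setOutputFormatClass(TextOutputFormat.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        FileInputFormat.addInputPath(job, new Path("/data/invoices/merged"));  // placeholder path
        FileOutputFormat.setOutputPath(job, new Path("/data/invoices/out"));   // placeholder path
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }

Each map call then sees one invoice record rather than one 4 KB file, and the
NameNode only has to track the big merged files.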

Regards,
Shahab


On Sun, Jul 20, 2014 at 12:24 PM, Adaryl "Bob" Wakefield, MBA <
adaryl.wakefield@hotmail.com> wrote:

>   It isn’t? I don’t wanna hijack the thread or anything but it seems to
> me that MapR is an implementation of Hadoop and this is a great place to
> discuss it’s merits vis a vis the Hortonworks or Cloudera offering.
>
> A little bit more on topic: Every single thing I read or watch about
> Hadoop says that many small files is a bad idea and that you should merge
> them into larger files. I’ll take this a step further. If your invoice data
> is so small, perhaps Hadoop isn’t the proper solution to whatever it is you
> are trying to do and a more traditional RDBMS approach would be more
> appropriate. Someone suggested HBase and I was going to suggest maybe one
> of the other NoSQL databases, however, I remember that Eddie Satterly of
> Splunk says that financial data is the ONE use case where a traditional
> approach is more appropriate. You can watch his talk here:
>
> https://www.youtube.com/watch?v=-N9i-YXoQBE&index=77&list=WL
>
> Adaryl "Bob" Wakefield, MBA
> Principal
> Mass Street Analytics
> 913.938.6685
> www.linkedin.com/in/bobwakefieldmba
>
>  *From:* Kilaru, Sambaiah <Sa...@intuit.com>
> *Sent:* Sunday, July 20, 2014 3:47 AM
> *To:* user@hadoop.apache.org
> *Subject:* Re: Merging small files
>
>  This is not place to discuss merits or demerits of MapR, Small files
> screw up very badly with Mapr.
> Small files go into one container (to fill up 256MB or what ever container
> size) and with locality most
> Of the mappers go to three datanodes.
>
> You should be looking into sequence file format.
>
> Thanks,
> Sam
>
> From: "M. C. Srivas" <mc...@gmail.com>
> Reply-To: "user@hadoop.apache.org" <us...@hadoop.apache.org>
> Date: Sunday, July 20, 2014 at 8:01 AM
> To: "user@hadoop.apache.org" <us...@hadoop.apache.org>
> Subject: Re: Merging small files
>
>  You should look at MapR .... a few 100's of billions of small files is
> absolutely no problem. (disc: I work for MapR)
>
>
> On Sat, Jul 19, 2014 at 10:29 AM, Shashidhar Rao <
> raoshashidhar123@gmail.com> wrote:
>
>>   Hi ,
>>
>> Has anybody worked in retail use case. If my production Hadoop cluster
>> block size is 256 MB but generally if we have to process retail invoice
>> data , each invoice data is merely let's say 4 KB . Do we merge the invoice
>> data to make one large file say 1 GB . What is the best practice in this
>> scenario
>>
>>
>> Regards
>> Shashi
>>
>
>

Re: Merging small files

Posted by "Kilaru, Sambaiah" <Sa...@intuit.com>.
I had experience with MapR where small files are much worse. Agreed, MapR can keep (only keep) small files better, but storing is not the answer.
What happens when you want to run the job?
A container stores the files and the container gets replicated, which means one container (of 256MB or 128MB or whatever size it is configured to) is
replicated. The moment you start an M/R job (and don't use CombineFileInputFormat) you are actually launching tasks on only the three nodes holding
that container, due to data locality.

Small files are fine for storing, but they are bad with Hadoop and worse with MapR when you want to run a job.
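
If the small files do have to be processed as they are, the usual workaround on
plain Hadoop is along these lines; a rough sketch using CombineTextInputFormat
(the text flavour of CombineFileInputFormat shipped with Hadoop 2.x), with the
256 MB target and the paths purely illustrative:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.CombineTextInputFormat;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class CombineSmallFilesJob {
      public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "combine-small-invoices");
        job.setJarByClass(CombineSmallFilesJob.class);

        // Bundle many small files into each split instead of one mapper per file.
        job.setInputFormatClass(CombineTextInputFormat.class);
        CombineTextInputFormat.setMaxInputSplitSize(job, 256L * 1024 * 1024);

        FileInputFormat.addInputPath(job, new Path("/data/invoices/raw"));    // placeholder path
        FileOutputFormat.setOutputPath(job, new Path("/data/invoices/out"));  // placeholder path
        // set mapper/reducer classes here as needed; the defaults just pass records through
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }

Each split then packs small files up to the configured size, so the job launches
far fewer map tasks, though it does not remove the NameNode metadata cost.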


Sam

From: MBA <ad...@hotmail.com>>
Reply-To: "user@hadoop.apache.org<ma...@hadoop.apache.org>" <us...@hadoop.apache.org>>
Date: Sunday, July 20, 2014 at 9:54 PM
To: "user@hadoop.apache.org<ma...@hadoop.apache.org>" <us...@hadoop.apache.org>>
Subject: Re: Merging small files

It isn’t? I don’t wanna hijack the thread or anything but it seems to me that MapR is an implementation of Hadoop and this is a great place to discuss it’s merits vis a vis the Hortonworks or Cloudera offering.

A little bit more on topic: Every single thing I read or watch about Hadoop says that many small files is a bad idea and that you should merge them into larger files. I’ll take this a step further. If your invoice data is so small, perhaps Hadoop isn’t the proper solution to whatever it is you are trying to do and a more traditional RDBMS approach would be more appropriate. Someone suggested HBase and I was going to suggest maybe one of the other NoSQL databases, however, I remember that Eddie Satterly of Splunk says that financial data is the ONE use case where a traditional approach is more appropriate. You can watch his talk here:

https://www.youtube.com/watch?v=-N9i-YXoQBE&index=77&list=WL

Adaryl "Bob" Wakefield, MBA
Principal
Mass Street Analytics
913.938.6685
www.linkedin.com/in/bobwakefieldmba

From: Kilaru, Sambaiah<ma...@intuit.com>
Sent: Sunday, July 20, 2014 3:47 AM
To: user@hadoop.apache.org<ma...@hadoop.apache.org>
Subject: Re: Merging small files

This is not place to discuss merits or demerits of MapR, Small files screw up very badly with Mapr.
Small files go into one container (to fill up 256MB or what ever container size) and with locality most
Of the mappers go to three datanodes.

You should be looking into sequence file format.

Thanks,
Sam

From: "M. C. Srivas" <mc...@gmail.com>>
Reply-To: "user@hadoop.apache.org<ma...@hadoop.apache.org>" <us...@hadoop.apache.org>>
Date: Sunday, July 20, 2014 at 8:01 AM
To: "user@hadoop.apache.org<ma...@hadoop.apache.org>" <us...@hadoop.apache.org>>
Subject: Re: Merging small files

You should look at MapR .... a few 100's of billions of small files is absolutely no problem. (disc: I work for MapR)


On Sat, Jul 19, 2014 at 10:29 AM, Shashidhar Rao <ra...@gmail.com>> wrote:
Hi ,

Has anybody worked in retail use case. If my production Hadoop cluster block size is 256 MB but generally if we have to process retail invoice data , each invoice data is merely let's say 4 KB . Do we merge the invoice data to make one large file say 1 GB . What is the best practice in this scenario


Regards
Shashi


Re: Merging small files

Posted by "Adaryl \"Bob\" Wakefield, MBA" <ad...@hotmail.com>.
It isn’t? I don’t wanna hijack the thread or anything, but it seems to me that MapR is an implementation of Hadoop and this is a great place to discuss its merits vis-a-vis the Hortonworks or Cloudera offerings. 

A little bit more on topic: every single thing I read or watch about Hadoop says that many small files are a bad idea and that you should merge them into larger files. I’ll take this a step further. If your invoice data is so small, perhaps Hadoop isn’t the proper solution to whatever it is you are trying to do, and a more traditional RDBMS approach would be more appropriate. Someone suggested HBase and I was going to suggest maybe one of the other NoSQL databases; however, I remember that Eddie Satterly of Splunk says that financial data is the ONE use case where a traditional approach is more appropriate. You can watch his talk here:

https://www.youtube.com/watch?v=-N9i-YXoQBE&index=77&list=WL

Adaryl "Bob" Wakefield, MBA
Principal
Mass Street Analytics
913.938.6685
www.linkedin.com/in/bobwakefieldmba

From: Kilaru, Sambaiah 
Sent: Sunday, July 20, 2014 3:47 AM
To: user@hadoop.apache.org 
Subject: Re: Merging small files

This is not place to discuss merits or demerits of MapR, Small files screw up very badly with Mapr.
Small files go into one container (to fill up 256MB or what ever container size) and with locality most
Of the mappers go to three datanodes.

You should be looking into sequence file format.

Thanks,
Sam

From: "M. C. Srivas" <mc...@gmail.com>
Reply-To: "user@hadoop.apache.org" <us...@hadoop.apache.org>
Date: Sunday, July 20, 2014 at 8:01 AM
To: "user@hadoop.apache.org" <us...@hadoop.apache.org>
Subject: Re: Merging small files


You should look at MapR .... a few 100's of billions of small files is absolutely no problem. (disc: I work for MapR)



On Sat, Jul 19, 2014 at 10:29 AM, Shashidhar Rao <ra...@gmail.com> wrote:

  Hi ,


  Has anybody worked in retail use case. If my production Hadoop cluster block size is 256 MB but generally if we have to process retail invoice data , each invoice data is merely let's say 4 KB . Do we merge the invoice data to make one large file say 1 GB . What is the best practice in this scenario



  Regards

  Shashi


Re: Merging small files

Posted by "Kilaru, Sambaiah" <Sa...@intuit.com>.
This is not the place to discuss the merits or demerits of MapR; small files screw up very badly with MapR.
Small files go into one container (filling up 256MB or whatever the container size is), and with locality most
of the mappers go to just three datanodes.

You should be looking into the SequenceFile format.

Thanks,
Sam

From: "M. C. Srivas" <mc...@gmail.com>>
Reply-To: "user@hadoop.apache.org<ma...@hadoop.apache.org>" <us...@hadoop.apache.org>>
Date: Sunday, July 20, 2014 at 8:01 AM
To: "user@hadoop.apache.org<ma...@hadoop.apache.org>" <us...@hadoop.apache.org>>
Subject: Re: Merging small files

You should look at MapR .... a few 100's of billions of small files is absolutely no problem. (disc: I work for MapR)


On Sat, Jul 19, 2014 at 10:29 AM, Shashidhar Rao <ra...@gmail.com>> wrote:
Hi ,

Has anybody worked in retail use case. If my production Hadoop cluster block size is 256 MB but generally if we have to process retail invoice data , each invoice data is merely let's say 4 KB . Do we merge the invoice data to make one large file say 1 GB . What is the best practice in this scenario


Regards
Shashi


Re: Merging small files

Posted by "M. C. Srivas" <mc...@gmail.com>.
You should look at MapR .... a few hundred billion small files are
absolutely no problem. (disc: I work for MapR)


On Sat, Jul 19, 2014 at 10:29 AM, Shashidhar Rao <raoshashidhar123@gmail.com
> wrote:

> Hi ,
>
> Has anybody worked in retail use case. If my production Hadoop cluster
> block size is 256 MB but generally if we have to process retail invoice
> data , each invoice data is merely let's say 4 KB . Do we merge the invoice
> data to make one large file say 1 GB . What is the best practice in this
> scenario
>
>
> Regards
> Shashi
>

Re: Merging small files

Posted by Steven Zhuang <zh...@gmail.com>.
Maybe you should just try HBase.
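
If you go that route, the usual shape is one row per invoice, so the small
records end up inside HBase's own large store files rather than as individual
HDFS files. A rough sketch against the HBase client API of that era (0.9x);
the table name, column family and invoice id are made up:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.util.Bytes;

    public class InvoiceStore {
      public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "invoices");           // made-up table name
        try {
          // One row per invoice: row key = invoice id, one cell holds the ~4 KB body.
          Put put = new Put(Bytes.toBytes("INV-2014-000123")); // made-up invoice id
          put.add(Bytes.toBytes("d"), Bytes.toBytes("body"),
                  Bytes.toBytes("...raw invoice contents..."));
          table.put(put);
        } finally {
          table.close();
        }
      }
    }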


On Sat, Jul 19, 2014 at 10:57 AM, Shahab Yunus <sh...@gmail.com>
wrote:

> It is not advisable to have many small files in hdfs as it can put memory
> load on Namenode as it maintains the metadata, to highlight one major issue.
>
> On the top of my head, some basic ideas...You can either combine invoices
> into a bigger text file containing a collection of records where each
> record is an  invoices or even follow a sequence file format where the id
> could be the invoice id and value/record the invoice details.
>
> Regards,
> Shahab
> On Jul 19, 2014 1:30 PM, "Shashidhar Rao" <ra...@gmail.com>
> wrote:
>
>> Hi ,
>>
>> Has anybody worked in retail use case. If my production Hadoop cluster
>> block size is 256 MB but generally if we have to process retail invoice
>> data , each invoice data is merely let's say 4 KB . Do we merge the invoice
>> data to make one large file say 1 GB . What is the best practice in this
>> scenario
>>
>>
>> Regards
>> Shashi
>>
>


-- 
        best wishes.
                Steven

Re: Merging small files

Posted by Shahab Yunus <sh...@gmail.com>.
It is not advisable to have many small files in HDFS: to highlight one major
issue, it puts memory load on the NameNode, which maintains the metadata for
every file.

Off the top of my head, some basic ideas... You can either combine the invoices
into a bigger text file containing a collection of records, where each record
is an invoice, or use the SequenceFile format, where the key could be the
invoice id and the value/record the invoice details.
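
A rough sketch of that packing step, assuming the raw ~4 KB invoices already
sit in an HDFS directory and that plain Text keys and values are good enough;
the paths are placeholders:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IOUtils;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.Text;
    import java.io.ByteArrayOutputStream;
    import java.io.InputStream;

    public class InvoicePacker {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path in = new Path("/data/invoices/raw");                 // many ~4 KB files (placeholder)
        Path out = new Path("/data/invoices/merged/invoices.seq"); // placeholder

        try (SequenceFile.Writer writer = SequenceFile.createWriter(conf,
            SequenceFile.Writer.file(out),
            SequenceFile.Writer.keyClass(Text.class),
            SequenceFile.Writer.valueClass(Text.class))) {
          for (FileStatus st : fs.listStatus(in)) {
            if (st.isDirectory()) continue;
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            try (InputStream is = fs.open(st.getPath())) {
              IOUtils.copyBytes(is, buf, conf, false);
            }
            // key = file name used as the invoice id, value = whole invoice body
            writer.append(new Text(st.getPath().getName()), new Text(buf.toString("UTF-8")));
          }
        }
      }
    }

On the reading side, SequenceFileInputFormat then hands each map call one
(invoice id, invoice) pair instead of one tiny file.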

Regards,
Shahab
On Jul 19, 2014 1:30 PM, "Shashidhar Rao" <ra...@gmail.com>
wrote:

> Hi ,
>
> Has anybody worked in retail use case. If my production Hadoop cluster
> block size is 256 MB but generally if we have to process retail invoice
> data , each invoice data is merely let's say 4 KB . Do we merge the invoice
> data to make one large file say 1 GB . What is the best practice in this
> scenario
>
>
> Regards
> Shashi
>
