Posted to mapreduce-user@hadoop.apache.org by "Kommareddi, Mahesh" <Ma...@e-hps.com> on 2016/03/22 21:09:42 UTC

Process exported files with sections defined in file

Hi All. I'm still fairly new to this and have a question about how to efficiently process a file containing a bulk database export, where each row record is followed by the detail records that belong to it. In other words, a row from table A is exported to the file, then the records from table B associated with that row's id are exported right after it. The process is repeated for each row in table A.

For example...
15 ident1(1) ident1(2) <--- defines identifying information for the successive "20" records
20 info1(ident1)(1) info1(ident1)(2) info1(ident1)(3) <--- record for this "15" type record
.
.
.
20 infoN(ident1)(1) infoN(ident1)(2) infoN(ident1)(3) <--- record for this "15" type record
.
.
.
15 ident2(1) ident2(2) <--- defines a new id for the next group of "20" type records
20 infoX(ident2)(1) infoX(ident2)(2) infoX(ident2)(3) <--- record for this "15" type record
20 infoX+1(ident2)(1) infoX+1(ident2)(2) infoX+1(ident2)(3) <--- record for this "15" type record
.
.
.

This continues until the next "15" type record appears: each "15" record is followed by an arbitrary number of "20" records, then another "15" record appears with more "20" records for that new id, and so on ad infinitum.

I was hoping to run various kinds of map-reduce jobs on this data. For example, I want to find the maximum info value in each column within each "15" section. Is there a good way to handle that?
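
To make it concrete, here is a rough sketch of the kind of mapper/reducer I have in mind (the class names are made up, and I'm assuming the info values are numeric). The mapper keys every "20" record by the most recent "15" header it has seen, which of course only works if a "15" header and its "20" records never end up in different input splits:

import java.io.IOException;
import java.util.Arrays;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class SectionMax {

    // Keys every "20" record by the ident fields of the most recent "15" header.
    public static class SectionMapper extends Mapper<LongWritable, Text, Text, Text> {
        private String currentSection = null;

        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            String[] fields = line.toString().trim().split("\\s+");
            if (fields.length < 2) {
                return; // blank or malformed line
            }
            if ("15".equals(fields[0])) {
                // New section header: its ident fields become the group key.
                currentSection = String.join("_", Arrays.copyOfRange(fields, 1, fields.length));
            } else if ("20".equals(fields[0]) && currentSection != null) {
                // Emit the info columns (everything after the "20") under the current section.
                context.write(new Text(currentSection),
                        new Text(String.join(" ", Arrays.copyOfRange(fields, 1, fields.length))));
            }
        }
    }

    // Computes the per-column maximum over all "20" records in one section.
    public static class SectionMaxReducer extends Reducer<Text, Text, Text, Text> {
        @Override
        protected void reduce(Text section, Iterable<Text> records, Context context)
                throws IOException, InterruptedException {
            double[] max = null;
            for (Text record : records) {
                String[] cols = record.toString().split("\\s+");
                if (max == null) {
                    max = new double[cols.length];
                    Arrays.fill(max, Double.NEGATIVE_INFINITY);
                }
                for (int i = 0; i < cols.length && i < max.length; i++) {
                    max[i] = Math.max(max[i], Double.parseDouble(cols[i]));
                }
            }
            if (max != null) {
                context.write(section, new Text(Arrays.toString(max)));
            }
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "section max");
        job.setJarByClass(SectionMax.class);
        job.setMapperClass(SectionMapper.class);
        job.setReducerClass(SectionMaxReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

The per-section grouping in the mapper is the part I'm unsure about, since it depends entirely on how the file gets split.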

I was hoping I wouldn't have to split the file myself… These files get to be 22GB each.
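
The only workaround I've come up with for the split problem is to keep each file in a single split by overriding isSplitable, roughly like this (again, the class name is made up):

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

// Keeps each input file in one split, so a "15" header and its "20" records
// are always read by the same mapper. The trade-off is one mapper per 22GB file.
public class WholeFileTextInputFormat extends TextInputFormat {
    @Override
    protected boolean isSplitable(JobContext context, Path file) {
        return false;
    }
}

with job.setInputFormatClass(WholeFileTextInputFormat.class); in the driver. But forcing a single mapper to stream an entire 22GB file seems to defeat the purpose, so I'm hoping there is a better option.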


I thought a strategy similar to the one used for processing XML files might be useful, but I don't think it applies here.

I would appreciate any help and insight.

Best Regards,
Mahesh