Posted to common-user@hadoop.apache.org by Kayla Jay <ka...@yahoo.com> on 2008/05/05 15:18:28 UTC

Query against different data types within HDFS using Map/Reduce

Has anyone come across this scenario? If so (or even if not), does anyone have any suggestions?

Suppose you store different types of data within HDFS: XML, plain text, binary, sequence files, etc.  You now want to run a query against ALL of the data stored in HDFS via a map/reduce job.  How do you do this when the inputs are of different types?
For example (the simplest case), you want to find all the terms/words matching a pattern, count them, and report where they occur within each data source.  Word count is close to what I mean, except that not all of the data is line-by-line text.  The terms/words could be inside XML, inside a sequence file, or in some other format stored in HDFS.  What if you want to find those terms/words across ALL of the data sets, which may not share the same format?

I understand that a Map/Reduce job specifies a specific input format upfront.  However, if you have different data formats within HDFS and you want to run the exact same query against all of them within one map/reduce job, how is this even possible?
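
(For reference, this is the sort of single-format job setup I mean, using the old org.apache.hadoop.mapred API; the paths come from the command line, and the mapper/reducer for the actual search are left out:)

// Illustration only: a job pinned to ONE input format (TextInputFormat),
// which says nothing about the XML, binary, or sequence-file data sitting
// in the same HDFS. Paths and job name are made up.
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.TextInputFormat;

public class SingleFormatJob {
  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(SingleFormatJob.class);
    conf.setJobName("term-search");

    // The whole job is tied to this one input format up front.
    conf.setInputFormat(TextInputFormat.class);

    // Mapper, reducer, and output types for the real term search omitted;
    // as written this is just an identity pass over line-oriented text.
    FileInputFormat.setInputPaths(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));

    JobClient.runJob(conf);
  }
}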

Can you even run a single query, in a single map/reduce job, against all the data across HDFS when it is stored in different formats?
If not, any suggestions on how to handle this?

Thanks.




Re: Query against different data types within HDFS using Map/Reduce

Posted by Jason Venner <ja...@attributor.com>.
We do this all the time.
In one case we have the mapper work out the input type by examining the 
input file name and the record data. We tend to do this for textual 
key<TAB>value records.
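
A rough sketch of that first approach, in the old org.apache.hadoop.mapred API: "map.input.file" is the property the framework fills in for file-based splits, and the ".xml" test and tab-separated fallback are only illustrative, not what we actually ship.

// Sketch: the mapper picks a parsing strategy per split based on the input
// file name. The ".xml" vs. key<TAB>value branches are illustrative.
import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class FormatAwareMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, IntWritable> {

  private static final IntWritable ONE = new IntWritable(1);
  private boolean inputIsXml;

  public void configure(JobConf job) {
    // Path of the file backing the current split, set by the framework.
    String inputFile = job.get("map.input.file", "");
    inputIsXml = inputFile.endsWith(".xml");
  }

  public void map(LongWritable key, Text value,
                  OutputCollector<Text, IntWritable> output, Reporter reporter)
      throws IOException {
    String line = value.toString();
    if (inputIsXml) {
      // Crude term extraction for XML-ish lines: drop the tags, keep the text.
      line = line.replaceAll("<[^>]*>", " ");
    } else {
      // Assume textual key<TAB>value records; count terms in the value part.
      int tab = line.indexOf('\t');
      if (tab >= 0) {
        line = line.substring(tab + 1);
      }
    }
    for (String term : line.split("\\s+")) {
      if (term.length() > 0) {
        output.collect(new Text(term), ONE);
      }
    }
  }
}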

In another case we have a container object that can hold any Writable, 
which we pass around. We do this for data with binary payloads that are 
too large to bother base64 encoding, or where we explicitly have to reduce 
multiple data types and can't readily tell what each record's type is.
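
For the container, the idea is the same one behind org.apache.hadoop.io.GenericWritable, which Hadoop ships; a minimal sketch, with the registered member types here purely illustrative:

// Minimal container sketch: subclass GenericWritable and register every
// concrete Writable the container may carry. The two types listed here are
// illustrative, not the ones we actually register.
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.GenericWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;

public class RecordContainer extends GenericWritable {

  @SuppressWarnings("unchecked")
  private static final Class<? extends Writable>[] TYPES =
      new Class[] { Text.class, BytesWritable.class };

  protected Class<? extends Writable>[] getTypes() {
    return TYPES;
  }
}

On the map side you wrap each record with set() and emit the container; in the reducer, get() hands back the underlying Writable, which you can test with instanceof.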



Ted Dunning wrote:
> You just have to write an input format adapted to read multiple kinds of input.
>
> It can key off the contents of the file or the name.  Depending on names is bad, but has a long lineage so people tend to deal with it reasonably well.
>
> It isn't very hard to write.
>
-- 
Jason Venner
Attributor - Program the Web <http://www.attributor.com/>
Attributor is hiring Hadoop Wranglers and coding wizards, contact if 
interested

RE: Query against different data types within HDFS using Map/Reduce

Posted by Ted Dunning <td...@veoh.com>.
You just have to write an input format adapted to read multiple kinds of input.

It can key off the contents of the file or the name.  Depending on names is bad, but has a long lineage so people tend to deal with it reasonably well.

It isn't very hard to write.
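
A sketch of what such an adapter can look like in the old org.apache.hadoop.mapred API, keying off the file name; the ".seq" test and the assumption that the sequence files hold <LongWritable, Text> records are illustrative, not a requirement.

// Sketch: one FileInputFormat whose getRecordReader picks a reader per file.
// Assumes line-oriented text/XML plus sequence files already storing
// <LongWritable, Text> pairs; other data needs its own branch.
import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileSplit;
import org.apache.hadoop.mapred.InputSplit;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.LineRecordReader;
import org.apache.hadoop.mapred.RecordReader;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.SequenceFileRecordReader;

public class MixedInputFormat extends FileInputFormat<LongWritable, Text> {

  public RecordReader<LongWritable, Text> getRecordReader(
      InputSplit split, JobConf job, Reporter reporter) throws IOException {
    FileSplit fileSplit = (FileSplit) split;
    String name = fileSplit.getPath().getName();

    if (name.endsWith(".seq")) {
      // Assumes the sequence files were written with LongWritable keys and
      // Text values; other pairs would need their own container or branch.
      return new SequenceFileRecordReader<LongWritable, Text>(job, fileSplit);
    }
    // Plain text and XML both come back a line at a time; the mapper decides
    // how to pull terms out of each line.
    return new LineRecordReader(job, fileSplit);
  }
}

The job then just sets this one input format and points at all of the directories. Keying off the contents instead of the name would mean opening the file and peeking at something like SequenceFile's leading "SEQ" bytes before picking the reader.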
