Posted to user@pig.apache.org by Harrison Cavallero <ha...@cavallero.me> on 2014/06/16 21:17:27 UTC

Multiple CSV Files

Hi everyone,

Thanks in advance to the Pig community; great tool! We're using Pig for a
project that essentially takes in a client-specific CSV file, filters out
the data we want, transforms it into the format we want, and then writes it
to a client-specific database (basic ETL). Currently we've implemented this
as a Pig script: we pass in the client database name along with the CSV
file location, and use a shell script to call the Pig script once for each
client CSV file we have (parsing the client name from the filename for the
database name param and passing that in as well).

Basically like this:
single_csv = LOAD 'file_name.csv' USING PigStorage(',') AS (fields);
-- filtered_set = FILTER single_csv BY ...;
STORE filtered_set INTO 'table' USING org.apache.pig.piggybank.storage.DBStorage(
    'driver', 'unique_client_db_info',
    'INSERT INTO table (columns) VALUES (?, ?)');
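
For context, the script is parameterized and the shell script invokes it
once per client, roughly like this (file, table, and field names here are
just illustrative):

-- the shell script runs, per client, something like:
--   pig -param INPUT=acme.csv -param DB_URL=jdbc:mysql://dbhost/acme_db etl.pig
single_csv = LOAD '$INPUT' USING PigStorage(',')
    AS (field1:chararray, field2:chararray);
filtered_set = FILTER single_csv BY field1 IS NOT NULL; -- stand-in for our real filter
STORE filtered_set INTO 'unused' -- DBStorage ignores the INTO location
    USING org.apache.pig.piggybank.storage.DBStorage(
        'com.mysql.jdbc.Driver', '$DB_URL',
        'INSERT INTO events (field1, field2) VALUES (?, ?)');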

My question is: is it possible/advisable to load all the CSV files at once
with something like:

all_csv = LOAD '*.csv' USING PigStorage(',')
    AS (client_name: nameParseFunc(), field1, field2);

Above, I'm thinking of somehow parsing the unique client filename for each
CSV and inserting the client name into each tuple stored, so it is properly
associated with the client. That way we could write each tuple to the
correct database, with the 'unique_client' data in the store query loaded
dynamically via ? for each tuple (see the sketch below).
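
To make that concrete, here's a sketch of what I have in mind, assuming a
Pig version where PigStorage supports the '-tagFile' option ('-tagsource'
in older releases), which prepends the source filename to each tuple; the
client and database names below are made up:

-- '-tagFile' emits the source filename as the first field of each tuple
all_csv = LOAD '*.csv' USING PigStorage(',', '-tagFile')
    AS (filename:chararray, field1:chararray, field2:chararray);

-- derive the client name from the filename, e.g. 'acme.csv' -> 'acme'
with_client = FOREACH all_csv GENERATE
    REGEX_EXTRACT(filename, '(.*)\\.csv', 1) AS client_name,
    field1, field2;

-- DBStorage's connection info is fixed when the storer is constructed, so
-- the ? placeholders can bind column values but not the target database;
-- it looks like we'd still need one STORE per client
SPLIT with_client INTO
    acme_rows IF client_name == 'acme',
    globex_rows IF client_name == 'globex';

acme_out = FOREACH acme_rows GENERATE field1, field2;
STORE acme_out INTO 'unused' USING org.apache.pig.piggybank.storage.DBStorage(
    'com.mysql.jdbc.Driver', 'jdbc:mysql://dbhost/acme_db', 'user', 'pass',
    'INSERT INTO events (field1, field2) VALUES (?, ?)');
-- ...and likewise for globex_rows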

Or is our current approach best practice: keep the Pig script light and do
the ETL on a per-CSV basis?

I hope this makes sense, and thanks in advance for any feedback,
suggestions, or criticisms of how we're going about this! Also, let me know
if this would be better brought to the IRC channel...

-- 
Harrison Cavallero

cavallero.me <http://cavallero.me>