Posted to user@pig.apache.org by Johannes Schwenk <jo...@adition.com> on 2012/06/15 14:13:14 UTC

Mixed input formats in LOAD path

Hi all,

Is it possible to have an input path (as the parameter to a LOAD statement)
that contains several files in *different formats*, say serialized Avro
data and tab-separated values, and have Pig read the data into one alias?
I guess I have to write a UDF for this? How should I start? Can you
sketch out a rough plan for how to proceed?


Greetings,
Johannes Schwenk

-- 
Software Developer (Reporting)
________________________________________________________

ADITION technologies AG
Schwarzwaldstraße 78b
79117 Freiburg

http://www.adition.com

T +49 / (0)761 / 88147 - 30
F +49 / (0)761 / 88147 - 77
SUPPORT +49  / (0)1805 - ADITION

(Landline rate 14 ct/min; mobile rates at most 42 ct/min)

Registered with the Amtsgericht Düsseldorf under HRB 54076
Executive Board: Andreas Kleiser, Jörg Klekamp, Tihomir Perkovic, Marcus Schlüter
Chairman of the Supervisory Board: Daniel Raimer, Attorney at Law
VAT ID No.: DE 218 858 434


Re: Mixed input formats in LOAD path

Posted by Johannes Schwenk <jo...@adition.com>.
Well, I don't consider this data format migration strategy to be a
hack. The only thing that is somewhat "hacky" and definitely not elegant
is the creation of empty files for each known format by the logger!

Do you have any advice on how to design our Pig scripts so that they
account for migration situations like the one described in my earlier mail?

Thanks,
Johannes

On 15.06.2012 15:55, Ruslan Al-Fakikh wrote:
> Hey,
> 
> You can keep a single empty file per format. That way Pig won't fail.
> But basically I recommend avoiding situations that need hacks or
> custom formats. In my experience, you'll soon get into trouble
> with that.
> 
> Thanks


Re: Mixed input formats in LOAD path

Posted by Ruslan Al-Fakikh <ru...@jalent.ru>.
Hey,

You can keep a single empty file per format. That way Pig won't fail.
But basically I recommend avoiding situations that need hacks or
custom formats. In my experience, you'll soon get into trouble
with that.
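
The placeholder could also be created from the Pig script itself rather than by the logger. This is only a minimal sketch: the default directory and the placeholder name are hypothetical, and it covers the text case only, since the thread does not confirm whether an Avro loader tolerates a zero-byte file.

```pig
-- Hypothetical default; override with: pig -param LOG_PATH=... script.pig
%default LOG_PATH /data/logs

-- Create a zero-length placeholder so that the glob below always matches
-- at least one file. The name must not start with '_' or '.', because
-- Hadoop's file input formats skip such files as hidden.
fs -touchz $LOG_PATH/placeholder.txt

-- The empty file contributes no records to the relation.
txt = LOAD '$LOG_PATH/*.txt' USING PigStorage('\t');
DUMP txt;
```

Doing this inside the script keeps the workaround next to the LOAD that depends on it, but it also means the job needs write access to the log directory.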

Thanks

On Fri, Jun 15, 2012 at 5:39 PM, Johannes Schwenk
<jo...@adition.com> wrote:
> Thanks a lot, Ruslan, that seems like one possible direction!
>
> One thing remains to be resolved: I don't know whether the input will
> contain Avro, CSV, TSV, or all of them... So how can I get Pig not to
> choke on missing input files?
>
> Johannes



-- 
Best Regards,
Ruslan Al-Fakikh

Re: Mixed input formats in LOAD path

Posted by Johannes Schwenk <jo...@adition.com>.
Thanks a lot, Ruslan, that seems like one possible direction!

One thing remains to be resolved: I don't know whether the input will
contain Avro, CSV, TSV, or all of them... So how can I get Pig not to
choke on missing input files?

Johannes

On 15.06.2012 15:24, Ruslan Al-Fakikh wrote:
> I guess you could use globbing to select the files by extension,
> like this:
> $ ls
> input.avro  input.txt
> $ cat input.avro
> avro1
> avro2
> $ cat input.txt
> txt1
> txt2
> 
> [cloudera@localhost workpig]$ pig -x local
> 2012-06-15 17:21:09,613 [main] INFO  org.apache.pig.Main - Logging
> error messages to: /home/cloudera/workpig/pig_1339766469585.log
> 2012-06-15 17:21:09,892 [main] INFO
> org.apache.pig.backend.hadoop.executionengine.HExecutionEngine -
> Connecting to hadoop file system at: file:///
> grunt> txt = LOAD '*.txt';
> grunt> avro = LOAD '*.avro';
> grunt> result = UNION txt, avro;
> grunt> DUMP result;
> (txt1)
> (txt2)
> (avro1)
> (avro2)
> 
> Please note that the input.avro file in this example is not actually Avro
> data, just plain text. For real Avro files you'll need to specify an Avro
> loader in the LOAD statement.
> 
> Ruslan


Re: Mixed input formats in LOAD path

Posted by Ruslan Al-Fakikh <ru...@jalent.ru>.
I guess you could use globbing to select the files by extension,
like this:
$ ls
input.avro  input.txt
$ cat input.avro
avro1
avro2
$ cat input.txt
txt1
txt2

[cloudera@localhost workpig]$ pig -x local
2012-06-15 17:21:09,613 [main] INFO  org.apache.pig.Main - Logging error messages to: /home/cloudera/workpig/pig_1339766469585.log
2012-06-15 17:21:09,892 [main] INFO  org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to hadoop file system at: file:///
grunt> txt = LOAD '*.txt';
grunt> avro = LOAD '*.avro';
grunt> result = UNION txt, avro;
grunt> DUMP result;
(txt1)
(txt2)
(avro1)
(avro2)

Please note that the input.avro file in this example is not actually Avro
data, just plain text. For real Avro files you'll need to specify an Avro
loader in the LOAD statement.
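
With real Avro data, the second LOAD would name a loader explicitly. The sketch below rests on several assumptions: the piggybank AvroStorage class, hypothetical jar locations, a text file with a single column, and Avro records that carry a field with the same name. Whether a given version of that loader accepts a glob is worth verifying before relying on it.

```pig
-- Hypothetical jar locations; AvroStorage also needs its dependency jars
-- (Avro itself, among others) registered or on the classpath.
REGISTER /path/to/piggybank.jar;
REGISTER /path/to/avro.jar;

-- Plain text files: one chararray column per line (assumed layout).
txt  = LOAD '*.txt' USING PigStorage('\t') AS (line:chararray);

-- Avro files: the loader derives the schema from the files themselves.
avro = LOAD '*.avro'
       USING org.apache.pig.piggybank.storage.avro.AvroStorage();

-- ONSCHEMA lines the two relations up by field name, which assumes the
-- Avro records also have a field called 'line'.
result = UNION ONSCHEMA txt, avro;
DUMP result;
```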

Ruslan

On Fri, Jun 15, 2012 at 4:52 PM, Johannes Schwenk
<jo...@adition.com> wrote:
> Hi Ruslan,
>
> Thanks for your answer!
>
> I only have the input path and do not know which file formats the
> individual files in that path have. All files in the path belong to one
> relation, however, so I want to load them at once. A union of separately
> loaded files would be OK too, if that is possible to achieve. What matters
> is that the LOAD automatically takes care of the different formats.
>
> To illustrate further, consider the following scenario:
>
> 1. Our logging system writes log data to LOG_PATH.
> 2. The current format is tab separated values.
> 3. We LOAD '$LOG_PATH'
> 4. We switch to Avro format and have to migrate.
> 5. The migration cannot happen instantly, so at some point in time some
> files in LOG_PATH may still be in TSV format while others have already
> been switched to Avro.
>
> Thanks,
> Johannes

Re: Mixed input formats in LOAD path

Posted by Johannes Schwenk <jo...@adition.com>.
Hi Ruslan,

Thanks for your answer!

I only have the input path and do not know which file formats the
individual files in that path have. All files in the path belong to one
relation, however, so I want to load them at once. A union of separately
loaded files would be OK too, if that is possible to achieve. What matters
is that the LOAD automatically takes care of the different formats.

To illustrate further, consider the following scenario:

1. Our logging system writes log data to LOG_PATH.
2. The current format is tab separated values.
3. We LOAD '$LOG_PATH'
4. We switch to Avro format and have to migrate.
5. The migration cannot happen instantly, so at some point in time some
files in LOG_PATH may still be in TSV format while others have already
been switched to Avro.
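
Steps 1 to 3 amount to a script along the following lines. The default path is hypothetical and the loader is assumed to be PigStorage with a tab delimiter. Once Avro files appear in the same directory (step 5), this single LOAD would hand their binary content to PigStorage as if it were text.

```pig
-- Hypothetical default; normally supplied via: pig -param LOG_PATH=...
%default LOG_PATH /data/logs

-- A single LOAD over the whole directory, assuming every file is TSV.
logs = LOAD '$LOG_PATH' USING PigStorage('\t');
DUMP logs;
```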

Thanks,
Johannes

On 15.06.2012 14:37, Ruslan Al-Fakikh wrote:
> Hi Johannes,
> 
> I guess you'd have to write a custom Loader for such a situation, but
> why do you need to load everything in one pass? You can load the different
> types of files separately (using multiple LOAD statements) and do a
> join or a union afterwards.
> 
> Ruslan


Re: Mixed input formats in LOAD path

Posted by Ruslan Al-Fakikh <ru...@jalent.ru>.
Hi Johannes,

I guess you'd have to write a custom Loader for such a situation, but
why do you need to load everything in one pass? You can load the different
types of files separately (using multiple LOAD statements) and do a
join or a union afterwards.
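
As a rough sketch of the separate-LOAD approach: the paths, field names and types below are hypothetical (nothing in the thread specifies them), and PigStorage is used for both inputs only to keep the example self-contained.

```pig
-- Hypothetical inputs: one directory of tab-separated files and one of
-- comma-separated files that share the same three fields.
tsv_logs = LOAD '/data/logs_tsv' USING PigStorage('\t')
           AS (ts:long, user:chararray, url:chararray);
csv_logs = LOAD '/data/logs_csv' USING PigStorage(',')
           AS (ts:long, user:chararray, url:chararray);

-- Both relations declare the same schema, so a plain UNION combines them
-- into a single alias.
logs = UNION tsv_logs, csv_logs;
DUMP logs;
```

If the two inputs name or order their fields differently, UNION ONSCHEMA (which requires both relations to have a schema) merges them by field name instead.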

Ruslan

On Fri, Jun 15, 2012 at 4:13 PM, Johannes Schwenk
<jo...@adition.com> wrote:
> Hi all,
>
> Is it possible to have an input path (as the parameter to a LOAD statement)
> that contains several files in *different formats*, say serialized Avro
> data and tab-separated values, and have Pig read the data into one alias?
> I guess I have to write a UDF for this? How should I start? Can you
> sketch out a rough plan for how to proceed?
>
>
> Greetings,
> Johannes Schwenk