Posted to user@pig.apache.org by Yulia Tolskaya <yu...@magnetic.is> on 2012/01/09 07:45:00 UTC

Loading several files

Hello, 
I am wondering if there is a way for me to load multiple files into Pig while still keeping track of which record came from which file. To give some background, I have about half a million files with one phrase per line, and I need to note which document each phrase belongs to.

Thanks for your help!
Yulia

Re: Loading several files

Posted by Yulia Tolskaya <yu...@magnetic.is>.
I think my problem might be related to this bug:
https://issues.apache.org/jira/browse/PIG-2462
The loader's code uses getWrappedSplit().


On Jan 9, 2012, at 2:24 PM, Yulia Tolskaya wrote:

yep!


Re: Loading several files

Posted by Yulia Tolskaya <yu...@magnetic.is>.
Hi Daniel, 
It does work without setting splitCombination to false now.

Thank you for all the help!
Yulia
On 1/13/12 3:19 AM, "Daniel Dai" <da...@hortonworks.com> wrote:

>Hi, Yulia,
>I don't know what happened, but after
>https://issues.apache.org/jira/browse/PIG-2462, you don't need to disable
>splitCombination to make the code work. It might be related to your
>issue. We will commit this patch soon, and it should be part of the 0.9.2
>release. It would be great if you could try again and give us feedback.
>
>Thanks,
>Daniel


Re: Loading several files

Posted by Daniel Dai <da...@hortonworks.com>.
Hi, Yulia,
I don't know what happened, but after
https://issues.apache.org/jira/browse/PIG-2462, you don't need to disable
splitCombination to make the code work. It might be related to your
issue. We will commit this patch soon, and it should be part of the 0.9.2
release. It would be great if you could try again and give us feedback.

Thanks,
Daniel

On Mon, Jan 9, 2012 at 11:32 AM, Yulia Tolskaya <yu...@magnetic.is> wrote:

> It is a particular file in the directory;

Re: Loading several files

Posted by Yulia Tolskaya <yu...@magnetic.is>.
It is a particular file in the directory.
On Jan 9, 2012, at 2:27 PM, Daniel Dai wrote:

> Which path did you get? Directory or a particular file in the directory?


Re: Loading several files

Posted by Daniel Dai <da...@hortonworks.com>.
Which path did you get? Directory or a particular file in the directory?

On Mon, Jan 9, 2012 at 11:24 AM, Yulia Tolskaya <yu...@magnetic.is> wrote:

> yep!

Re: Loading several files

Posted by Yulia Tolskaya <yu...@magnetic.is>.
yep!

On Jan 9, 2012, at 2:09 PM, Daniel Dai wrote:

> Did you set "pig.splitCombination" to false?


Re: Loading several files

Posted by Daniel Dai <da...@hortonworks.com>.
Did you set "pig.splitCombination" to false?
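
For reference, a minimal sketch of how this property is typically set at the top of a Pig script. The jar name, the loader class name, and the input path are hypothetical placeholders and do not come from this thread; the loader is assumed to be a custom LoadFunc, along the lines of the FAQ entry, that appends the source path to each tuple. According to the follow-up in this thread, disabling split combination should no longer be needed once the PIG-2462 fix is in place.

```pig
-- Disable split combination so small input files are not merged into one split.
-- (Per this thread, only needed on versions without the PIG-2462 fix.)
SET pig.splitCombination false;

-- Hypothetical jar and loader class; replace with your own.
REGISTER my-loaders.jar;

-- Hypothetical input directory; the loader is assumed to append the file path
-- as the last field of every record.
phrases = LOAD '/data/phrases' USING my.loaders.PathTaggingLoader()
          AS (phrase:chararray, source:chararray);
```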

On Mon, Jan 9, 2012 at 10:38 AM, Yulia Tolskaya <yu...@magnetic.is> wrote:

> Thank you for your response!
> I am trying to use the loader you suggested, and I keep running into
> problems. For some reason I keep getting the same file name for all files
> in the folder. I do not understand why this is happening!
>
> Yulia

Re: Loading several files

Posted by Yulia Tolskaya <yu...@magnetic.is>.
Thank you for your response!
I am trying to use the loader you suggested, and I keep running into problems. For some reason I keep getting the same file name for all files in the folder. I do not understand why this is happening!

Yulia
On Jan 9, 2012, at 1:57 AM, Daniel Dai wrote:

> Check
> https://cwiki.apache.org/confluence/display/PIG/FAQ#FAQ-Q%3AIloaddatafromadirectorywhichcontainsdifferentfile.HowdoIfindoutwherethedatacomesfrom%3F
> 
> Daniel


Re: Loading several files

Posted by Daniel Dai <da...@hortonworks.com>.
Check this Pig FAQ entry:
https://cwiki.apache.org/confluence/display/PIG/FAQ#FAQ-Q%3AIloaddatafromadirectorywhichcontainsdifferentfile.HowdoIfindoutwherethedatacomesfrom%3F

Daniel

On Sun, Jan 8, 2012 at 10:45 PM, Yulia Tolskaya <yu...@magnetic.is> wrote:

> Hello,
> I am wondering if there is a way for me to load multiple files into Pig,
> while still keeping track of which record came from which file. To give some
> background, I have about half a million files of one phrase per line, and I
> need to note which document each phrase belongs to.
>
> Thanks for your help!
> Yulia