Posted to user@pig.apache.org by Thomas Bach <th...@students.uni-mainz.de> on 2012/12/18 21:00:15 UTC
Limit number of Streaming Programs
Hi,
I have around 4 million time series. ~1000 of them had a special
occurrence at some point. Now, I want to draw 10 samples for each
special time-series based on a similarity comparison.
What I have currently implemented is a script in Python which consumes
time-series one-by-one and does a comparison with all 1000 special
time-series. If the similarity is sufficient with one of them, I pass
it back to Pig and strike out the corresponding special time-series;
subsequent time-series will not be compared against this one.
This routine runs, but it takes around 6 hours.
One of the problems I'm facing is that Pig starts >160 scripts
although 10 would be sufficient. Is there some way to define the
number of scripts Pig starts in a `STREAM THROUGH` step? I tried to
set default_parallel to 10, but it doesn't seem to have any effect.
I'm also open to any other ideas on how to accomplish the task.
Regards,
Thomas Bach.
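A minimal sketch of the streaming routine described in the post above, for illustration only. The post does not say which similarity measure, threshold, or record layout is used, so `mean_abs_diff`, `max_distance`, `samples_per_special`, and the (id, values) tuples are all hypothetical. Setting `samples_per_special=1` reproduces the "strike out on first match" behaviour; the default of 10 follows the stated goal of 10 samples per special time-series.

```python
from typing import Callable, Dict, Iterable, Iterator, Sequence, Tuple

Series = Sequence[float]


def mean_abs_diff(a: Series, b: Series) -> float:
    """Hypothetical distance: mean absolute difference over the common prefix."""
    n = min(len(a), len(b))
    if n == 0:
        return float("inf")
    return sum(abs(x - y) for x, y in zip(a, b)) / n


def draw_samples(
    candidates: Iterable[Tuple[str, Series]],
    specials: Dict[str, Series],
    max_distance: float,
    samples_per_special: int = 10,
    distance: Callable[[Series, Series], float] = mean_abs_diff,
) -> Iterator[Tuple[str, str]]:
    """Yield (special_id, candidate_id) pairs for sufficiently similar series.

    Candidates are consumed one by one. A special series is struck out once
    it has collected its quota, so later candidates are no longer compared
    against it.
    """
    remaining = {sid: samples_per_special for sid in specials}
    for cid, series in candidates:
        # Iterate over a snapshot because a special may be struck out below.
        for sid in list(remaining):
            if distance(series, specials[sid]) <= max_distance:
                yield sid, cid
                remaining[sid] -= 1
                if remaining[sid] == 0:
                    del remaining[sid]  # strike out this special series
                break  # each candidate is passed back at most once
```

In a real `STREAM THROUGH` script the candidates would be parsed from standard input and the pairs written to standard output. Note that every streaming process keeps its own `remaining` table, so the quota is enforced per process, not globally; that is one reason why the number of processes Pig starts matters here.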
Re: Limit number of Streaming Programs
Posted by Prasanth J <bu...@gmail.com>.
Hi Kshiva,
There are several Pig Latin plugins for different IDEs and editors. Check out https://cwiki.apache.org/PIG/pigtools.html
Thanks
-- Prasanth
On Dec 25, 2012, at 11:09 AM, Kshiva Kps <ks...@gmail.com> wrote:
> Hi,
>
> Are there any Pig editors in which we can write 100 to 150 Pig scripts?
> I believe that is not practical in CLI mode.
> Something like an IDE for Java or TOAD for SQL. Please advise, many thanks.
>
> Thanks
Re: Limit number of Streaming Programs
Posted by Mohammad Tariq <do...@gmail.com>.
Folks on the list need some time, mate. I have posted a couple of links
on your other thread. Check them out and see if they help.
Best Regards,
Tariq
+91-9741563634
https://mtariq.jux.com/
On Tue, Dec 25, 2012 at 11:09 AM, Kshiva Kps <ks...@gmail.com> wrote:
> Hi,
>
> Are there any Pig editors in which we can write 100 to 150 Pig scripts?
> I believe that is not practical in CLI mode.
> Something like an IDE for Java or TOAD for SQL. Please advise, many thanks.
>
> Thanks
Re: Limit number of Streaming Programs
Posted by Kshiva Kps <ks...@gmail.com>.
Hi,
Are there any Pig editors in which we can write 100 to 150 Pig scripts?
I believe that is not practical in CLI mode.
Something like an IDE for Java or TOAD for SQL. Please advise, many thanks.
Thanks
On Tue, Dec 25, 2012 at 3:45 AM, Cheolsoo Park <ch...@cloudera.com> wrote:
> Hi Thomas,
>
> If I understand your question correctly, what you want is to reduce the number
> of mappers that spawn streaming processes. default_parallel controls
> the number of reducers, so it won't have any effect on the number of
> mappers. Although the number of mappers is determined automatically by the size
> of the input data, you can try setting "pig.maxCombinedSplitSize" to combine
> small input files into larger splits. For more details, please refer to:
> http://pig.apache.org/docs/r0.10.0/perf.html#combine-files
>
> You can also read a discussion on a similar topic here:
>
> http://search-hadoop.com/m/J5hCw1UdxTa/How+can+I+set+the+mapper+number&subj=How+can+I+set+the+mapper+number+for+pig+script+
>
> Thanks,
> Cheolsoo
Re: Limit number of Streaming Programs
Posted by Cheolsoo Park <ch...@cloudera.com>.
Hi Thomas,
If I understand your question correctly, what you want is to reduce the number
of mappers that spawn streaming processes. default_parallel controls
the number of reducers, so it won't have any effect on the number of
mappers. Although the number of mappers is determined automatically by the size
of the input data, you can try setting "pig.maxCombinedSplitSize" to combine
small input files into larger splits. For more details, please refer to:
http://pig.apache.org/docs/r0.10.0/perf.html#combine-files
You can also read a discussion on a similar topic here:
http://search-hadoop.com/m/J5hCw1UdxTa/How+can+I+set+the+mapper+number&subj=How+can+I+set+the+mapper+number+for+pig+script+
Thanks,
Cheolsoo
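As a rough illustration of the suggestion above, the following sketch estimates which value of "pig.maxCombinedSplitSize" would bring the mapper count (and so the number of streaming processes) down to a target. It assumes that Pig combines splits until each one reaches the configured size and ignores data locality, so the real count may differ. The helper names and the example figures (160 splits of 64 MiB each) are hypothetical and not taken from the thread.

```python
def combined_split_size(total_input_bytes: int, target_mappers: int) -> int:
    """Smallest split size in bytes that yields at most target_mappers splits."""
    if total_input_bytes <= 0 or target_mappers <= 0:
        raise ValueError("both arguments must be positive")
    # Integer ceiling division avoids floating-point rounding on large sizes.
    return -(-total_input_bytes // target_mappers)


def estimated_mappers(total_input_bytes: int, split_size: int) -> int:
    """Estimated mapper count if all input is packed into splits of split_size."""
    if total_input_bytes <= 0 or split_size <= 0:
        raise ValueError("both arguments must be positive")
    return -(-total_input_bytes // split_size)


def pig_set_statement(split_size: int) -> str:
    """Pig Latin SET statement that applies the split size to a script."""
    return f"SET pig.maxCombinedSplitSize {split_size};"


# Hypothetical input: 160 splits of 64 MiB each, to be handled by 10 mappers.
total = 160 * 64 * 1024 * 1024
size = combined_split_size(total, target_mappers=10)
print(pig_set_statement(size))
print(estimated_mappers(total, size))
```

The generated statement would go at the top of the Pig script, before the `STREAM THROUGH` step. Whether the cluster then starts exactly the target number of mappers depends on how the input is laid out, so it is worth checking the job's map task count after the change.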