Posted to user@pig.apache.org by Haitao Yao <ya...@gmail.com> on 2012/08/09 16:51:25 UTC

Re: What is the best way to do counting in pig?

Hey, all, I've submitted the patch for PIG-2812; here's the link: https://issues.apache.org/jira/browse/PIG-2812

I didn't change the data bags to spill into only one file, since that's a very big modification. Instead, I let DefaultAbstractDataBag spill into one directory and delete the directory recursively with a ShutdownHook.
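The directory-plus-shutdown-hook approach described above could look roughly like this (a minimal sketch with hypothetical names, not the actual patch code):

```java
import java.io.File;

public class SpillDirCleanup {

    // Recursively delete a directory tree; File.delete() only removes
    // files and empty directories, so children must be removed first.
    public static void deleteRecursively(File f) {
        File[] children = f.listFiles();
        if (children != null) {
            for (File c : children) {
                deleteRecursively(c);
            }
        }
        f.delete();
    }

    public static void main(String[] args) throws Exception {
        // One spill directory per JVM instead of one deleteOnExit() entry
        // per spill file (directory and file names here are hypothetical).
        final File spillDir = new File(System.getProperty("java.io.tmpdir"),
                "pig-spill-" + System.nanoTime());
        spillDir.mkdirs();
        new File(spillDir, "bag-0.spill").createNewFile();

        // A single shutdown hook cleans the whole tree, so bookkeeping
        // memory does not grow with the number of spill files.
        Runtime.getRuntime().addShutdownHook(new Thread() {
            @Override
            public void run() {
                deleteRecursively(spillDir);
            }
        });
    }
}
```

With this scheme the per-spill cost is one file in the directory, not one permanently retained entry in java.io.DeleteOnExitHook.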




Haitao Yao
yao.erix@gmail.com
weibo: @haitao_yao
Skype:  haitao.yao.final

On 2012-7-12, at 11:04 AM, Haitao Yao wrote:

> Sorry, here's the full mail.
> 
> 
> > Is your query using combiner ?
> 	I didn't know how to explicitly use the combiner.
> 
> > Can you send the explain plan output ?
> 	The explain result is in the attachment. It's a little long. link: http://pastebin.com/Q6CvKiP1
> 	
> <aa.explain>
> 
> 
> > Does the heap information say how many entries are there in the
> InteralCachedBag's ArrayList ?
> 	There are 6 big ArrayLists, and the size is about 372692 entries.
> 	Here's the screen snapshot of the heap dump:
> 
> 	screen snapshot 1: you can see there are 6 big POForEach instances
> 	
> <aa.jpg>
> 
> 
> 		screen snapshot 2: you can see the memory is mostly retained by the big ArrayList.
> 		
> <bb.jpg>
> 
> 
> 		screen snapshot 3: you can see the big ArrayList is referenced by InternalCachedBag:
> 			
> <cc.jpg>
> 
> 	
> > What version of pig are you using?
> 	pig-0.9.2. I've read the latest source code of Pig from GitHub, and I don't find any improvements on InternalCachedBag.
> 
> 
> 
> On 2012-7-12, at 10:58 AM, Jonathan Coveney wrote:
> 
>> the listserv strips attachments. you'll have to host it somewhere else and
>> link it
>> 
>> 2012/7/11 Haitao Yao <ya...@gmail.com>
>> 
>>> Sorry, I sent the mail only to Thejas.
>>> 
>>> Resend it for all.
>>> 
>>> 
>>> Haitao Yao
>>> yao.erix@gmail.com
>>> weibo: @haitao_yao
>>> Skype:  haitao.yao.final
>>> 
>>> On 2012-7-12, at 10:41 AM, Haitao Yao wrote:
>>> 
>>>> 
>>>> On 2012-7-12, at 8:56 AM, Thejas Nair wrote:
>>>> 
>>>>> Haitao,
>>>>> Is your query using combiner ? Can you send the explain plan output ?
>>>>> Does the heap information say how many entries are there in the
>>>>> InteralCachedBag's ArrayList ?
>>>>> What version of pig are you using ?
>>>>> 
>>>>> 
>>>>> Thanks,
>>>>> Thejas
>>>>> 
>>>>> 
>>>>> On 7/10/12 11:50 PM, Haitao Yao wrote:
>>>>>> Oh, new discovery: we cannot set pig.cachedbag.memusage = 0, because
>>>>>> every time the InternalCachedBag spills, it creates a new tmp file in
>>>>>> java.io.tmpdir. If we set pig.cachedbag.memusage to 0, every new tuple
>>>>>> added into InternalCachedBag will create a new tmp file, and each tmp
>>>>>> file is only deleted on exit.
>>>>>> So, if you're unlucky like me, you will get an OOM Exception caused by
>>>>>> java.io.DeleteOnExitHook!
>>>>>> Here's the evidence:
>>>>>> 
>>>>>> God, we really need a full description of how every parameter works.
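The failure mode above comes from File.deleteOnExit(): each call records the path in the JVM-internal java.io.DeleteOnExitHook set, which is only drained at shutdown. A minimal sketch of how one-spill-per-tuple translates into unbounded growth (file names hypothetical, not Pig's actual spill code):

```java
import java.io.File;
import java.io.IOException;

public class DeleteOnExitLeakDemo {

    // Simulates one spill: a fresh temp file whose deletion is deferred
    // to JVM exit. The File object itself can be garbage-collected, but
    // the path string registered by deleteOnExit() cannot.
    public static File spillOnce() throws IOException {
        File tmp = File.createTempFile("spill-", ".tmp");
        tmp.deleteOnExit();  // path retained in DeleteOnExitHook until exit
        return tmp;
    }

    public static void main(String[] args) throws IOException {
        // With pig.cachedbag.memusage=0 every added tuple spills, so a
        // reducer handling millions of tuples registers millions of paths,
        // none of which are released before the JVM exits.
        for (int i = 0; i < 1000; i++) {
            spillOnce();
        }
    }
}
```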
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> Haitao Yao
>>>>>> yao.erix@gmail.com
>>>>>> weibo: @haitao_yao
>>>>>> Skype:  haitao.yao.final
>>>>>> 
>>>>>> On 2012-7-10, at 4:20 PM, Haitao Yao wrote:
>>>>>> 
>>>>>>> I found the solution.
>>>>>>> 
>>>>>>> After analyzing the heap dump taken while the reducer OOMed, I found
>>>>>>> out the memory is consumed by org.apache.pig.data.InternalCachedBag.
>>>>>>> Here's the diagram:
>>>>>>> <cc.jpg>
>>>>>>> 
>>>>>>> In the source code of org.apache.pig.data.InternalCachedBag, I found
>>>>>>> out there's a parameter for the cache limit:
>>>>>>> public InternalCachedBag(int bagCount) {
>>>>>>>     float percent = 0.2F;
>>>>>>> 
>>>>>>>     if (PigMapReduce.sJobConfInternal.get() != null) {
>>>>>>>         // here, the cache limit is from here!
>>>>>>>         String usage = PigMapReduce.sJobConfInternal.get()
>>>>>>>                 .get("pig.cachedbag.memusage");
>>>>>>>         if (usage != null) {
>>>>>>>             percent = Float.parseFloat(usage);
>>>>>>>         }
>>>>>>>     }
>>>>>>> 
>>>>>>>     init(bagCount, percent);
>>>>>>> }
>>>>>>> 
>>>>>>> private void init(int bagCount, float percent) {
>>>>>>>     factory = TupleFactory.getInstance();
>>>>>>>     mContents = new ArrayList<Tuple>();
>>>>>>> 
>>>>>>>     long max = Runtime.getRuntime().maxMemory();
>>>>>>>     maxMemUsage = (long) (((float) max * percent) / (float) bagCount);
>>>>>>>     cacheLimit = Integer.MAX_VALUE;
>>>>>>> 
>>>>>>>     // set limit to 0, if memusage is 0 or really really small.
>>>>>>>     // then all tuples are put into disk
>>>>>>>     if (maxMemUsage < 1) {
>>>>>>>         cacheLimit = 0;
>>>>>>>     }
>>>>>>>     log.warn("cacheLimit: " + this.cacheLimit);
>>>>>>>     addDone = false;
>>>>>>> }
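Plugging in the numbers from this thread (a 512 MB reducer heap, the default 0.2, and the six bags seen in the heap dump) shows how little memory each bag gets before it must spill. The helper below just reproduces the maxMemUsage arithmetic from init() above:

```java
public class CacheLimitMath {

    // Same formula as InternalCachedBag.init(): a fraction of the max
    // heap, split evenly across the bags alive in the task.
    public static long maxMemUsage(long maxHeap, float percent, int bagCount) {
        return (long) (((float) maxHeap * percent) / (float) bagCount);
    }

    public static void main(String[] args) {
        long heap = 512L * 1024 * 1024;  // -Xms512M -Xmx512M, as in the thread
        // default pig.cachedbag.memusage=0.2, six bags as in the heap dump
        System.out.println(maxMemUsage(heap, 0.2F, 6) + " bytes per bag (~17 MB)");
        // pig.cachedbag.memusage=0 drives this below 1, i.e. cacheLimit = 0
        System.out.println(maxMemUsage(heap, 0.0F, 6) + " bytes per bag");
    }
}
```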
>>>>>>> 
>>>>>>> so, after writing pig.cachedbag.memusage=0 into
>>>>>>> $PIG_HOME/conf/pig.properties, my job succeeds!
>>>>>>> 
>>>>>>> You can also set it to an appropriate value to fully utilize your
>>>>>>> memory as a cache.
>>>>>>> 
>>>>>>> Hope this is useful for others.
>>>>>>> Thanks.
>>>>>>> 
>>>>>>> 
>>>>>>> Haitao Yao
>>>>>>> yao.erix@gmail.com
>>>>>>> weibo: @haitao_yao
>>>>>>> Skype:  haitao.yao.final
>>>>>>> 
>>>>>>> On 2012-7-10, at 1:06 PM, Haitao Yao wrote:
>>>>>>> 
>>>>>>>> my reducers get 512 MB: -Xms512M -Xmx512M.
>>>>>>>> The reducer does not get an OOM when I manually invoke spill in my case.
>>>>>>>> 
>>>>>>>> Can you explain more about your solution?
>>>>>>>> And can your solution fit into a 512MB reducer process?
>>>>>>>> Thanks very much.
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> Haitao Yao
>>>>>>>> yao.erix@gmail.com
>>>>>>>> weibo: @haitao_yao
>>>>>>>> Skype:  haitao.yao.final
>>>>>>>> 
>>>>>>>> On 2012-7-10, at 12:26 PM, Jonathan Coveney wrote:
>>>>>>>> 
>>>>>>>>> I have something in the mix that should reduce bag memory :)
>>>>>>>>> Question: how much memory are your reducers getting? In my
>>>>>>>>> experience, you'll get OOMs on spilling if you have allocated less
>>>>>>>>> than a gig to the JVM.
>>>>>>>>> 
>>>>>>>>> 2012/7/9 Haitao Yao <yao.erix@gmail.com>
>>>>>>>>> 
>>>>>>>>>> I have encountered a similar problem, and I got an OOM while
>>>>>>>>>> running the reducer.
>>>>>>>>>> I think the reason is that the data bag generated after group all
>>>>>>>>>> is too big to fit into the reducer's memory.
>>>>>>>>>> 
>>>>>>>>>> I have written a new COUNT implementation that explicitly invokes
>>>>>>>>>> System.gc() and spill() after the COUNT function finishes its job,
>>>>>>>>>> but it still gets an OOM.
>>>>>>>>>> 
>>>>>>>>>> here's the code of the new COUNT implementation:
>>>>>>>>>>     @Override
>>>>>>>>>>     public Long exec(Tuple input) throws IOException {
>>>>>>>>>>         DataBag bag = (DataBag) input.get(0);
>>>>>>>>>>         Long result = super.exec(input);
>>>>>>>>>>         LOG.warn("before spill data bag memory: "
>>>>>>>>>>                 + Runtime.getRuntime().freeMemory());
>>>>>>>>>>         bag.spill();
>>>>>>>>>>         System.gc();
>>>>>>>>>>         LOG.warn("after spill data bag memory: "
>>>>>>>>>>                 + Runtime.getRuntime().freeMemory());
>>>>>>>>>>         LOG.warn("big bag size: " + bag.size()
>>>>>>>>>>                 + ", hashcode: " + bag.hashCode());
>>>>>>>>>>         return result;
>>>>>>>>>>     }
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> I think we have to redesign the data bag implementation to
>>>>>>>>>> consume less memory.
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> Haitao Yao
>>>>>>>>>> yao.erix@gmail.com
>>>>>>>>>> weibo: @haitao_yao
>>>>>>>>>> Skype:  haitao.yao.final
>>>>>>>>>> 
>>>>>>>>>> On 2012-7-10, at 6:54 AM, Sheng Guo wrote:
>>>>>>>>>> 
>>>>>>>>>>> the pig script:
>>>>>>>>>>> 
>>>>>>>>>>> longDesc = load '/user/xx/filtered_chunk' USING AvroStorage();
>>>>>>>>>>> 
>>>>>>>>>>> grpall = group longDesc all;
>>>>>>>>>>> cnt = foreach grpall generate COUNT(longDesc) as allNumber;
>>>>>>>>>>> explain cnt;
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> the dump relation result:
>>>>>>>>>>> 
>>>>>>>>>>> #-----------------------------------------------
>>>>>>>>>>> # New Logical Plan:
>>>>>>>>>>> #-----------------------------------------------
>>>>>>>>>>> cnt: (Name: LOStore Schema: allNumber#65:long)
>>>>>>>>>>> |
>>>>>>>>>>> |---cnt: (Name: LOForEach Schema: allNumber#65:long)
>>>>>>>>>>> |   |
>>>>>>>>>>> |   (Name: LOGenerate[false] Schema: allNumber#65:long)ColumnPrune:InputUids=[63]ColumnPrune:OutputUids=[65]
>>>>>>>>>>> |   |   |
>>>>>>>>>>> |   |   (Name: UserFunc(org.apache.pig.builtin.COUNT) Type: long Uid: 65)
>>>>>>>>>>> |   |   |
>>>>>>>>>>> |   |   |---longDesc:(Name: Project Type: bag Uid: 63 Input: 0 Column: (*))
>>>>>>>>>>> |   |
>>>>>>>>>>> |   |---longDesc: (Name: LOInnerLoad[1] Schema: DISCUSSION_ID#41:long,COMMENT_COUNT#42:long,UNIQUE_COMMENTER_COUNT#43:long,ACTIVE_COMMENT_COUNT#44:long,LAST_ACTIVITY_AT#45:long,SUBJECT#46:chararray,SUBJECT_CHUNKS#47:chararray,LOCALE#48:chararray,STATE#49:chararray,DETAIL#50:chararray,DETAIL_CHUNKS#51:chararray,TOPIC_TITLE#52:chararray,TOPIC_TITLE_CHUNKS#53:chararray,TOPIC_DESCRIPTION#54:chararray,TOPIC_DESCRIPTION_CHUNKS#55:chararray,TOPIC_ATTRIBUTES#56:chararray)
>>>>>>>>>>> |
>>>>>>>>>>> |---grpall: (Name: LOCogroup Schema: group#62:chararray,longDesc#63:bag{#64:tuple(DISCUSSION_ID#41:long,COMMENT_COUNT#42:long,UNIQUE_COMMENTER_COUNT#43:long,ACTIVE_COMMENT_COUNT#44:long,LAST_ACTIVITY_AT#45:long,SUBJECT#46:chararray,SUBJECT_CHUNKS#47:chararray,LOCALE#48:chararray,STATE#49:chararray,DETAIL#50:chararray,DETAIL_CHUNKS#51:chararray,TOPIC_TITLE#52:chararray,TOPIC_TITLE_CHUNKS#53:chararray,TOPIC_DESCRIPTION#54:chararray,TOPIC_DESCRIPTION_CHUNKS#55:chararray,TOPIC_ATTRIBUTES#56:chararray)})
>>>>>>>>>>>     |   |
>>>>>>>>>>>     |   (Name: Constant Type: chararray Uid: 62)
>>>>>>>>>>>     |
>>>>>>>>>>>     |---longDesc: (Name: LOLoad Schema: DISCUSSION_ID#41:long,COMMENT_COUNT#42:long,UNIQUE_COMMENTER_COUNT#43:long,ACTIVE_COMMENT_COUNT#44:long,LAST_ACTIVITY_AT#45:long,SUBJECT#46:chararray,SUBJECT_CHUNKS#47:chararray,LOCALE#48:chararray,STATE#49:chararray,DETAIL#50:chararray,DETAIL_CHUNKS#51:chararray,TOPIC_TITLE#52:chararray,TOPIC_TITLE_CHUNKS#53:chararray,TOPIC_DESCRIPTION#54:chararray,TOPIC_DESCRIPTION_CHUNKS#55:chararray,TOPIC_ATTRIBUTES#56:chararray)RequiredFields:null
>>>>>>>>>>> 
>>>>>>>>>>> #-----------------------------------------------
>>>>>>>>>>> # Physical Plan:
>>>>>>>>>>> #-----------------------------------------------
>>>>>>>>>>> cnt: Store(fakefile:org.apache.pig.builtin.PigStorage) - scope-9
>>>>>>>>>>> |
>>>>>>>>>>> |---cnt: New For Each(false)[bag] - scope-8
>>>>>>>>>>> |   |
>>>>>>>>>>> |   POUserFunc(org.apache.pig.builtin.COUNT)[long] - scope-6
>>>>>>>>>>> |   |
>>>>>>>>>>> |   |---Project[bag][1] - scope-5
>>>>>>>>>>> |
>>>>>>>>>>> |---grpall: Package[tuple]{chararray} - scope-2
>>>>>>>>>>>     |
>>>>>>>>>>>     |---grpall: Global Rearrange[tuple] - scope-1
>>>>>>>>>>>         |
>>>>>>>>>>>         |---grpall: Local Rearrange[tuple]{chararray}(false) - scope-3
>>>>>>>>>>>             |   |
>>>>>>>>>>>             |   Constant(all) - scope-4
>>>>>>>>>>>             |
>>>>>>>>>>>             |---longDesc: Load(/user/sguo/h2o/group_filtered_chunk:LiAvroStorage) - scope-0
>>>>>>>>>>> 
>>>>>>>>>>> 2012-07-09 15:47:02,441 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MRCompiler - File concatenation threshold: 100 optimistic? false
>>>>>>>>>>> 2012-07-09 15:47:02,448 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.CombinerOptimizer - Choosing to move algebraic foreach to combiner
>>>>>>>>>>> 2012-07-09 15:47:02,581 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - MR plan size before optimization: 1
>>>>>>>>>>> 2012-07-09 15:47:02,581 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - MR plan size after optimization: 1
>>>>>>>>>>> #--------------------------------------------------
>>>>>>>>>>> # Map Reduce Plan
>>>>>>>>>>> #--------------------------------------------------
>>>>>>>>>>> MapReduce node scope-10
>>>>>>>>>>> Map Plan
>>>>>>>>>>> grpall: Local Rearrange[tuple]{chararray}(false) - scope-22
>>>>>>>>>>> |   |
>>>>>>>>>>> |   Project[chararray][0] - scope-23
>>>>>>>>>>> |
>>>>>>>>>>> |---cnt: New For Each(false,false)[bag] - scope-11
>>>>>>>>>>> |   |
>>>>>>>>>>> |   Project[chararray][0] - scope-12
>>>>>>>>>>> |   |
>>>>>>>>>>> |   POUserFunc(org.apache.pig.builtin.COUNT$Initial)[tuple] - scope-13
>>>>>>>>>>> |   |
>>>>>>>>>>> |   |---Project[bag][1] - scope-14
>>>>>>>>>>> |
>>>>>>>>>>> |---Pre Combiner Local Rearrange[tuple]{Unknown} - scope-24
>>>>>>>>>>>     |
>>>>>>>>>>>     |---longDesc: Load(/user/sguo/h2o/group_filtered_chunk:LiAvroStorage) - scope-0
>>>>>>>>>>> --------
>>>>>>>>>>> Combine Plan
>>>>>>>>>>> grpall: Local Rearrange[tuple]{chararray}(false) - scope-26
>>>>>>>>>>> |   |
>>>>>>>>>>> |   Project[chararray][0] - scope-27
>>>>>>>>>>> |
>>>>>>>>>>> |---cnt: New For Each(false,false)[bag] - scope-15
>>>>>>>>>>> |   |
>>>>>>>>>>> |   Project[chararray][0] - scope-16
>>>>>>>>>>> |   |
>>>>>>>>>>> |   POUserFunc(org.apache.pig.builtin.COUNT$Intermediate)[tuple] - scope-17
>>>>>>>>>>> |   |
>>>>>>>>>>> |   |---Project[bag][1] - scope-18
>>>>>>>>>>> |
>>>>>>>>>>> |---POCombinerPackage[tuple]{chararray} - scope-20
>>>>>>>>>>> --------
>>>>>>>>>>> Reduce Plan
>>>>>>>>>>> cnt: Store(fakefile:org.apache.pig.builtin.PigStorage) - scope-9
>>>>>>>>>>> |
>>>>>>>>>>> |---cnt: New For Each(false)[bag] - scope-8
>>>>>>>>>>> |   |
>>>>>>>>>>> |   POUserFunc(org.apache.pig.builtin.COUNT$Final)[long] - scope-6
>>>>>>>>>>> |   |
>>>>>>>>>>> |   |---Project[bag][1] - scope-19
>>>>>>>>>>> |
>>>>>>>>>>> |---POCombinerPackage[tuple]{chararray} - scope-28
>>>>>>>>>>> --------
>>>>>>>>>>> Global sort: false
>>>>>>>>>>> ----------------
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> On Tue, Jul 3, 2012 at 9:56 AM, Jonathan Coveney
>>>>>>>>>>> <jcoveney@gmail.com> wrote:
>>>>>>>>>>> 
>>>>>>>>>>>> instead of doing "dump relation," do "explain relation" (then run
>>>>>>>>>>>> identically) and paste the output here. It will show whether the
>>>>>>>>>> combiner
>>>>>>>>>>>> is being used,
>>>>>>>>>>>> 
>>>>>>>>>>>> 2012/7/3 Ruslan Al-Fakikh <ruslan.al-fakikh@jalent.ru>
>>>>>>>>>>>> 
>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>> 
>>>>>>>>>>>>> As it was said, COUNT is algebraic and should be fast, because
>>>>>>>>>>>>> it forces the combiner. You should make sure that the combiner
>>>>>>>>>>>>> is really used here; it can be disabled in some situations. I've
>>>>>>>>>>>>> encountered such situations many times, where a job is too heavy
>>>>>>>>>>>>> because no combiner is applied.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Ruslan
>>>>>>>>>>>>> 
>>>>>>>>>>>>> On Tue, Jul 3, 2012 at 1:35 AM, Subir S
>>>>>>>>>>>>> <subir.sasikumar@gmail.com> wrote:
>>>>>>>>>>>>>> Right!!
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> Since it is mentioned that the job is hanging, my wild guess is
>>>>>>>>>>>>>> it must be the 'group all'. How can that be confirmed?
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> On 7/3/12, Jonathan Coveney <jcoveney@gmail.com> wrote:
>>>>>>>>>>>>>>> group all uses a single reducer, but COUNT is algebraic, and
>>>>>>>>>>>>>>> as such, will use combiners, so it is generally quite fast.
>>>>>>>>>>>>>>> 
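The map-side collapse that makes algebraic COUNT fast can be sketched outside Pig: each record yields a partial count of 1, combiners sum partials on the map side, and the single reducer only sums a handful of pre-combined values. This is a plain-Java illustration of the idea, not Pig's actual COUNT source:

```java
import java.util.Arrays;
import java.util.List;

public class AlgebraicCountSketch {

    // Initial stage: each input record contributes a partial count of 1.
    public static long initial() {
        return 1L;
    }

    // Intermediate/Final stages: partial counts are summed. Because this
    // runs in the combiner, the lone reducer never sees the raw records.
    public static long combine(List<Long> partials) {
        long sum = 0;
        for (long p : partials) {
            sum += p;
        }
        return sum;
    }

    public static void main(String[] args) {
        // map task: three records folded to one partial count by the combiner
        long mapPartial = combine(Arrays.asList(initial(), initial(), initial()));
        // reduce task: sums one partial per map task (here: 3 + 2)
        long total = combine(Arrays.asList(mapPartial, 2L));
        System.out.println(total); // 5
    }
}
```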
>>>>>>>>>>>>>>> 2012/7/2 Subir S <subir.sasikumar@gmail.com
>>>>>>>>>>>>>>> <ma...@gmail.com>>
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> Group all uses a single reducer, AFAIU. You can try to count
>>>>>>>>>>>>>>>> per group and then sum, maybe.
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> You may also try with COUNT_STAR to include NULL fields.
>>>>>>>>>>>>>>>> 
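The count-per-group-then-sum suggestion above might look like this in Pig Latin (a sketch reusing the relation names from the script quoted later in this thread; untested):

```pig
m_skill_group = GROUP m_skills_filter BY member_id;
-- one counted row per member instead of one giant bag for 'all'
per_member = FOREACH m_skill_group GENERATE group AS member_id,
             COUNT(m_skills_filter) AS c;
grpd = GROUP per_member ALL;
-- total records = sum of per-member counts;
-- number of members = COUNT(per_member)
cnt = FOREACH grpd GENERATE SUM(per_member.c), COUNT(per_member);
```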
>>>>>>>>>>>>>>>> On 7/3/12, Sheng Guo <enigmaguo@gmail.com> wrote:
>>>>>>>>>>>>>>>>> Hi all,
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> I used to use the following pig script to do the counting
>>>>>>>>>>>>>>>>> of the records.
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> m_skill_group = group m_skills_filter by member_id;
>>>>>>>>>>>>>>>>> grpd = group m_skill_group all;
>>>>>>>>>>>>>>>>> cnt = foreach grpd generate COUNT(m_skill_group);
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> cnt_filter = limit cnt 10;
>>>>>>>>>>>>>>>>> dump cnt_filter;
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> but sometimes, when the records get larger, it takes lots
>>>>>>>>>>>>>>>>> of time and hangs up, or dies.
>>>>>>>>>>>>>>>>> I thought counting should be simple enough, so what is the
>>>>>>>>>>>>>>>>> best way to do counting in Pig?
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> Thanks!
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> Sheng
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>>>>> --
>>>>>>>>>>>>> Best Regards,
>>>>>>>>>>>>> Ruslan Al-Fakikh
>>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>> 
>>>>>>> 
>>>>>> 
>>>>> 
>>>>> 
>>>> 
>>> 
>>> 
>