Posted to user@pig.apache.org by Mix Nin <pi...@gmail.com> on 2013/05/13 19:51:48 UTC

Number of records in an HDFS file

Hello,

What is the best way to get the count of records in an HDFS file generated
by a Pig script?

Thanks

Re: Number of records in an HDFS file

Posted by Rahul Bhattacharjee <ra...@gmail.com>.

I am just spitballing here.

You might want to override the FileOutputCommitter's commitJob method so that,
while committing the job, it writes the value of the job's output record
counter (I think there is a standard counter that gives the number of records
output by the job) to a file in HDFS.

Not sure if we can plug a custom FileOutputCommitter into a Pig workflow.
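
For what it's worth, the counter can probably also be read from the command
line once the job has finished, without a custom committer. A minimal Python
sketch, assuming a Hadoop 1.x client on the PATH whose
`hadoop job -counter <job-id> <group-name> <counter-name>` command prints the
value as the last line of its output. The function names are made up for
illustration, and the job ID, group name and counter name are placeholders
(the group name in particular differs between Hadoop versions):

```python
import subprocess


def counter_command(job_id, group, counter):
    """Build the `hadoop job -counter` argument list for one counter."""
    return ["hadoop", "job", "-counter", job_id, group, counter]


def parse_counter_value(output):
    """Return the last non-empty line of the command output as an int.

    Assumption: the counter value is printed on the final line.
    """
    lines = [line.strip() for line in output.splitlines() if line.strip()]
    return int(lines[-1])


def read_counter(job_id, group, counter):
    """Run the command and return the counter value (needs a Hadoop client)."""
    result = subprocess.run(
        counter_command(job_id, group, counter),
        check=True,
        capture_output=True,
        text=True,
    )
    return parse_counter_value(result.stdout)
```

The job ID would have to be taken from the Pig run's log output, which is the
fragile part of this approach.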

Another option: you can add a statement to the same Pig script we are
talking about that gets the count of the final bag and stores it in a
file. Can you not?

Thanks,
Rahul


On Mon, May 13, 2013 at 11:46 PM, Mix Nin <pi...@gmail.com> wrote:

> Ok, let me modify my requirement. I should have specified this at the
> beginning.
>
> I need to get the count of records in an HDFS file created by a Pig script
> and then store the count in a text file. This should be done automatically
> on a daily basis without manual intervention.
>
>
> On Mon, May 13, 2013 at 11:13 AM, Rahul Bhattacharjee <
> rahul.rec.dgp@gmail.com> wrote:
>
>> How about the second approach: get the application/job ID that Pig
>> creates and submits to the cluster, and then find the job output counter
>> for that job from the JobTracker (JT).
>>
>> Thanks,
>> Rahul
>>
>>
>> On Mon, May 13, 2013 at 11:37 PM, Mix Nin <pi...@gmail.com> wrote:
>>
>>> It is a text file.
>>>
>>> If we want to use wc, we need to copy the file from HDFS and then run wc,
>>> and this may take time. Is there a way to do it without copying the file
>>> from HDFS to a local directory?
>>>
>>> Thanks
>>>
>>>
>>> On Mon, May 13, 2013 at 11:04 AM, Rahul Bhattacharjee <
>>> rahul.rec.dgp@gmail.com> wrote:
>>>
>>>> A few pointers.
>>>>
>>>> What kind of files are we talking about? For text you can use wc; for
>>>> Avro data files you can use avro-tools.
>>>>
>>>> Or find the job that Pig generates and get the counters for that job
>>>> from the JT of your Hadoop cluster.
>>>>
>>>> Thanks,
>>>>  Rahul
>>>>
>>>>
>>>> On Mon, May 13, 2013 at 11:21 PM, Mix Nin <pi...@gmail.com> wrote:
>>>>
>>>>> Hello,
>>>>>
>>>>> What is the best way to get the count of records in an HDFS file
>>>>> generated by a Pig script?
>>>>>
>>>>> Thanks
>>>>>
>>>>>
>>>>
>>>
>>
>

Re: Number of records in an HDFS file

Posted by Mohammad Tariq <do...@gmail.com>.

Agree with Shahab.

Warm Regards,
Tariq
cloudfront.blogspot.com


On Tue, May 14, 2013 at 12:32 AM, Shahab Yunus <sh...@gmail.com> wrote:

> The count file will be a very small file, right? Once it is generated on
> HDFS, you can automate its downloading or movement anywhere you want. This
> should not take much time.
>
> Regards,
> Shahab
>
>
> On Mon, May 13, 2013 at 2:58 PM, Mix Nin <pi...@gmail.com> wrote:
>
>> Hi,
>>
>> The final count file should reside in a local directory, not in an HDFS
>> directory. The above scripts will store the text file in an HDFS directory.
>> The count file would need to be sent to another team who do not work on
>> HDFS.
>>
>> Thanks
>>
>>
>>
>> On Mon, May 13, 2013 at 11:36 AM, Mohammad Tariq <do...@gmail.com> wrote:
>>
>>> If it is just counting the number of records in a file, then how about
>>> a short three-liner:
>>> LOGS = LOAD 'log';
>>> LOGS_GROUP = GROUP LOGS ALL;
>>> LOG_COUNT = FOREACH LOGS_GROUP GENERATE COUNT(LOGS);
>>>
>>> It did the trick for me.
>>>
>>> Warm Regards,
>>> Tariq
>>> cloudfront.blogspot.com
>>>
>>>
>>> On Mon, May 13, 2013 at 11:57 PM, Shahab Yunus <sh...@gmail.com> wrote:
>>>
>>>> Not terribly efficient, but off the top of my head: GROUP ALL and then do
>>>> a COUNT (or COUNT(*)). You can implement a follow-up script or add this to
>>>> the existing script once the file has been generated.
>>>>
>>>> Regards,
>>>> Shahab
>>>>
>>>>
>>>> On Mon, May 13, 2013 at 2:16 PM, Mix Nin <pi...@gmail.com> wrote:
>>>>
>>>>> Ok, let me modify my requirement. I should have specified this at the
>>>>> beginning.
>>>>>
>>>>> I need to get the count of records in an HDFS file created by a Pig script
>>>>> and then store the count in a text file. This should be done automatically
>>>>> on a daily basis without manual intervention.
>>>>>
>>>>>
>>>>> On Mon, May 13, 2013 at 11:13 AM, Rahul Bhattacharjee <
>>>>> rahul.rec.dgp@gmail.com> wrote:
>>>>>
>>>>>> How about the second approach: get the application/job ID that Pig
>>>>>> creates and submits to the cluster, and then find the job output counter
>>>>>> for that job from the JobTracker (JT).
>>>>>>
>>>>>> Thanks,
>>>>>> Rahul
>>>>>>
>>>>>>
>>>>>> On Mon, May 13, 2013 at 11:37 PM, Mix Nin <pi...@gmail.com> wrote:
>>>>>>
>>>>>>> It is a text file.
>>>>>>>
>>>>>>> If we want to use wc, we need to copy the file from HDFS and then run
>>>>>>> wc, and this may take time. Is there a way to do it without copying the
>>>>>>> file from HDFS to a local directory?
>>>>>>>
>>>>>>> Thanks
>>>>>>>
>>>>>>>
>>>>>>> On Mon, May 13, 2013 at 11:04 AM, Rahul Bhattacharjee <
>>>>>>> rahul.rec.dgp@gmail.com> wrote:
>>>>>>>
>>>>>>>> A few pointers.
>>>>>>>>
>>>>>>>> What kind of files are we talking about? For text you can use wc;
>>>>>>>> for Avro data files you can use avro-tools.
>>>>>>>>
>>>>>>>> Or find the job that Pig generates and get the counters for that
>>>>>>>> job from the JT of your Hadoop cluster.
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>>  Rahul
>>>>>>>>
>>>>>>>>
>>>>>>>> On Mon, May 13, 2013 at 11:21 PM, Mix Nin <pi...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> Hello,
>>>>>>>>>
>>>>>>>>> What is the best way to get the count of records in an HDFS file
>>>>>>>>> generated by a Pig script?
>>>>>>>>>
>>>>>>>>> Thanks
>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>

Re: Number of records in an HDFS file

Posted by Shahab Yunus <sh...@gmail.com>.

The count file will be a very small file, right? Once it is generated on
HDFS, you can automate its downloading or movement anywhere you want. This
should not take much time.
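
As a rough illustration of that automation, here is a minimal Python sketch.
It assumes the Pig script ends with something like
`STORE LOG_COUNT INTO 'log_count';` (the directory name is hypothetical), that
a Hadoop client is on the PATH, and that the stored output is a single line
holding the count. The function names are made up for this example:

```python
import subprocess
from pathlib import Path


def parse_count(part_file_text):
    """Extract the single count line that GROUP ALL + COUNT is assumed to write."""
    values = [line.strip() for line in part_file_text.splitlines() if line.strip()]
    if len(values) != 1:
        raise ValueError(f"expected one count line, got {len(values)}")
    return int(values[0])


def save_count(count, local_path):
    """Write the count to a local text file as one number plus a newline."""
    Path(local_path).write_text(f"{count}\n")


def fetch_count(hdfs_dir):
    """Read the part files under hdfs_dir via `hadoop fs -cat` (needs a Hadoop client)."""
    result = subprocess.run(
        ["hadoop", "fs", "-cat", hdfs_dir.rstrip("/") + "/part-*"],
        check=True,
        capture_output=True,
        text=True,
    )
    return parse_count(result.stdout)
```

A daily cron entry could then run the Pig script followed by something like
`save_count(fetch_count("log_count"), "/some/local/dir/count.txt")`, where
both paths are placeholders.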

Regards,
Shahab


On Mon, May 13, 2013 at 2:58 PM, Mix Nin <pi...@gmail.com> wrote:

> Hi,
>
> The final count file should reside in a local directory, not in an HDFS
> directory. The above scripts will store the text file in an HDFS directory.
> The count file would need to be sent to another team who do not work on HDFS.
>
> Thanks
>
>
>
> On Mon, May 13, 2013 at 11:36 AM, Mohammad Tariq <do...@gmail.com> wrote:
>
>> If it is just counting the number of records in a file, then how about
>> a short three-liner:
>> LOGS = LOAD 'log';
>> LOGS_GROUP = GROUP LOGS ALL;
>> LOG_COUNT = FOREACH LOGS_GROUP GENERATE COUNT(LOGS);
>>
>> It did the trick for me.
>>
>> Warm Regards,
>> Tariq
>> cloudfront.blogspot.com
>>
>>
>> On Mon, May 13, 2013 at 11:57 PM, Shahab Yunus <sh...@gmail.com> wrote:
>>
>>> Not terribly efficient, but off the top of my head: GROUP ALL and then do
>>> a COUNT (or COUNT(*)). You can implement a follow-up script or add this to
>>> the existing script once the file has been generated.
>>>
>>> Regards,
>>> Shahab
>>>
>>>
>>> On Mon, May 13, 2013 at 2:16 PM, Mix Nin <pi...@gmail.com> wrote:
>>>
>>>> Ok, let me modify my requirement. I should have specified this at the
>>>> beginning.
>>>>
>>>> I need to get the count of records in an HDFS file created by a Pig script
>>>> and then store the count in a text file. This should be done automatically
>>>> on a daily basis without manual intervention.
>>>>
>>>>
>>>> On Mon, May 13, 2013 at 11:13 AM, Rahul Bhattacharjee <
>>>> rahul.rec.dgp@gmail.com> wrote:
>>>>
>>>>> How about the second approach: get the application/job ID that Pig
>>>>> creates and submits to the cluster, and then find the job output counter
>>>>> for that job from the JobTracker (JT).
>>>>>
>>>>> Thanks,
>>>>> Rahul
>>>>>
>>>>>
>>>>> On Mon, May 13, 2013 at 11:37 PM, Mix Nin <pi...@gmail.com> wrote:
>>>>>
>>>>>> It is a text file.
>>>>>>
>>>>>> If we want to use wc, we need to copy the file from HDFS and then run wc,
>>>>>> and this may take time. Is there a way to do it without copying the file
>>>>>> from HDFS to a local directory?
>>>>>>
>>>>>> Thanks
>>>>>>
>>>>>>
>>>>>> On Mon, May 13, 2013 at 11:04 AM, Rahul Bhattacharjee <
>>>>>> rahul.rec.dgp@gmail.com> wrote:
>>>>>>
>>>>>>> A few pointers.
>>>>>>>
>>>>>>> What kind of files are we talking about? For text you can use wc;
>>>>>>> for Avro data files you can use avro-tools.
>>>>>>>
>>>>>>> Or find the job that Pig generates and get the counters for that
>>>>>>> job from the JT of your Hadoop cluster.
>>>>>>>
>>>>>>> Thanks,
>>>>>>>  Rahul
>>>>>>>
>>>>>>>
>>>>>>> On Mon, May 13, 2013 at 11:21 PM, Mix Nin <pi...@gmail.com> wrote:
>>>>>>>
>>>>>>>> Hello,
>>>>>>>>
>>>>>>>> What is the best way to get the count of records in an HDFS file
>>>>>>>> generated by a Pig script?
>>>>>>>>
>>>>>>>> Thanks
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>

Re: Number of records in an HDFS file

Posted by Mix Nin <pi...@gmail.com>.

Hi,

The final count file should reside in a local directory, not in an HDFS
directory. The above scripts will store the text file in an HDFS directory.
The count file would need to be sent to another team who do not work on HDFS.
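
One way this could look without storing the count in HDFS at all: stream the
text output through `hadoop fs -cat` and count lines on the client, much like
piping into `wc -l`. A minimal Python sketch, assuming a Hadoop client on the
PATH and newline-terminated text records; the function names and any paths
passed in are hypothetical:

```python
import subprocess


def count_lines(stream, chunk_size=1 << 20):
    """Count newline bytes in a binary stream, reading it in chunks."""
    total = 0
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        total += chunk.count(b"\n")
    return total


def count_hdfs_records(hdfs_path):
    """Stream hdfs_path through `hadoop fs -cat` and count its lines."""
    proc = subprocess.Popen(
        ["hadoop", "fs", "-cat", hdfs_path], stdout=subprocess.PIPE
    )
    try:
        total = count_lines(proc.stdout)
    finally:
        proc.stdout.close()
    if proc.wait() != 0:
        raise RuntimeError(f"hadoop fs -cat failed for {hdfs_path}")
    return total
```

Nothing is written to local disk except the final number, but the whole file
still travels over the network to the client, so for large outputs the
GROUP ALL / COUNT approach inside Pig is likely the better fit. Like `wc -l`,
this counts newline characters, so a last record without a trailing newline
is not counted.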

Thanks



On Mon, May 13, 2013 at 11:36 AM, Mohammad Tariq <do...@gmail.com> wrote:

> If it is just counting the number of records in a file, then how about
> a short three-liner:
> LOGS = LOAD 'log';
> LOGS_GROUP = GROUP LOGS ALL;
> LOG_COUNT = FOREACH LOGS_GROUP GENERATE COUNT(LOGS);
>
> It did the trick for me.
>
> Warm Regards,
> Tariq
> cloudfront.blogspot.com
>
>
> On Mon, May 13, 2013 at 11:57 PM, Shahab Yunus <sh...@gmail.com> wrote:
>
>> Not terribly efficient, but off the top of my head: GROUP ALL and then do a
>> COUNT (or COUNT(*)). You can implement a follow-up script or add this to
>> the existing script once the file has been generated.
>>
>> Regards,
>> Shahab
>>
>>
>> On Mon, May 13, 2013 at 2:16 PM, Mix Nin <pi...@gmail.com> wrote:
>>
>>> Ok, let me modify my requirement. I should have specified this at the
>>> beginning.
>>>
>>> I need to get the count of records in an HDFS file created by a Pig script
>>> and then store the count in a text file. This should be done automatically
>>> on a daily basis without manual intervention.
>>>
>>>
>>> On Mon, May 13, 2013 at 11:13 AM, Rahul Bhattacharjee <
>>> rahul.rec.dgp@gmail.com> wrote:
>>>
>>>> How about the second approach: get the application/job ID that Pig
>>>> creates and submits to the cluster, and then find the job output counter
>>>> for that job from the JobTracker (JT).
>>>>
>>>> Thanks,
>>>> Rahul
>>>>
>>>>
>>>> On Mon, May 13, 2013 at 11:37 PM, Mix Nin <pi...@gmail.com> wrote:
>>>>
>>>>> It is a text file.
>>>>>
>>>>> If we want to use wc, we need to copy the file from HDFS and then run wc,
>>>>> and this may take time. Is there a way to do it without copying the file
>>>>> from HDFS to a local directory?
>>>>>
>>>>> Thanks
>>>>>
>>>>>
>>>>> On Mon, May 13, 2013 at 11:04 AM, Rahul Bhattacharjee <
>>>>> rahul.rec.dgp@gmail.com> wrote:
>>>>>
>>>>>> A few pointers.
>>>>>>
>>>>>> What kind of files are we talking about? For text you can use wc; for
>>>>>> Avro data files you can use avro-tools.
>>>>>>
>>>>>> Or find the job that Pig generates and get the counters for that job
>>>>>> from the JT of your Hadoop cluster.
>>>>>>
>>>>>> Thanks,
>>>>>>  Rahul
>>>>>>
>>>>>>
>>>>>> On Mon, May 13, 2013 at 11:21 PM, Mix Nin <pi...@gmail.com> wrote:
>>>>>>
>>>>>>> Hello,
>>>>>>>
>>>>>>> What is the best way to get the count of records in an HDFS file
>>>>>>> generated by a Pig script?
>>>>>>>
>>>>>>> Thanks
>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>

Re: Number of records in an HDFS file

Posted by Mix Nin <pi...@gmail.com>.

Hi,

The final count file should reside in local directory, but not in HDFS
directory. The above scripts will store text file in HDFS directory.
The count file would need to be sent to other team who do not work on HDFS.

Thanks



On Mon, May 13, 2013 at 11:36 AM, Mohammad Tariq <do...@gmail.com> wrote:

> If it is just counting the no. of records in a file then how about having
> a short 3 liner :
> LOGS= LOAD 'log';
> LOGS_GROUP= GROUP LOGS ALL;
> LOG_COUNT = FOREACH LOGS_GROUP GENERATE COUNT(LOGS);
>
> It did the trick for me.
>
> Warm Regards,
> Tariq
> cloudfront.blogspot.com
>
>
> On Mon, May 13, 2013 at 11:57 PM, Shahab Yunus <sh...@gmail.com>wrote:
>
>> Not terribly efficient but at the top of my head: GROUP ALL and then do a
>> COUNT (or COUNT (*). You can implement a follow-up script or add this in
>> the existing script once the file has been generated.
>>
>> Regards,
>> Shahab
>>
>>
>> On Mon, May 13, 2013 at 2:16 PM, Mix Nin <pi...@gmail.com> wrote:
>>
>>> Ok, let re modify my requirement. I should have specified in the
>>> beginning itself.
>>>
>>> I need to get count of records in an HDFS file created by a PIG script
>>> and the store the count in a text file. This should be done automatically
>>> on a daily basis without manual intervention
>>>
>>>
>>> On Mon, May 13, 2013 at 11:13 AM, Rahul Bhattacharjee <
>>> rahul.rec.dgp@gmail.com> wrote:
>>>
>>>> How about the second approach , get the application/job id which the
>>>> pig creates and submits to cluster and then find the job output counter for
>>>> that job from the JT.
>>>>
>>>> Thanks,
>>>> Rahul
>>>>
>>>>
>>>> On Mon, May 13, 2013 at 11:37 PM, Mix Nin <pi...@gmail.com> wrote:
>>>>
>>>>> It is a text file.
>>>>>
>>>>> If we want to use wc, we need to copy file from HDFS and then use wc,
>>>>> and this may take time. Is there a way without copying file from HDFS to
>>>>> local directory?
>>>>>
>>>>> Thanks
>>>>>
>>>>>
>>>>> On Mon, May 13, 2013 at 11:04 AM, Rahul Bhattacharjee <
>>>>> rahul.rec.dgp@gmail.com> wrote:
>>>>>
>>>>>> few pointers.
>>>>>>
>>>>>> what kind of files are we talking about. for text you can use wc ,
>>>>>> for avro data files you can use avro-tools.
>>>>>>
>>>>>> or get the job that pig is generating , get the counters for that job
>>>>>> from the jt of your hadoop cluster.
>>>>>>
>>>>>> Thanks,
>>>>>>  Rahul
>>>>>>
>>>>>>
>>>>>> On Mon, May 13, 2013 at 11:21 PM, Mix Nin <pi...@gmail.com> wrote:
>>>>>>
>>>>>>> Hello,
>>>>>>>
>>>>>>> What is the bets way to get the count of records in an HDFS file
>>>>>>> generated by a PIG script.
>>>>>>>
>>>>>>> Thanks
>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>

Re: Number of records in an HDFS file

Posted by Mohammad Tariq <do...@gmail.com>.

If it is just a matter of counting the number of records in a file, how
about a short three-liner:
LOGS = LOAD 'log';
LOGS_GROUP = GROUP LOGS ALL;
LOG_COUNT = FOREACH LOGS_GROUP GENERATE COUNT(LOGS);

It did the trick for me.
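
Note that the three statements above only define LOG_COUNT. Pig does not
run anything until the relation is dumped or stored. A minimal sketch of
the complete script follows; the output directory name 'log_count' is a
hypothetical choice.

```pig
LOGS       = LOAD 'log';
LOGS_GROUP = GROUP LOGS ALL;
LOG_COUNT  = FOREACH LOGS_GROUP GENERATE COUNT(LOGS);

-- Print the single-row result to the console while testing.
DUMP LOG_COUNT;

-- For an unattended run, write it to HDFS instead of dumping it
-- ('log_count' is a hypothetical output directory that must not exist yet).
-- STORE LOG_COUNT INTO 'log_count';
```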

Warm Regards,
Tariq
cloudfront.blogspot.com


On Mon, May 13, 2013 at 11:57 PM, Shahab Yunus <sh...@gmail.com> wrote:

> Not terribly efficient but at the top of my head: GROUP ALL and then do a
> COUNT (or COUNT (*). You can implement a follow-up script or add this in
> the existing script once the file has been generated.
>
> Regards,
> Shahab
>
>
> On Mon, May 13, 2013 at 2:16 PM, Mix Nin <pi...@gmail.com> wrote:
>
>> Ok, let re modify my requirement. I should have specified in the
>> beginning itself.
>>
>> I need to get count of records in an HDFS file created by a PIG script
>> and the store the count in a text file. This should be done automatically
>> on a daily basis without manual intervention
>>
>>
>> On Mon, May 13, 2013 at 11:13 AM, Rahul Bhattacharjee <
>> rahul.rec.dgp@gmail.com> wrote:
>>
>>> How about the second approach , get the application/job id which the pig
>>> creates and submits to cluster and then find the job output counter for
>>> that job from the JT.
>>>
>>> Thanks,
>>> Rahul
>>>
>>>
>>> On Mon, May 13, 2013 at 11:37 PM, Mix Nin <pi...@gmail.com> wrote:
>>>
>>>> It is a text file.
>>>>
>>>> If we want to use wc, we need to copy file from HDFS and then use wc,
>>>> and this may take time. Is there a way without copying file from HDFS to
>>>> local directory?
>>>>
>>>> Thanks
>>>>
>>>>
>>>> On Mon, May 13, 2013 at 11:04 AM, Rahul Bhattacharjee <
>>>> rahul.rec.dgp@gmail.com> wrote:
>>>>
>>>>> few pointers.
>>>>>
>>>>> what kind of files are we talking about. for text you can use wc , for
>>>>> avro data files you can use avro-tools.
>>>>>
>>>>> or get the job that pig is generating , get the counters for that job
>>>>> from the jt of your hadoop cluster.
>>>>>
>>>>> Thanks,
>>>>>  Rahul
>>>>>
>>>>>
>>>>> On Mon, May 13, 2013 at 11:21 PM, Mix Nin <pi...@gmail.com> wrote:
>>>>>
>>>>>> Hello,
>>>>>>
>>>>>> What is the bets way to get the count of records in an HDFS file
>>>>>> generated by a PIG script.
>>>>>>
>>>>>> Thanks
>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>

Re: Number of records in an HDFS file

Posted by Shahab Yunus <sh...@gmail.com>.

Not terribly efficient, but off the top of my head: GROUP ALL and then do a
COUNT (or COUNT(*)). You can implement this as a follow-up script, or add it
to the existing script so that it runs once the file has been generated.
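
If the count is added to the existing script, one option is to count the
relation that the script already stores rather than reloading the output
file. The sketch below assumes a relation named `result` and hypothetical
paths; the LOAD line is only a stand-in for the real pipeline.

```pig
-- Stand-in for the existing pipeline; 'result' is a hypothetical alias.
result = LOAD '/data/raw' AS (line:chararray);
STORE result INTO '/data/pig_output';

-- Appended to the same script: count the relation that was just stored.
result_all   = GROUP result ALL;
result_count = FOREACH result_all GENERATE COUNT_STAR(result);
STORE result_count INTO '/data/pig_output_count';
```

On the efficiency concern: COUNT and COUNT_STAR are algebraic functions, so
Pig can usually compute partial counts on the map side before the single
GROUP ALL reducer sums them.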

Regards,
Shahab


On Mon, May 13, 2013 at 2:16 PM, Mix Nin <pi...@gmail.com> wrote:

> Ok, let re modify my requirement. I should have specified in the beginning
> itself.
>
> I need to get count of records in an HDFS file created by a PIG script and
> the store the count in a text file. This should be done automatically on a
> daily basis without manual intervention
>
>
> On Mon, May 13, 2013 at 11:13 AM, Rahul Bhattacharjee <
> rahul.rec.dgp@gmail.com> wrote:
>
>> How about the second approach , get the application/job id which the pig
>> creates and submits to cluster and then find the job output counter for
>> that job from the JT.
>>
>> Thanks,
>> Rahul
>>
>>
>> On Mon, May 13, 2013 at 11:37 PM, Mix Nin <pi...@gmail.com> wrote:
>>
>>> It is a text file.
>>>
>>> If we want to use wc, we need to copy file from HDFS and then use wc,
>>> and this may take time. Is there a way without copying file from HDFS to
>>> local directory?
>>>
>>> Thanks
>>>
>>>
>>> On Mon, May 13, 2013 at 11:04 AM, Rahul Bhattacharjee <
>>> rahul.rec.dgp@gmail.com> wrote:
>>>
>>>> few pointers.
>>>>
>>>> what kind of files are we talking about. for text you can use wc , for
>>>> avro data files you can use avro-tools.
>>>>
>>>> or get the job that pig is generating , get the counters for that job
>>>> from the jt of your hadoop cluster.
>>>>
>>>> Thanks,
>>>>  Rahul
>>>>
>>>>
>>>> On Mon, May 13, 2013 at 11:21 PM, Mix Nin <pi...@gmail.com> wrote:
>>>>
>>>>> Hello,
>>>>>
>>>>> What is the bets way to get the count of records in an HDFS file
>>>>> generated by a PIG script.
>>>>>
>>>>> Thanks
>>>>>
>>>>>
>>>>
>>>
>>
>

Re: Number of records in an HDFS file

Posted by Mix Nin <pi...@gmail.com>.

OK, let me revise my requirement. I should have specified this at the
beginning.

I need to get the count of records in an HDFS file created by a Pig script
and then store the count in a text file. This should be done automatically
on a daily basis, without manual intervention.
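
For an unattended daily run, one possible arrangement is a parameterized
Pig script that a scheduler such as cron invokes with the run date. The
sketch below is an illustration under stated assumptions: the file name
count_records.pig, the parameter RUN_DATE, and both paths are hypothetical
and do not come from the thread.

```pig
-- count_records.pig (hypothetical file name)
-- Run as, for example:  pig -param RUN_DATE=2013-05-13 count_records.pig
-- The default below only applies when no -param value is given.
%default RUN_DATE '1970-01-01'

daily   = LOAD '/data/pig_output/$RUN_DATE';
grouped = GROUP daily ALL;
-- Emit the date next to the count so each day's result is self-describing.
total   = FOREACH grouped GENERATE '$RUN_DATE', COUNT_STAR(daily);
STORE total INTO '/data/pig_output_counts/$RUN_DATE';
```

Writing each day's count to its own dated directory keeps the STORE from
failing because the output location already exists.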


On Mon, May 13, 2013 at 11:13 AM, Rahul Bhattacharjee <
rahul.rec.dgp@gmail.com> wrote:

> How about the second approach , get the application/job id which the pig
> creates and submits to cluster and then find the job output counter for
> that job from the JT.
>
> Thanks,
> Rahul
>
>
> On Mon, May 13, 2013 at 11:37 PM, Mix Nin <pi...@gmail.com> wrote:
>
>> It is a text file.
>>
>> If we want to use wc, we need to copy file from HDFS and then use wc, and
>> this may take time. Is there a way without copying file from HDFS to local
>> directory?
>>
>> Thanks
>>
>>
>> On Mon, May 13, 2013 at 11:04 AM, Rahul Bhattacharjee <
>> rahul.rec.dgp@gmail.com> wrote:
>>
>>> few pointers.
>>>
>>> what kind of files are we talking about. for text you can use wc , for
>>> avro data files you can use avro-tools.
>>>
>>> or get the job that pig is generating , get the counters for that job
>>> from the jt of your hadoop cluster.
>>>
>>> Thanks,
>>>  Rahul
>>>
>>>
>>> On Mon, May 13, 2013 at 11:21 PM, Mix Nin <pi...@gmail.com> wrote:
>>>
>>>> Hello,
>>>>
>>>> What is the best way to get the count of records in an HDFS file
>>>> generated by a Pig script?
>>>>
>>>> Thanks
>>>>
>>>>
>>>
>>
>

Re: Number of records in an HDFS file

Posted by Rahul Bhattacharjee <ra...@gmail.com>.

How about the second approach: get the application/job ID that Pig
creates and submits to the cluster, and then find the job output counter for
that job from the JobTracker (JT).
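
A minimal sketch of reading such a counter from the command line, assuming a
Hadoop 1.x client where "hadoop job -counter <job-id> <group-name>
<counter-name>" is available. The counter group and counter name below are
assumptions (the Hadoop 1.x built-in task counters); a map-only job would
need MAP_OUTPUT_RECORDS instead, and newer Hadoop versions use a different
group name. The helper names are hypothetical.

```python
import subprocess

# Assumed names: Hadoop 1.x built-in task counter group and the counter for
# records written by reducers. Check them against your cluster's job page.
TASK_COUNTER_GROUP = "org.apache.hadoop.mapred.Task$Counter"
OUTPUT_COUNTER = "REDUCE_OUTPUT_RECORDS"


def counter_command(job_id, group=TASK_COUNTER_GROUP, counter=OUTPUT_COUNTER):
    """Build the argument list for `hadoop job -counter`."""
    return ["hadoop", "job", "-counter", job_id, group, counter]


def parse_counter(output):
    """Take the last non-empty line of the command output as the value.

    Assumes the counter value is printed last, after any warnings.
    """
    lines = [line.strip() for line in output.splitlines() if line.strip()]
    return int(lines[-1])


def job_output_records(job_id):
    """Ask the cluster for the output record counter of one job."""
    raw = subprocess.check_output(counter_command(job_id))
    return parse_counter(raw.decode("utf-8"))


# Hypothetical usage, with a made-up job ID:
#   print(job_output_records("job_201305131951_0001"))
```

If the Pig script compiles to several MapReduce jobs, only the job that
writes the final output carries the count of interest.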

Thanks,
Rahul


On Mon, May 13, 2013 at 11:37 PM, Mix Nin <pi...@gmail.com> wrote:

> It is a text file.
>
> If we want to use wc, we need to copy file from HDFS and then use wc, and
> this may take time. Is there a way without copying file from HDFS to local
> directory?
>
> Thanks
>
>
> On Mon, May 13, 2013 at 11:04 AM, Rahul Bhattacharjee <
> rahul.rec.dgp@gmail.com> wrote:
>
>> A few pointers.
>>
>> What kind of files are we talking about? For text you can use wc; for
>> Avro data files you can use avro-tools.
>>
>> Or find the job that Pig generates and get the counters for that job
>> from the JobTracker (JT) of your Hadoop cluster.
>>
>> Thanks,
>>  Rahul
>>
>>
>> On Mon, May 13, 2013 at 11:21 PM, Mix Nin <pi...@gmail.com> wrote:
>>
>>> Hello,
>>>
>>> What is the best way to get the count of records in an HDFS file
>>> generated by a Pig script?
>>>
>>> Thanks
>>>
>>>
>>
>

Re: Number of records in an HDFS file

Posted by Mix Nin <pi...@gmail.com>.

It is a text file.

To use wc, we would need to copy the file from HDFS first, and this may
take time. Is there a way to do it without copying the file from HDFS to a
local directory?
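
One possible way to count lines without a local copy, sketched in Python:
stream the file with "hadoop fs -cat" and count newlines as the data goes
by, which is the same idea as piping "hadoop fs -cat" into "wc -l". The
function names and the example path are assumptions for illustration, and
the sketch assumes one record per line.

```python
import subprocess


def count_lines(command):
    """Run `command` and count newline characters on its standard output.

    The output is read from a pipe in 1 MiB chunks, so nothing is written
    to the local disk.
    """
    proc = subprocess.Popen(command, stdout=subprocess.PIPE)
    total = 0
    for chunk in iter(lambda: proc.stdout.read(1 << 20), b""):
        total += chunk.count(b"\n")
    proc.stdout.close()
    if proc.wait() != 0:
        raise RuntimeError("command failed: %r" % (command,))
    return total


def hdfs_cat_command(path):
    """Build the `hadoop fs -cat` call that streams `path` to stdout."""
    return ["hadoop", "fs", "-cat", path]


# Hypothetical usage; the glob is meant to cover all part files in a Pig
# output directory:
#   print(count_lines(hdfs_cat_command("/user/me/output/part-*")))
```

The data still travels over the network to the machine running the script,
so for very large outputs a job counter may be cheaper to read.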

Thanks


On Mon, May 13, 2013 at 11:04 AM, Rahul Bhattacharjee <
rahul.rec.dgp@gmail.com> wrote:

> A few pointers.
>
> What kind of files are we talking about? For text you can use wc; for
> Avro data files you can use avro-tools.
>
> Or find the job that Pig generates and get the counters for that job from
> the JobTracker (JT) of your Hadoop cluster.
>
> Thanks,
>  Rahul
>
>
> On Mon, May 13, 2013 at 11:21 PM, Mix Nin <pi...@gmail.com> wrote:
>
>> Hello,
>>
>> What is the best way to get the count of records in an HDFS file
>> generated by a Pig script?
>>
>> Thanks
>>
>>
>

Re: Number of records in an HDFS file

Posted by Rahul Bhattacharjee <ra...@gmail.com>.

A few pointers.

What kind of files are we talking about? For text you can use wc; for Avro
data files you can use avro-tools.

Or find the job that Pig generates and get the counters for that job from
the JobTracker (JT) of your Hadoop cluster.

Thanks,
Rahul

On Mon, May 13, 2013 at 11:21 PM, Mix Nin <pi...@gmail.com> wrote:

> Hello,
>
> What is the best way to get the count of records in an HDFS file generated
> by a Pig script?
>
> Thanks
>
>
