Posted to mapreduce-user@hadoop.apache.org by Lucas Bernardi <lu...@gmail.com> on 2013/02/21 22:17:34 UTC

map reduce and sync

Hello there, I'm trying to use Hadoop MapReduce to process an open file. The
writing process writes a line to the file and syncs it to readers
(org.apache.hadoop.fs.FSDataOutputStream.sync()).

If I try to read the file from another process, it works fine, at least
using org.apache.hadoop.fs.FSDataInputStream.

hadoop fs -tail also works just fine.

But it looks like MapReduce doesn't read any data. I tried the word count
example and got the same result: it is as if the file were empty to the
MapReduce framework.

I'm using Hadoop 1.0.3 and Pig 0.10.0.

I need some help with this.

Thanks!

Lucas
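
A minimal sketch of the pattern described above, assuming Hadoop 1.x APIs; the class name and the path /tmp/open.log are placeholders, and the comment about the reported file length is a likely explanation rather than something confirmed in this thread:

import java.io.BufferedReader;
import java.io.InputStreamReader;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class SyncedFileDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path file = new Path("/tmp/open.log");   // placeholder path

        // Writer side: write one line, then sync() so readers can see it
        // while the file is still open.
        FSDataOutputStream out = fs.create(file, true);
        out.writeBytes("a line of data\n");
        out.sync();

        // Reader side: a direct FSDataInputStream read sees the synced bytes,
        // just like "hadoop fs -tail" does.
        FSDataInputStream in = fs.open(file);
        BufferedReader reader = new BufferedReader(new InputStreamReader(in));
        System.out.println("read back: " + reader.readLine());
        reader.close();

        // FileInputFormat computes splits from the length reported by the
        // NameNode, which for a still-open file typically lags behind the
        // synced data; that would explain why MapReduce sees an empty file.
        System.out.println("reported length: " + fs.getFileStatus(file).getLen());

        out.close();
    }
}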

Re: mapr videos question

Posted by Marco Shaw <ma...@gmail.com>.
Sorry. Can you provide some specific links?

Marco

On 2013-02-23, at 5:37 AM, Sai Sai <sa...@yahoo.in> wrote:

> 
> Hi
> Could someone please verify whether the MapR videos are meant for learning Hadoop or for learning MapR. If we are interested in learning Hadoop only, will they help? As a starter I would like to just understand Hadoop, and not MapR yet.
> Just wondering if others can share their thoughts and any relevant links.
> Thanks,
> Sai
> 

Re: Increase the number of mappers in PM mode

Posted by Harsh J <ha...@cloudera.com>.
In MR2, to have more mappers executed per NM, your memory request for each
map should be set such that the NM's configured memory allowance can fit
multiple requests. For example, if my NM memory is set to 16 GB, assuming
just 1 NM in the cluster, and I submit a job with mapreduce.map.memory.mb and
yarn.app.mapreduce.am.resource.mb both set to 1 GB, then the NM can execute
15 maps in parallel, consuming up to 1 GB of memory each (while using the
remaining 1 GB for the AM to coordinate those executions).
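
A minimal sketch of the request side of that example (the class and job names below are placeholders, and yarn.nodemanager.resource.memory-mb=16384 is assumed to be set cluster-side):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class MemoryRequestExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Request 1 GB per map container and 1 GB for the MR AM, so a 16 GB NM
        // can run roughly 15 maps in parallel alongside the AM.
        conf.setInt("mapreduce.map.memory.mb", 1024);
        conf.setInt("yarn.app.mapreduce.am.resource.mb", 1024);
        Job job = Job.getInstance(conf, "memory-request-example");
        System.out.println("map memory request (MB) = "
                + job.getConfiguration().getInt("mapreduce.map.memory.mb", -1));
    }
}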


On Sat, Mar 16, 2013 at 10:16 AM, yypvsxf19870706 <yypvsxf19870706@gmail.com> wrote:

> Hi,
>    I think I have got it. Thank you.
>
> Sent from my iPhone
>
> On 2013-3-15, at 18:32, Zheyi RONG <ro...@gmail.com> wrote:
>
> Indeed you cannot explicitly set the number of mappers, but still you can
> gain some control over it, by setting mapred.max.split.size, or
> mapred.min.split.size.
>
> For example, if you have a file of 10GB (10737418240 B), you would like 10
> mappers, then each mapper has to deal with 1GB data.
> According to "splitsize = max(minimumSize, min(maximumSize, blockSize))",
> you can set mapred.min.split.size=1073741824 (1GB), i.e.
> $hadoop jar -Dmapred.min.split.size=1073741824 yourjar yourargs
>
> It is well explained in thread:
> http://stackoverflow.com/questions/9678180/change-file-split-size-in-hadoop
> .
>
> Regards,
> Zheyi.
>
> On Fri, Mar 15, 2013 at 8:49 AM, YouPeng Yang <yy...@gmail.com> wrote:
>
>> s
>
>
>
>


-- 
Harsh J

Re: Increase the number of mappers in PM mode

Posted by yypvsxf19870706 <yy...@gmail.com>.
Hi,
   I think I have got it. Thank you.

Sent from my iPhone

On 2013-3-15, at 18:32, Zheyi RONG <ro...@gmail.com> wrote:

> Indeed you cannot explicitly set the number of mappers, but still you can gain some control over it, by setting mapred.max.split.size, or mapred.min.split.size.
> 
> For example, if you have a file of 10GB (10737418240 B), you would like 10 mappers, then each mapper has to deal with 1GB data.
> According to "splitsize = max(minimumSize, min(maximumSize, blockSize))", you can set mapred.min.split.size=1073741824 (1GB), i.e.    
> $hadoop jar -Dmapred.min.split.size=1073741824 yourjar yourargs
> 
> It is well explained in thread: http://stackoverflow.com/questions/9678180/change-file-split-size-in-hadoop.
> 
> Regards,
> Zheyi.
> 
> On Fri, Mar 15, 2013 at 8:49 AM, YouPeng Yang <yy...@gmail.com> wrote:
>> s
> 
> 

Re: Increase the number of mappers in PM mode

Posted by Zheyi RONG <ro...@gmail.com>.
Indeed, you cannot explicitly set the number of mappers, but you can still
gain some control over it by setting mapred.max.split.size or
mapred.min.split.size.

For example, if you have a 10 GB file (10737418240 B) and you would like 10
mappers, then each mapper has to deal with 1 GB of data.
According to "splitsize = max(minimumSize, min(maximumSize, blockSize))",
you can set mapred.min.split.size=1073741824 (1 GB), i.e.
$hadoop jar -Dmapred.min.split.size=1073741824 yourjar yourargs
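
A rough driver-code equivalent of that command line (a sketch only; the class name and input path are placeholders, and it uses the new-API FileInputFormat helper):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

public class MinSplitSizeExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "min-split-size-example");
        // With a 10 GB input, a 1 GB minimum split size caps the job at
        // about 10 map tasks, per the splitsize formula quoted above.
        FileInputFormat.setMinInputSplitSize(job, 1073741824L);
        FileInputFormat.addInputPath(job, new Path("/user/hadoop/input"));  // placeholder
        System.out.println("min split size = " + job.getConfiguration()
                .getLong("mapreduce.input.fileinputformat.split.minsize", 0L));
    }
}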

It is well explained in thread:
http://stackoverflow.com/questions/9678180/change-file-split-size-in-hadoop.

Regards,
Zheyi.

On Fri, Mar 15, 2013 at 8:49 AM, YouPeng Yang <yy...@gmail.com> wrote:

> s

Re: Increase the number of mappers in PM mode

Posted by YouPeng Yang <yy...@gmail.com>.
Hi,
  I found these interview questions by doing some googling:

Q29. How can you set an arbitrary number of mappers to be created for a job
in Hadoop?

This is a trick question. You cannot set it.

 >> The above test proves you cannot set an arbitrary number of mappers.

Q30. How can you set an arbitrary number of reducers to be created for a job
in Hadoop?

You can either do it programmatically, by using the method setNumReduceTasks
in the JobConf class, or set it up as a configuration setting.
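
A minimal sketch of both routes using the new org.apache.hadoop.mapreduce API (class and job names are placeholders; mapper/reducer classes and paths are omitted, so this only shows where the setting goes):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class ReducerCountExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.setInt("mapreduce.job.reduces", 2);       // route 1: configuration property
        Job job = Job.getInstance(conf, "wordcount");
        job.setNumReduceTasks(2);                      // route 2: programmatic (JobConf has the same method in the old API)
        System.out.println("requested reduces = " + job.getNumReduceTasks());
    }
}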


 I tested Q30, and it seems right.

 My logs:

[hadoop@Hadoop01 bin]$./hadoop  jar
 ../share/hadoop/mapreduce/hadoop-mapreduce-examples-2.0.0-cdh4.1.2.jar
wordcount -D mapreduce.job.reduces=2  -D mapreduce.jobtracker.address=
10.167.14.221:50030 /user/hadoop/yyp/input /user/hadoop/yyp/output3

===================================

Job Counters

Launched map tasks=1

Launched reduce tasks=2 -----> it actually changed.

Rack-local map tasks=1

Total time spent by all maps in occupied slots (ms)=60356

Total time spent by all reduces in occupied slots (ms)=135224

============================




regards





2013/3/14 YouPeng Yang <yy...@gmail.com>

> Hi
>   the docs only have a property
> : mapreduce.input.fileinputformat.split.minsize (default value is 0)
>   does it matter?
>
>
>
> 2013/3/14 Zheyi RONG <ro...@gmail.com>
>
>> Have you considered change mapred.max.split.size ?
>> As in:
>> http://stackoverflow.com/questions/9678180/change-file-split-size-in-hadoop
>>
>> Zheyi
>>
>>
>> On Thu, Mar 14, 2013 at 3:27 PM, YouPeng Yang <yy...@gmail.com> wrote:
>>
>>> Hi
>>>
>>>
>>>   I have done some tests in my  Pseudo Mode(CDH4.1.2)with MV2 yarn,and
>>>  :
>>>   According to the doc:
>>>   *mapreduce.jobtracker.address :*The host and port that the MapReduce
>>> job tracker runs at. If "local", then jobs are run in-process as a single
>>> map and reduce task.
>>>   *mapreduce.job.maps (default value is 2)* :The default number of map
>>> tasks per job. Ignored when mapreduce.jobtracker.address is "local".
>>>
>>>   I changed the mapreduce.jobtracker.address = Hadoop:50031.
>>>
>>>   And then run the wordcount examples:
>>>   hadoop jar  hadoop-mapreduce-examples-2.0.0-cdh4.1.2.jar wordcount
>>> input output
>>>
>>>   the output logs are as follows:
>>>         ....
>>>    Job Counters
>>> Launched map tasks=1
>>>  Launched reduce tasks=1
>>> Data-local map tasks=1
>>>  Total time spent by all maps in occupied slots (ms)=60336
>>> Total time spent by all reduces in occupied slots (ms)=63264
>>>      Map-Reduce Framework
>>> Map input records=5
>>>  Map output records=7
>>> Map output bytes=56
>>> Map output materialized bytes=76
>>>         ....
>>>
>>>  i seem to does not work.
>>>
>>>  I thought maybe my input file is small-just 5 records . is it right?
>>>
>>> regards
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>> 2013/3/14 Sai Sai <sa...@yahoo.in>
>>>
>>>>
>>>>
>>>>  In Pseudo Mode where is the setting to increase the number of mappers
>>>> or is this not possible.
>>>> Thanks
>>>> Sai
>>>>
>>>
>>>
>>
>

Re: Increase the number of mappers in PM mode

Posted by YouPeng Yang <yy...@gmail.com>.
Hi
  The docs only have this property:
mapreduce.input.fileinputformat.split.minsize (default value is 0).
  Does it matter?



2013/3/14 Zheyi RONG <ro...@gmail.com>

> Have you considered change mapred.max.split.size ?
> As in:
> http://stackoverflow.com/questions/9678180/change-file-split-size-in-hadoop
>
> Zheyi
>
>
> On Thu, Mar 14, 2013 at 3:27 PM, YouPeng Yang <yy...@gmail.com> wrote:
>
>> Hi
>>
>>
>>   I have done some tests in my  Pseudo Mode(CDH4.1.2)with MV2 yarn,and
>>  :
>>   According to the doc:
>>   *mapreduce.jobtracker.address :*The host and port that the MapReduce
>> job tracker runs at. If "local", then jobs are run in-process as a single
>> map and reduce task.
>>   *mapreduce.job.maps (default value is 2)* :The default number of map
>> tasks per job. Ignored when mapreduce.jobtracker.address is "local".
>>
>>   I changed the mapreduce.jobtracker.address = Hadoop:50031.
>>
>>   And then run the wordcount examples:
>>   hadoop jar  hadoop-mapreduce-examples-2.0.0-cdh4.1.2.jar wordcount
>> input output
>>
>>   the output logs are as follows:
>>         ....
>>    Job Counters
>> Launched map tasks=1
>>  Launched reduce tasks=1
>> Data-local map tasks=1
>>  Total time spent by all maps in occupied slots (ms)=60336
>> Total time spent by all reduces in occupied slots (ms)=63264
>>      Map-Reduce Framework
>> Map input records=5
>>  Map output records=7
>> Map output bytes=56
>> Map output materialized bytes=76
>>         ....
>>
>>  i seem to does not work.
>>
>>  I thought maybe my input file is small-just 5 records . is it right?
>>
>> regards
>>
>>
>>
>>
>>
>>
>>
>> 2013/3/14 Sai Sai <sa...@yahoo.in>
>>
>>>
>>>
>>>  In Pseudo Mode where is the setting to increase the number of mappers
>>> or is this not possible.
>>> Thanks
>>> Sai
>>>
>>
>>
>

Re: Increase the number of mappers in PM mode

Posted by Zheyi RONG <ro...@gmail.com>.
Have you considered changing mapred.max.split.size?
As in:
http://stackoverflow.com/questions/9678180/change-file-split-size-in-hadoop

Zheyi

On Thu, Mar 14, 2013 at 3:27 PM, YouPeng Yang <yy...@gmail.com> wrote:

> Hi
>
>
>   I have done some tests in my  Pseudo Mode(CDH4.1.2)with MV2 yarn,and   :
>   According to the doc:
>   *mapreduce.jobtracker.address :*The host and port that the MapReduce
> job tracker runs at. If "local", then jobs are run in-process as a single
> map and reduce task.
>   *mapreduce.job.maps (default value is 2)* :The default number of map
> tasks per job. Ignored when mapreduce.jobtracker.address is "local".
>
>   I changed the mapreduce.jobtracker.address = Hadoop:50031.
>
>   And then run the wordcount examples:
>   hadoop jar  hadoop-mapreduce-examples-2.0.0-cdh4.1.2.jar wordcount
> input output
>
>   the output logs are as follows:
>         ....
>    Job Counters
> Launched map tasks=1
>  Launched reduce tasks=1
> Data-local map tasks=1
>  Total time spent by all maps in occupied slots (ms)=60336
> Total time spent by all reduces in occupied slots (ms)=63264
>      Map-Reduce Framework
> Map input records=5
>  Map output records=7
> Map output bytes=56
> Map output materialized bytes=76
>         ....
>
>  i seem to does not work.
>
>  I thought maybe my input file is small-just 5 records . is it right?
>
> regards
>
>
>
>
>
>
>
> 2013/3/14 Sai Sai <sa...@yahoo.in>
>
>>
>>
>>  In Pseudo Mode where is the setting to increase the number of mappers or
>> is this not possible.
>> Thanks
>> Sai
>>
>
>

Re: Increase the number of mappers in PM mode

Posted by YouPeng Yang <yy...@gmail.com>.
Hi


  I have done some tests in my Pseudo Mode setup (CDH4.1.2) with MRv2/YARN:
  According to the doc:
  *mapreduce.jobtracker.address:* The host and port that the MapReduce job
tracker runs at. If "local", then jobs are run in-process as a single map
and reduce task.
  *mapreduce.job.maps (default value is 2):* The default number of map
tasks per job. Ignored when mapreduce.jobtracker.address is "local".

  I changed mapreduce.jobtracker.address to Hadoop:50031.

  And then ran the wordcount example:
  hadoop jar  hadoop-mapreduce-examples-2.0.0-cdh4.1.2.jar wordcount
input output

  the output logs are as follows:
        ....
   Job Counters
Launched map tasks=1
Launched reduce tasks=1
Data-local map tasks=1
Total time spent by all maps in occupied slots (ms)=60336
Total time spent by all reduces in occupied slots (ms)=63264
     Map-Reduce Framework
Map input records=5
Map output records=7
Map output bytes=56
Map output materialized bytes=76
        ....

 It seems it does not work.

 I thought maybe it is because my input file is small - just 5 records. Is that right?

regards







2013/3/14 Sai Sai <sa...@yahoo.in>

>
>
> In Pseudo Mode where is the setting to increase the number of mappers or
> is this not possible.
> Thanks
> Sai
>

Re: Increase the number of mappers in PM mode

Posted by Sai Sai <sa...@yahoo.in>.


In Pseudo Mode, where is the setting to increase the number of mappers, or is this not possible?
Thanks
Sai

Re: Block vs FileSplit vs record vs line

Posted by Sai Sai <sa...@yahoo.in>.
Just wondering if this is the right way to understand this:
A large file is split into multiple blocks, each block is split into multiple file splits, each file split has multiple records, and each record has multiple lines. Each line is processed by one instance of a mapper.
Any help is appreciated.
Thanks
Sai
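
For reference, a minimal sketch (not from this thread) of the contract the standard TextInputFormat gives a mapper: splits are logical ranges of the file (typically one per block), each split is turned into one record per line, and map() is invoked once per record, so a single mapper task works through every line of its split:

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class LinePerRecordMapper extends Mapper<LongWritable, Text, Text, LongWritable> {
    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        // Called once per line of the split: key = byte offset, value = the line.
        context.write(line, offset);
    }
}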

RE: Unknown processes unable to terminate

Posted by Leo Leung <ll...@ddn.com>.
Hi Sai,

   The RunJar process is normally the result of someone or something running “hadoop jar <something>”
   (i.e:  org.apache.hadoop.util.RunJar  <something>)

   You probably want to find out who/what is running it, with more detailed info, via ps -ef | grep RunJar
   <stop|start>-all.sh deals with HDFS/MapReduce-specific processes only, so it will not stop any other Java process reported by jps.

Cheers.


From: Sai Sai [mailto:saigraph@yahoo.in]
Sent: Monday, March 04, 2013 1:42 AM
To: user@hadoop.apache.org
Subject: Re: Unknown processes unable to terminate

I have a list of processes given below; I am trying to kill process 13082 using:

kill 13082

It's not terminating RunJar.

I have done a stop-all.sh hoping it would stop all the processes, but it only stopped the Hadoop-related processes.
I am just wondering if it is necessary to stop all other processes before starting the Hadoop processes, and how to stop these other processes.

Here is the list of processes which are appearing:


30969 FileSystemCat
30877 FileSystemCat
5647 StreamCompressor
32200 DataNode
25015 Jps
2227 URLCat
5563 StreamCompressor
5398 StreamCompressor
13082 RunJar
32578 JobTracker
7215
385 TaskTracker
31884 NameNode
32489 SecondaryNameNode

Thanks
Sai

Re: Unknown processes unable to terminate

Posted by Jean-Marc Spaggiari <je...@spaggiari.org>.
Hi Sai,

Are you fine with killing all those processes on this machine? If you need
ALL those processes to be killed, and if they are all Java processes,
you can use killall -9 java. That will kill ALL the Java processes under
this user.

JM

2013/3/4 shashwat shriparv <dw...@gmail.com>:
> You can use kill -9 13082.
>
> Is there an Eclipse or NetBeans project running? That may be the source of this process.
>
>
>
> ∞
>
> Shashwat Shriparv
>
>
>
> On Mon, Mar 4, 2013 at 3:12 PM, Sai Sai <sa...@yahoo.in> wrote:
>>
>> I have a list of following processes given below, i am trying to kill the
>> process 13082 using:
>>
>> kill 13082
>>
>> Its not terminating RunJar.
>>
>> I have done a stop-all.sh hoping it would stop all the processes but only
>> stopped the hadoop related processes.
>> I am just wondering if it is necessary to stop all other processes before
>> starting the hadoop process and how to stop these other processes.
>>
>> Here is the list of processes which r appearing:
>>
>>
>> 30969 FileSystemCat
>> 30877 FileSystemCat
>> 5647 StreamCompressor
>> 32200 DataNode
>> 25015 Jps
>> 2227 URLCat
>> 5563 StreamCompressor
>> 5398 StreamCompressor
>> 13082 RunJar
>> 32578 JobTracker
>> 7215
>> 385 TaskTracker
>> 31884 NameNode
>> 32489 SecondaryNameNode
>>
>> Thanks
>> Sai
>
>

Re: Unknown processes unable to terminate

Posted by shashwat shriparv <dw...@gmail.com>.
You can use kill -9 13082

Is there an Eclipse or NetBeans project running? That may be what this process is.



∞
Shashwat Shriparv



On Mon, Mar 4, 2013 at 3:12 PM, Sai Sai <sa...@yahoo.in> wrote:

> I have a list of following processes given below, i am trying to kill the
> process 13082 using:
>
> kill 13082
>
> Its not terminating RunJar.
>
> I have done a stop-all.sh hoping it would stop all the processes but only
> stopped the hadoop related processes.
> I am just wondering if it is necessary to stop all other processes before
> starting the hadoop process and how to stop these other processes.
>
> Here is the list of processes which r appearing:
>
>
> 30969 FileSystemCat
> 30877 FileSystemCat
> 5647 StreamCompressor
> 32200 DataNode
> 25015 Jps
> 2227 URLCat
> 5563 StreamCompressor
> 5398 StreamCompressor
> 13082 RunJar
> 32578 JobTracker
> 7215
> 385 TaskTracker
> 31884 NameNode
> 32489 SecondaryNameNode
>
> Thanks
> Sai
>

Re: Block vs FileSplit vs record vs line

Posted by Sai Sai <sa...@yahoo.in>.
Just wondering if this is the right way to understand this:
A large file is split into multiple blocks, each block is split into multiple file splits, each file split has multiple records, and each record has multiple lines. Each line is processed by one instance of the mapper.
Any help is appreciated.
Thanks
Sai
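
For reference, the usual relationship with the default TextInputFormat is: an InputFormat divides each input file into InputSplits (for file inputs, normally one split per HDFS block), a RecordReader turns each split into key/value records, and each record is a single line; map() is then called once per record. A minimal sketch of a mapper that sees one line per call, assuming the org.apache.hadoop.mapreduce API; the class name LineLengthMapper is made up for the example.

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Hypothetical mapper: with TextInputFormat, each call to map() receives one
// line of the split as 'value'; the key is the byte offset of that line.
public class LineLengthMapper
        extends Mapper<LongWritable, Text, Text, IntWritable> {

    private final Text word = new Text("lineLength");

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        // One invocation per record; a record here is exactly one line.
        context.write(word, new IntWritable(line.getLength()));
    }
}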

Re: Unknown processes unable to terminate

Posted by Sai Sai <sa...@yahoo.in>.
I have a list of the following processes given below. I am trying to kill process 13082 using:

kill 13082

It's not terminating RunJar.

I have done a stop-all.sh hoping it would stop all the processes, but it only stopped the Hadoop-related processes.
I am just wondering if it is necessary to stop all the other processes before starting the Hadoop processes, and how to stop these other processes.

Here is the list of processes which are appearing:



30969 FileSystemCat
30877 FileSystemCat
5647 StreamCompressor
32200 DataNode
25015 Jps
2227 URLCat
5563 StreamCompressor
5398 StreamCompressor
13082 RunJar
32578 JobTracker
7215 
385 TaskTracker
31884 NameNode
32489 SecondaryNameNode


Thanks
Sai

Re: Trying to copy file to Hadoop file system from a program

Posted by Nitin Pawar <ni...@gmail.com>.
Sai,

Just use 127.0.0.1 in all the URIs you have. It is less complicated and easily
replaceable.


On Sun, Feb 24, 2013 at 5:37 PM, sudhakara st <su...@gmail.com>wrote:

> Hi,
>
> Execute ifcongf find the IP of system
> and add line in /etc/host
> (your ip) ubuntu
>
> use URI string  : public static String fsURI = "hdfs://ubuntu:9000";
>
>
> On Sun, Feb 24, 2013 at 5:23 PM, Sai Sai <sa...@yahoo.in> wrote:
>
>> Many Thanks Nitin for your quick reply.
>>
>> Heres what i have in my hosts file and i am running in VM i m assuming it
>> is the pseudo mode:
>>
>> *********************
>> 127.0.0.1    localhost.localdomain    localhost
>> #::1    ubuntu    localhost6.localdomain6    localhost6
>> #127.0.1.1    ubuntu
>> 127.0.0.1   ubuntu
>>
>> # The following lines are desirable for IPv6 capable hosts
>> ::1     localhost ip6-localhost ip6-loopback
>> fe00::0 ip6-localnet
>> ff00::0 ip6-mcastprefix
>> ff02::1 ip6-allnodes
>> ff02::2 ip6-allrouters
>> ff02::3 ip6-allhosts
>> *********************
>> In my masters i have:
>> ubuntu
>> In my slaves i have:
>> localhost
>> ***********************
>> My question is in my variable below:
>> public static String fsURI = "hdfs://master:9000";
>>
>> what would be the right value so i can connect to Hadoop successfully.
>> Please let me know if you need more info.
>> Thanks
>> Sai
>>
>>
>>
>>
>>
>>    ------------------------------
>> *From:* Nitin Pawar <ni...@gmail.com>
>> *To:* user@hadoop.apache.org; Sai Sai <sa...@yahoo.in>
>> *Sent:* Sunday, 24 February 2013 3:42 AM
>> *Subject:* Re: Trying to copy file to Hadoop file system from a program
>>
>> if you want to use master as your hostname then make such entry in your
>> /etc/hosts file
>>
>> or change the hdfs://master to hdfs://localhost
>>
>>
>> On Sun, Feb 24, 2013 at 5:10 PM, Sai Sai <sa...@yahoo.in> wrote:
>>
>>
>> Greetings,
>>
>> Below is the program i am trying to run and getting this exception:
>>  ***************************************
>> Test Start.....
>> java.net.UnknownHostException: unknown host: master
>>     at org.apache.hadoop.ipc.Client$Connection.<init>(Client.java:214)
>>     at org.apache.hadoop.ipc.Client.getConnection(Client.java:1196)
>>     at org.apache.hadoop.ipc.Client.call(Client.java:1050)
>>     at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:225)
>>     at $Proxy1.getProtocolVersion(Unknown Source)
>>     at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:396)
>>     at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:379)
>>     at
>> org.apache.hadoop.hdfs.DFSClient.createRPCNamenode(DFSClient.java:119)
>>     at org.apache.hadoop.hdfs.DFSClient.<init>(DFSClient.java:238)
>>     at org.apache.hadoop.hdfs.DFSClient.<init>(DFSClient.java:203)
>>     at
>> org.apache.hadoop.hdfs.DistributedFileSystem.initialize(DistributedFileSystem.java:89)
>>     at kelly.hadoop.hive.test.HadoopTest.main(HadoopTest.java:54)
>>
>>
>> ********************
>>
>> public class HdpTest {
>>
>>     public static String fsURI = "hdfs://master:9000";
>>
>>
>>     public static void copyFileToDFS(FileSystem fs, String srcFile,
>>             String dstFile) throws IOException {
>>         try {
>>             System.out.println("Initialize copy...");
>>             URI suri = new URI(srcFile);
>>             URI duri = new URI(fsURI + "/" + dstFile);
>>             Path dst = new Path(duri.toString());
>>             Path src = new Path(suri.toString());
>>             System.out.println("Start copy...");
>>             fs.copyFromLocalFile(src, dst);
>>             System.out.println("End copy...");
>>         } catch (Exception e) {
>>             e.printStackTrace();
>>         }
>>     }
>>
>>     public static void main(String[] args) {
>>         try {
>>             System.out.println("Test Start.....");
>>             Configuration conf = new Configuration();
>>             DistributedFileSystem fs = new DistributedFileSystem();
>>             URI duri = new URI(fsURI);
>>             fs.initialize(duri, conf); // Here is the xception occuring
>>             long start = 0, end = 0;
>>             start = System.nanoTime();
>>             //writing data from local to HDFS
>>             copyFileToDFS(fs, "/home/kosmos/Work/input/wordpair.txt",
>>                     "/input/raptor/trade1.txt");
>>             //Writing data from HDFS to Local
>> //             copyFileFromDFS(fs, "/input/raptor/trade1.txt",
>> "/home/kosmos/Work/input/wordpair1.txt");
>>             end = System.nanoTime();
>>             System.out.println("Total Execution times: " + (end - start));
>>             fs.close();
>>         } catch (Throwable t) {
>>             t.printStackTrace();
>>         }
>>     }
>> ******************************
>> I am trying to access in FireFox this url:
>>  hdfs://master:9000
>>
>>  Get an error msg FF does not know how to display this message.
>>
>>  I can successfully access my admin page:
>>
>>  http://localhost:50070/dfshealth.jsp
>>
>> Just wondering if anyone can give me any suggestions, your help will be
>> really appreciated.
>> Thanks
>> Sai
>>
>>
>>
>>
>> --
>> Nitin Pawar
>>
>>
>>
>
>
> --
>
> Regards,
> .....  Sudhakara.st
>
>



-- 
Nitin Pawar
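
To make the suggestion concrete: the UnknownHostException comes from the literal host name "master" in the URI, which is not defined in the /etc/hosts shown above. A minimal sketch, assuming a pseudo-distributed setup where the NameNode is reachable at localhost:9000; it uses FileSystem.get() instead of constructing DistributedFileSystem by hand, and the class name CopyToHdfs is made up for the example.

import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class CopyToHdfs {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Assumes the NameNode RPC address is localhost:9000 in this setup.
        FileSystem fs = FileSystem.get(new URI("hdfs://localhost:9000"), conf);
        // Paths taken from the original post; adjust to your own files.
        fs.copyFromLocalFile(new Path("/home/kosmos/Work/input/wordpair.txt"),
                new Path("/input/raptor/trade1.txt"));
        fs.close();
    }
}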

Re: Trying to copy file to Hadoop file system from a program

Posted by sudhakara st <su...@gmail.com>.
Hi,

Execute ifconfig to find the IP of the system
and add this line to /etc/hosts:
(your ip) ubuntu

then use the URI string: public static String fsURI = "hdfs://ubuntu:9000";

On Sun, Feb 24, 2013 at 5:23 PM, Sai Sai <sa...@yahoo.in> wrote:

> Many Thanks Nitin for your quick reply.
>
> Heres what i have in my hosts file and i am running in VM i m assuming it
> is the pseudo mode:
>
> *********************
> 127.0.0.1    localhost.localdomain    localhost
> #::1    ubuntu    localhost6.localdomain6    localhost6
> #127.0.1.1    ubuntu
> 127.0.0.1   ubuntu
>
> # The following lines are desirable for IPv6 capable hosts
> ::1     localhost ip6-localhost ip6-loopback
> fe00::0 ip6-localnet
> ff00::0 ip6-mcastprefix
> ff02::1 ip6-allnodes
> ff02::2 ip6-allrouters
> ff02::3 ip6-allhosts
> *********************
> In my masters i have:
> ubuntu
> In my slaves i have:
> localhost
> ***********************
> My question is in my variable below:
> public static String fsURI = "hdfs://master:9000";
>
> what would be the right value so i can connect to Hadoop successfully.
> Please let me know if you need more info.
> Thanks
> Sai
>
>
>
>
>
>   ------------------------------
> *From:* Nitin Pawar <ni...@gmail.com>
> *To:* user@hadoop.apache.org; Sai Sai <sa...@yahoo.in>
> *Sent:* Sunday, 24 February 2013 3:42 AM
> *Subject:* Re: Trying to copy file to Hadoop file system from a program
>
> if you want to use master as your hostname then make such entry in your
> /etc/hosts file
>
> or change the hdfs://master to hdfs://localhost
>
>
> On Sun, Feb 24, 2013 at 5:10 PM, Sai Sai <sa...@yahoo.in> wrote:
>
>
> Greetings,
>
> Below is the program i am trying to run and getting this exception:
> ***************************************
> Test Start.....
> java.net.UnknownHostException: unknown host: master
>     at org.apache.hadoop.ipc.Client$Connection.<init>(Client.java:214)
>     at org.apache.hadoop.ipc.Client.getConnection(Client.java:1196)
>     at org.apache.hadoop.ipc.Client.call(Client.java:1050)
>     at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:225)
>     at $Proxy1.getProtocolVersion(Unknown Source)
>     at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:396)
>     at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:379)
>     at
> org.apache.hadoop.hdfs.DFSClient.createRPCNamenode(DFSClient.java:119)
>     at org.apache.hadoop.hdfs.DFSClient.<init>(DFSClient.java:238)
>     at org.apache.hadoop.hdfs.DFSClient.<init>(DFSClient.java:203)
>     at
> org.apache.hadoop.hdfs.DistributedFileSystem.initialize(DistributedFileSystem.java:89)
>     at kelly.hadoop.hive.test.HadoopTest.main(HadoopTest.java:54)
>
>
> ********************
>
> public class HdpTest {
>
>     public static String fsURI = "hdfs://master:9000";
>
>
>     public static void copyFileToDFS(FileSystem fs, String srcFile,
>             String dstFile) throws IOException {
>         try {
>             System.out.println("Initialize copy...");
>             URI suri = new URI(srcFile);
>             URI duri = new URI(fsURI + "/" + dstFile);
>             Path dst = new Path(duri.toString());
>             Path src = new Path(suri.toString());
>             System.out.println("Start copy...");
>             fs.copyFromLocalFile(src, dst);
>             System.out.println("End copy...");
>         } catch (Exception e) {
>             e.printStackTrace();
>         }
>     }
>
>     public static void main(String[] args) {
>         try {
>             System.out.println("Test Start.....");
>             Configuration conf = new Configuration();
>             DistributedFileSystem fs = new DistributedFileSystem();
>             URI duri = new URI(fsURI);
>             fs.initialize(duri, conf); // Here is the xception occuring
>             long start = 0, end = 0;
>             start = System.nanoTime();
>             //writing data from local to HDFS
>             copyFileToDFS(fs, "/home/kosmos/Work/input/wordpair.txt",
>                     "/input/raptor/trade1.txt");
>             //Writing data from HDFS to Local
> //             copyFileFromDFS(fs, "/input/raptor/trade1.txt",
> "/home/kosmos/Work/input/wordpair1.txt");
>             end = System.nanoTime();
>             System.out.println("Total Execution times: " + (end - start));
>             fs.close();
>         } catch (Throwable t) {
>             t.printStackTrace();
>         }
>     }
> ******************************
> I am trying to access in FireFox this url:
>  hdfs://master:9000
>
>  Get an error msg FF does not know how to display this message.
>
>  I can successfully access my admin page:
>
>  http://localhost:50070/dfshealth.jsp
>
> Just wondering if anyone can give me any suggestions, your help will be
> really appreciated.
> Thanks
> Sai
>
>
>
>
> --
> Nitin Pawar
>
>
>


-- 

Regards,
.....  Sudhakara.st
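
Whichever host name ends up in the URI, the UnknownHostException is thrown before any HDFS traffic happens, so it is worth confirming that the name actually resolves after editing /etc/hosts. A small sketch, assuming the host name "ubuntu" from the advice above; the class name HostCheck is made up for the example.

import java.net.InetAddress;
import java.net.UnknownHostException;

public class HostCheck {
    public static void main(String[] args) {
        // Assumed host name; pass a different one as the first argument.
        String host = args.length > 0 ? args[0] : "ubuntu";
        try {
            // Resolves through /etc/hosts or DNS, like the Hadoop IPC client.
            InetAddress addr = InetAddress.getByName(host);
            System.out.println(host + " resolves to " + addr.getHostAddress()
                    + ", so hdfs://" + host + ":9000 should at least resolve");
        } catch (UnknownHostException e) {
            System.out.println(host + " does not resolve; fix /etc/hosts first");
        }
    }
}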

Re: Trying to copy file to Hadoop file system from a program

Posted by Sai Sai <sa...@yahoo.in>.
Many thanks, Nitin, for your quick reply.

Here's what I have in my hosts file. I am running in a VM, and I am assuming it is pseudo-distributed mode:

*********************
127.0.0.1    localhost.localdomain    localhost
#::1    ubuntu    localhost6.localdomain6    localhost6
#127.0.1.1    ubuntu
127.0.0.1   ubuntu

# The following lines are desirable for IPv6 capable hosts
::1     localhost ip6-localhost ip6-loopback
fe00::0 ip6-localnet
ff00::0 ip6-mcastprefix
ff02::1 ip6-allnodes
ff02::2 ip6-allrouters
ff02::3 ip6-allhosts
*********************
In my masters file I have:
ubuntu
In my slaves file I have:
localhost
***********************
My question is about the variable below:
public static String fsURI = "hdfs://master:9000";

What would be the right value so I can connect to Hadoop successfully?
Please let me know if you need more info.
Thanks
Sai







________________________________
 From: Nitin Pawar <ni...@gmail.com>
To: user@hadoop.apache.org; Sai Sai <sa...@yahoo.in> 
Sent: Sunday, 24 February 2013 3:42 AM
Subject: Re: Trying to copy file to Hadoop file system from a program
 

if you want to use master as your hostname then make such entry in your /etc/hosts file 

or change the hdfs://master to hdfs://localhost 



On Sun, Feb 24, 2013 at 5:10 PM, Sai Sai <sa...@yahoo.in> wrote:


>
>Greetings,
>
>
>Below is the program i am trying to run and getting this exception:
>***************************************
>
>Test Start.....
>java.net.UnknownHostException: unknown host: master
>    at org.apache.hadoop.ipc.Client$Connection.<init>(Client.java:214)
>    at org.apache.hadoop.ipc.Client.getConnection(Client.java:1196)
>    at org.apache.hadoop.ipc.Client.call(Client.java:1050)
>    at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:225)
>    at $Proxy1.getProtocolVersion(Unknown Source)
>    at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:396)
>    at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:379)
>    at org.apache.hadoop.hdfs.DFSClient.createRPCNamenode(DFSClient.java:119)
>    at org.apache.hadoop.hdfs.DFSClient.<init>(DFSClient.java:238)
>    at org.apache.hadoop.hdfs.DFSClient.<init>(DFSClient.java:203)
>    at org.apache.hadoop.hdfs.DistributedFileSystem.initialize(DistributedFileSystem.java:89)
>    at kelly.hadoop.hive.test.HadoopTest.main(HadoopTest.java:54)
>
>
>
>
>********************
>
>
>
>public class HdpTest {
>    
>    public static String fsURI = "hdfs://master:9000";
>
>    
>    public static void copyFileToDFS(FileSystem fs, String srcFile,
>            String dstFile) throws IOException {
>        try {
>            System.out.println("Initialize copy...");
>            URI suri = new URI(srcFile);
>            URI duri = new URI(fsURI + "/" + dstFile);
>            Path dst = new Path(duri.toString());
>            Path src = new Path(suri.toString());
>            System.out.println("Start copy...");
>            fs.copyFromLocalFile(src, dst);
>            System.out.println("End copy...");
>        } catch (Exception e) {
>            e.printStackTrace();
>        }
>    }
>
>    public static void main(String[] args) {
>        try {
>            System.out.println("Test Start.....");
>            Configuration conf = new Configuration();
>            DistributedFileSystem fs = new DistributedFileSystem();
>            URI duri = new URI(fsURI);
>            fs.initialize(duri, conf); // Here is the xception occuring
>            long start = 0, end = 0;
>            start = System.nanoTime();
>            //writing data from local to HDFS
>            copyFileToDFS(fs, "/home/kosmos/Work/input/wordpair.txt",
>                    "/input/raptor/trade1.txt");
>            //Writing data from HDFS to Local
>//             copyFileFromDFS(fs, "/input/raptor/trade1.txt", "/home/kosmos/Work/input/wordpair1.txt");
>            end = System.nanoTime();
>            System.out.println("Total Execution times: " + (end - start));
>            fs.close();
>        } catch (Throwable t) {
>            t.printStackTrace();
>        }
>    }
>
>******************************
>I am trying to access in FireFox this url: 
>
>hdfs://master:9000
>
>
>Get an error msg FF does not know how to display this message.
>
>
>I can successfully access my admin page:
>
>
>http://localhost:50070/dfshealth.jsp
>
>
>Just wondering if anyone can give me any suggestions, your help will be really appreciated.
>Thanks
>Sai
>
>
>


-- 
Nitin Pawar
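
Given the /etc/hosts above, "master" is never defined, so hdfs://master:9000 cannot resolve; any name that maps to 127.0.0.1 (localhost or ubuntu here) should work, provided it matches where the NameNode is listening. A small sketch that avoids hard-coding the value by reading fs.default.name from the Hadoop configuration; it assumes core-site.xml is on the classpath, and the class name ShowDefaultFs is made up for the example.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;

public class ShowDefaultFs {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // On Hadoop 1.x the default filesystem URI lives in fs.default.name.
        System.out.println("fs.default.name = " + conf.get("fs.default.name"));
        // FileSystem.get(conf) connects to that URI, so nothing is hard-coded.
        FileSystem fs = FileSystem.get(conf);
        System.out.println("connected to " + fs.getUri());
        fs.close();
    }
}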

Re: Trying to copy file to Hadoop file system from a program

Posted by Sai Sai <sa...@yahoo.in>.
Many Thanks Nitin for your quick reply.

Heres what i have in my hosts file and i am running in VM i m assuming it is the pseudo mode:

*********************
127.0.0.1    localhost.localdomain    localhost
#::1    ubuntu    localhost6.localdomain6    localhost6
#127.0.1.1    ubuntu
127.0.0.1   ubuntu

# The following lines are desirable for IPv6 capable hosts
::1     localhost ip6-localhost ip6-loopback
fe00::0 ip6-localnet
ff00::0 ip6-mcastprefix
ff02::1 ip6-allnodes
ff02::2 ip6-allrouters
ff02::3 ip6-allhosts
*********************
In my masters i have:
ubuntu
In my slaves i have:
localhost
***********************
My question is in my variable below:
public static String fsURI = "hdfs://master:9000";

what would be the right value so i can connect to Hadoop successfully.
Please let me know if you need more info.
Thanks
Sai







________________________________
 From: Nitin Pawar <ni...@gmail.com>
To: user@hadoop.apache.org; Sai Sai <sa...@yahoo.in> 
Sent: Sunday, 24 February 2013 3:42 AM
Subject: Re: Trying to copy file to Hadoop file system from a program
 

if you want to use master as your hostname then make such entry in your /etc/hosts file 

or change the hdfs://master to hdfs://localhost 



On Sun, Feb 24, 2013 at 5:10 PM, Sai Sai <sa...@yahoo.in> wrote:


>
>Greetings,
>
>
>Below is the program i am trying to run and getting this exception:
>***************************************
>
>Test Start.....
>java.net.UnknownHostException: unknown host: master
>    at org.apache.hadoop.ipc.Client$Connection.<init>(Client.java:214)
>    at org.apache.hadoop.ipc.Client.getConnection(Client.java:1196)
>    at org.apache.hadoop.ipc.Client.call(Client.java:1050)
>    at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:225)
>    at $Proxy1.getProtocolVersion(Unknown Source)
>    at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:396)
>    at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:379)
>    at org.apache.hadoop.hdfs.DFSClient.createRPCNamenode(DFSClient.java:119)
>    at org.apache.hadoop.hdfs.DFSClient.<init>(DFSClient.java:238)
>    at org.apache.hadoop.hdfs.DFSClient.<init>(DFSClient.java:203)
>    at org.apache.hadoop.hdfs.DistributedFileSystem.initialize(DistributedFileSystem.java:89)
>    at kelly.hadoop.hive.test.HadoopTest.main(HadoopTest.java:54)
>
>
>
>
>********************
>
>
>
>public class HdpTest {
>    
>    public static String fsURI = "hdfs://master:9000";
>
>    
>    public static void copyFileToDFS(FileSystem fs, String srcFile,
>            String dstFile) throws IOException {
>        try {
>            System.out.println("Initialize copy...");
>            URI suri = new URI(srcFile);
>            URI duri = new URI(fsURI + "/" + dstFile);
>            Path dst = new Path(duri.toString());
>            Path src = new Path(suri.toString());
>            System.out.println("Start copy...");
>            fs.copyFromLocalFile(src, dst);
>            System.out.println("End copy...");
>        } catch (Exception e) {
>            e.printStackTrace();
>        }
>    }
>
>    public static void main(String[] args) {
>        try {
>            System.out.println("Test Start.....");
>            Configuration conf = new Configuration();
>            DistributedFileSystem fs = new DistributedFileSystem();
>            URI duri = new URI(fsURI);
>            fs.initialize(duri, conf); // Here is where the exception occurs
>            long start = 0, end = 0;
>            start = System.nanoTime();
>            //writing data from local to HDFS
>            copyFileToDFS(fs, "/home/kosmos/Work/input/wordpair.txt",
>                    "/input/raptor/trade1.txt");
>            //Writing data from HDFS to Local
>//             copyFileFromDFS(fs, "/input/raptor/trade1.txt", "/home/kosmos/Work/input/wordpair1.txt");
>            end = System.nanoTime();
>            System.out.println("Total Execution times: " + (end - start));
>            fs.close();
>        } catch (Throwable t) {
>            t.printStackTrace();
>        }
>    }
>
>******************************
>I am trying to access in FireFox this url: 
>
>hdfs://master:9000
>
>
>Get an error msg FF does not know how to display this message.
>
>
>I can successfully access my admin page:
>
>
>http://localhost:50070/dfshealth.jsp
>
>
>Just wondering if anyone can give me any suggestions, your help will be really appreciated.
>Thanks
>Sai
>
>
>


-- 
Nitin Pawar

Re: Trying to copy file to Hadoop file system from a program

Posted by Nitin Pawar <ni...@gmail.com>.
If you want to use master as your hostname, then add an entry for it in your
/etc/hosts file,

or change hdfs://master to hdfs://localhost.
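
For example, a minimal sketch of the two options (assuming the NameNode from the pseudo-distributed setup described in this thread listens on port 9000 on the same machine):

Option 1 - make "master" resolvable by adding a line to /etc/hosts:

127.0.0.1   master

Option 2 - point the client at localhost instead:

public static String fsURI = "hdfs://localhost:9000";

Either way, the host and port in fsURI have to match the fs.default.name value in core-site.xml, since that is the address the NameNode listens on.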


On Sun, Feb 24, 2013 at 5:10 PM, Sai Sai <sa...@yahoo.in> wrote:

>
> Greetings,
>
> Below is the program i am trying to run and getting this exception:
> ***************************************
> Test Start.....
> java.net.UnknownHostException: unknown host: master
>     at org.apache.hadoop.ipc.Client$Connection.<init>(Client.java:214)
>     at org.apache.hadoop.ipc.Client.getConnection(Client.java:1196)
>     at org.apache.hadoop.ipc.Client.call(Client.java:1050)
>     at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:225)
>     at $Proxy1.getProtocolVersion(Unknown Source)
>     at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:396)
>     at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:379)
>     at
> org.apache.hadoop.hdfs.DFSClient.createRPCNamenode(DFSClient.java:119)
>     at org.apache.hadoop.hdfs.DFSClient.<init>(DFSClient.java:238)
>     at org.apache.hadoop.hdfs.DFSClient.<init>(DFSClient.java:203)
>     at
> org.apache.hadoop.hdfs.DistributedFileSystem.initialize(DistributedFileSystem.java:89)
>     at kelly.hadoop.hive.test.HadoopTest.main(HadoopTest.java:54)
>
>
> ********************
>
> public class HdpTest {
>
>     public static String fsURI = "hdfs://master:9000";
>
>
>     public static void copyFileToDFS(FileSystem fs, String srcFile,
>             String dstFile) throws IOException {
>         try {
>             System.out.println("Initialize copy...");
>             URI suri = new URI(srcFile);
>             URI duri = new URI(fsURI + "/" + dstFile);
>             Path dst = new Path(duri.toString());
>             Path src = new Path(suri.toString());
>             System.out.println("Start copy...");
>             fs.copyFromLocalFile(src, dst);
>             System.out.println("End copy...");
>         } catch (Exception e) {
>             e.printStackTrace();
>         }
>     }
>
>     public static void main(String[] args) {
>         try {
>             System.out.println("Test Start.....");
>             Configuration conf = new Configuration();
>             DistributedFileSystem fs = new DistributedFileSystem();
>             URI duri = new URI(fsURI);
>             fs.initialize(duri, conf); // Here is where the exception occurs
>             long start = 0, end = 0;
>             start = System.nanoTime();
>             //writing data from local to HDFS
>             copyFileToDFS(fs, "/home/kosmos/Work/input/wordpair.txt",
>                     "/input/raptor/trade1.txt");
>             //Writing data from HDFS to Local
> //             copyFileFromDFS(fs, "/input/raptor/trade1.txt",
> "/home/kosmos/Work/input/wordpair1.txt");
>             end = System.nanoTime();
>             System.out.println("Total Execution times: " + (end - start));
>             fs.close();
>         } catch (Throwable t) {
>             t.printStackTrace();
>         }
>     }
> ******************************
> I am trying to access in FireFox this url:
>  hdfs://master:9000
>
>  Get an error msg FF does not know how to display this message.
>
>  I can successfully access my admin page:
>
>  http://localhost:50070/dfshealth.jsp
>
> Just wondering if anyone can give me any suggestions, your help will be
> really appreciated.
> Thanks
> Sai
>
>


-- 
Nitin Pawar

Re: Trying to copy file to Hadoop file system from a program

Posted by Sai Sai <sa...@yahoo.in>.

Greetings,

Below is the program I am trying to run and the exception I am getting:
***************************************

Test Start.....
java.net.UnknownHostException: unknown host: master
    at org.apache.hadoop.ipc.Client$Connection.<init>(Client.java:214)
    at org.apache.hadoop.ipc.Client.getConnection(Client.java:1196)
    at org.apache.hadoop.ipc.Client.call(Client.java:1050)
    at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:225)
    at $Proxy1.getProtocolVersion(Unknown Source)
    at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:396)
    at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:379)
    at org.apache.hadoop.hdfs.DFSClient.createRPCNamenode(DFSClient.java:119)
    at org.apache.hadoop.hdfs.DFSClient.<init>(DFSClient.java:238)
    at org.apache.hadoop.hdfs.DFSClient.<init>(DFSClient.java:203)
    at org.apache.hadoop.hdfs.DistributedFileSystem.initialize(DistributedFileSystem.java:89)
    at kelly.hadoop.hive.test.HadoopTest.main(HadoopTest.java:54)



********************


public class HdpTest {
    
    public static String fsURI = "hdfs://master:9000";

    
    public static void copyFileToDFS(FileSystem fs, String srcFile,
            String dstFile) throws IOException {
        try {
            System.out.println("Initialize copy...");
            URI suri = new URI(srcFile);
            URI duri = new URI(fsURI + "/" + dstFile);
            Path dst = new Path(duri.toString());
            Path src = new Path(suri.toString());
            System.out.println("Start copy...");
            fs.copyFromLocalFile(src, dst);
            System.out.println("End copy...");
        } catch (Exception e) {
            e.printStackTrace();
        }
    }

    public static void main(String[] args) {
        try {
            System.out.println("Test Start.....");
            Configuration conf = new Configuration();
            DistributedFileSystem fs = new DistributedFileSystem();
            URI duri = new URI(fsURI);
            fs.initialize(duri, conf); // Here is where the exception occurs
            long start = 0, end = 0;
            start = System.nanoTime();
            //writing data from local to HDFS
            copyFileToDFS(fs, "/home/kosmos/Work/input/wordpair.txt",
                    "/input/raptor/trade1.txt");
            //Writing data from HDFS to Local
//             copyFileFromDFS(fs, "/input/raptor/trade1.txt", "/home/kosmos/Work/input/wordpair1.txt");
            end = System.nanoTime();
            System.out.println("Total Execution times: " + (end - start));
            fs.close();
        } catch (Throwable t) {
            t.printStackTrace();
        }
    }

******************************
I am trying to access this URL in Firefox:

hdfs://master:9000

I get an error message saying Firefox does not know how to display it.

I can successfully access my admin page:

http://localhost:50070/dfshealth.jsp

Just wondering if anyone can give me any suggestions; your help will be really appreciated.
Thanks
Sai

Re: WordPairCount Mapreduce question.

Posted by Harsh J <ha...@cloudera.com>.
Also noteworthy is that the performance gain can only be had (from the
byte level compare method) iff the
serialization/deserialization/format of data is comparable at the byte
level. One such provider is Apache Avro:
http://avro.apache.org/docs/current/spec.html#order.

Most other implementations simply deserialize again from the
bytestream and then compare, which has a higher (or, regular) cost.
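
To make that concrete, here is a minimal sketch of a field-by-field raw comparator for the two-string key discussed in this thread. The class name WordPairCountKey and the use of writeUTF come from the code quoted further down; the comparator itself is only illustrative, not the poster's actual code. It assumes both fields are written with DataOutput.writeUTF (a two-byte length followed by the string bytes) and that the data is plain ASCII, so byte order matches String.compareTo:

// assumes: import org.apache.hadoop.io.WritableComparator;
public static class RawKeyComparator extends WritableComparator {

    protected RawKeyComparator() {
        super(WordPairCountKey.class);
    }

    @Override
    public int compare(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2) {
        // First field: writeUTF stores an unsigned 2-byte length, then the bytes.
        int len1 = readUnsignedShort(b1, s1);
        int len2 = readUnsignedShort(b2, s2);
        int cmp = compareBytes(b1, s1 + 2, len1, b2, s2 + 2, len2);
        if (cmp != 0) {
            return cmp;
        }
        // Second field starts right after the first one ends.
        int o1 = s1 + 2 + len1;
        int o2 = s2 + 2 + len2;
        return compareBytes(b1, o1 + 2, readUnsignedShort(b1, o1),
                            b2, o2 + 2, readUnsignedShort(b2, o2));
    }
}

Comparing field by field on the serialized bytes keeps the whole sort on raw bytes; with a format such as Avro that byte-level ordering is guaranteed by the specification itself, which is the point made above.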

On Mon, Feb 25, 2013 at 1:44 PM, Mahesh Balija
<ba...@gmail.com> wrote:
> byte array comparison is for performance reasons only, but NOT the way you
> are thinking.
> This method comes from an interface called RawComparator which provides the
> prototype (public int compare(byte[] b1, int s1, int l1, byte[] b2, int s2,
> int l2);) for this method.
> In the sorting phase where the keys are sorted, because of this
> implementation the records are read from the stream directly and sorted
> without the need to deserializing them into Objects.
>
> Best,
> Mahesh Balija,
> CalsoftLabs.
>
>
> On Sun, Feb 24, 2013 at 5:01 PM, Sai Sai <sa...@yahoo.in> wrote:
>>
>> Thanks Mahesh for your help.
>>
>> Wondering if u can provide some insight with the below compare method
>> using byte[] in the SecondarySort example:
>>
>> public static class Comparator extends WritableComparator {
>>         public Comparator() {
>>             super(URICountKey.class);
>>         }
>>
>>         public int compare(byte[] b1, int s1, int l1, byte[] b2, int s2,
>> int l2) {
>>             return compareBytes(b1, s1, l1, b2, s2, l2);
>>         }
>>     }
>>
>> My question is in the below compare method that i have given we are
>> comparing word1/word2
>> which makes sense but what about this byte[] comparison, is it right in
>> assuming  it converts each objects word1/word2/word3 to byte[] and compares
>> them.
>> If so is it for performance reason it is done.
>> Could you please verify.
>> Thanks
>> Sai
>> ________________________________
>> From: Mahesh Balija <ba...@gmail.com>
>> To: user@hadoop.apache.org; Sai Sai <sa...@yahoo.in>
>> Sent: Saturday, 23 February 2013 5:23 AM
>> Subject: Re: WordPairCount Mapreduce question.
>>
>> Please check the in-line answers...
>>
>> On Sat, Feb 23, 2013 at 6:22 PM, Sai Sai <sa...@yahoo.in> wrote:
>>
>>
>> Hello
>>
>> I have a question about how Mapreduce sorting works internally with
>> multiple columns.
>>
>> Below r my classes using 2 columns in an input file given below.
>>
>> 1st question: About the method hashCode, we r adding a "31 + ", i am
>> wondering why is this required. what does 31 refer to.
>>
>> This is how usually hashcode is calculated for any String instance
>> (s[0]*31^(n-1) + s[1]*31^(n-2) + ... + s[n-1]) where n stands for length of
>> the String. Since in your case you only have 2 chars then it will be a *
>> 31^0 + b * 31^1.
>>
>>
>>
>> 2nd question: what if my input file has 3 columns instead of 2 how would
>> you write a compare method and was wondering if anyone can map this to a
>> real world scenario it will be really helpful.
>>
>> you will extend the same approach for the third column,
>>  public int compareTo(WordPairCountKey o) {
>>         int diff = word1.compareTo(o.word1);
>>         if (diff == 0) {
>>             diff = word2.compareTo(o.word2);
>>             if(diff==0){
>>                  diff = word3.compareTo(o.word3);
>>             }
>>         }
>>         return diff;
>>     }
>>
>>
>>
>>
>>     @Override
>>     public int compareTo(WordPairCountKey o) {
>>         int diff = word1.compareTo(o.word1);
>>         if (diff == 0) {
>>             diff = word2.compareTo(o.word2);
>>         }
>>         return diff;
>>     }
>>
>>     @Override
>>     public int hashCode() {
>>         return word1.hashCode() + 31 * word2.hashCode();
>>     }
>>
>> ******************************
>>
>> Here is my input file wordpair.txt
>>
>> ******************************
>>
>> a    b
>> a    c
>> a    b
>> a    d
>> b    d
>> e    f
>> b    d
>> e    f
>> b    d
>>
>> **********************************
>>
>> Here is my WordPairObject:
>>
>> *********************************
>>
>> public class WordPairCountKey implements
>> WritableComparable<WordPairCountKey> {
>>
>>     private String word1;
>>     private String word2;
>>
>>     @Override
>>     public int compareTo(WordPairCountKey o) {
>>         int diff = word1.compareTo(o.word1);
>>         if (diff == 0) {
>>             diff = word2.compareTo(o.word2);
>>         }
>>         return diff;
>>     }
>>
>>     @Override
>>     public int hashCode() {
>>         return word1.hashCode() + 31 * word2.hashCode();
>>     }
>>
>>
>>     public String getWord1() {
>>         return word1;
>>     }
>>
>>     public void setWord1(String word1) {
>>         this.word1 = word1;
>>     }
>>
>>     public String getWord2() {
>>         return word2;
>>     }
>>
>>     public void setWord2(String word2) {
>>         this.word2 = word2;
>>     }
>>
>>     @Override
>>     public void readFields(DataInput in) throws IOException {
>>         word1 = in.readUTF();
>>         word2 = in.readUTF();
>>     }
>>
>>     @Override
>>     public void write(DataOutput out) throws IOException {
>>         out.writeUTF(word1);
>>         out.writeUTF(word2);
>>     }
>>
>>
>>     @Override
>>     public String toString() {
>>         return "[word1=" + word1 + ", word2=" + word2 + "]";
>>     }
>>
>> }
>>
>> ******************************
>>
>> Any help will be really appreciated.
>> Thanks
>> Sai
>>
>>
>>
>>
>



--
Harsh J

Re: WordPairCount Mapreduce question.

Posted by Mahesh Balija <ba...@gmail.com>.
The byte array comparison is for performance reasons only, but NOT in the way you
are thinking.
This method comes from an interface called RawComparator, which provides the
prototype (public int compare(byte[] b1, int s1, int l1, byte[] b2, int s2,
int l2);) for it.
In the sorting phase, where the keys are sorted, this implementation lets the
records be read from the stream and compared directly, without deserializing
them into objects.

Best,
Mahesh Balija,
CalsoftLabs.
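
As a small follow-up sketch: for the framework to pick up such a raw comparator during the sort, it is usually registered against the key class. RawKeyComparator below refers to the illustrative comparator sketched earlier in this thread, and WordPairCountKey is the poster's key class:

// inside WordPairCountKey; assumes: import org.apache.hadoop.io.WritableComparator;
static {
    // make the byte-level comparator the default for this key type
    WritableComparator.define(WordPairCountKey.class, new RawKeyComparator());
}

A job can also set it explicitly with job.setSortComparatorClass(RawKeyComparator.class) in the new API; in that case the comparator class needs a public no-argument constructor.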

On Sun, Feb 24, 2013 at 5:01 PM, Sai Sai <sa...@yahoo.in> wrote:

> Thanks Mahesh for your help.
>
> Wondering if u can provide some insight with the below compare method
> using byte[] in the SecondarySort example:
>
> public static class Comparator extends WritableComparator {
>         public Comparator() {
>             super(URICountKey.class);
>         }
>
>         public int compare(byte[] b1, int s1, int l1, byte[] b2, int s2,
> int l2) {
>             return compareBytes(b1, s1, l1, b2, s2, l2);
>         }
>     }
>
> My question is in the below compare method that i have given we are
> comparing word1/word2
> which makes sense but what about this byte[] comparison, is it right in
> assuming  it converts each objects word1/word2/word3 to byte[] and compares
> them.
> If so is it for performance reason it is done.
> Could you please verify.
> Thanks
> Sai
>   ------------------------------
> *From:* Mahesh Balija <ba...@gmail.com>
> *To:* user@hadoop.apache.org; Sai Sai <sa...@yahoo.in>
> *Sent:* Saturday, 23 February 2013 5:23 AM
> *Subject:* Re: WordPairCount Mapreduce question.
>
> Please check the in-line answers...
>
> On Sat, Feb 23, 2013 at 6:22 PM, Sai Sai <sa...@yahoo.in> wrote:
>
>
> Hello
>
> I have a question about how Mapreduce sorting works internally with
> multiple columns.
>
> Below r my classes using 2 columns in an input file given below.
>
> 1st question: About the method hashCode, we r adding a "31 + ", i am
> wondering why is this required. what does 31 refer to.
>
> This is how usually hashcode is calculated for any String instance
> (s[0]*31^(n-1) + s[1]*31^(n-2) + ... + s[n-1]) where n stands for length of
> the String. Since in your case you only have 2 chars then it will be a *
> 31^0 + b * 31^1.
>
>
>
> 2nd question: what if my input file has 3 columns instead of 2 how would
> you write a compare method and was wondering if anyone can map this to a
> real world scenario it will be really helpful.
>
> you will extend the same approach for the third column,
>  public int compareTo(WordPairCountKey o) {
>         int diff = word1.compareTo(o.word1);
>         if (diff == 0) {
>             diff = word2.compareTo(o.word2);
>             if(diff==0){
>                  diff = word3.compareTo(o.word3);
>             }
>          }
>         return diff;
>     }
>
>
>
>
>     @Override
>     public int compareTo(WordPairCountKey o) {
>         int diff = word1.compareTo(o.word1);
>         if (diff == 0) {
>             diff = word2.compareTo(o.word2);
>         }
>         return diff;
>     }
>
>     @Override
>     public int hashCode() {
>         return word1.hashCode() + 31 * word2.hashCode();
>     }
>
> ******************************
>
> Here is my input file wordpair.txt
>
> ******************************
>
> a    b
> a    c
> a    b
> a    d
> b    d
> e    f
> b    d
> e    f
> b    d
>
> **********************************
>
> Here is my WordPairObject:
>
> *********************************
>
> public class WordPairCountKey implements
> WritableComparable<WordPairCountKey> {
>
>     private String word1;
>     private String word2;
>
>     @Override
>     public int compareTo(WordPairCountKey o) {
>         int diff = word1.compareTo(o.word1);
>         if (diff == 0) {
>             diff = word2.compareTo(o.word2);
>         }
>         return diff;
>     }
>
>     @Override
>     public int hashCode() {
>         return word1.hashCode() + 31 * word2.hashCode();
>     }
>
>
>     public String getWord1() {
>         return word1;
>     }
>
>     public void setWord1(String word1) {
>         this.word1 = word1;
>     }
>
>     public String getWord2() {
>         return word2;
>     }
>
>     public void setWord2(String word2) {
>         this.word2 = word2;
>     }
>
>     @Override
>     public void readFields(DataInput in) throws IOException {
>         word1 = in.readUTF();
>         word2 = in.readUTF();
>     }
>
>     @Override
>     public void write(DataOutput out) throws IOException {
>         out.writeUTF(word1);
>         out.writeUTF(word2);
>     }
>
>
>     @Override
>     public String toString() {
>         return "[word1=" + word1 + ", word2=" + word2 + "]";
>     }
>
> }
>
> ******************************
>
> Any help will be really appreciated.
> Thanks
> Sai
>
>
>
>
>

Re: WordPairCount Mapreduce question.

Posted by Sai Sai <sa...@yahoo.in>.
Thanks Mahesh for your help.

Wondering if you can provide some insight into the below compare method using byte[] in the SecondarySort example:

public static class Comparator extends WritableComparator {
        public Comparator() {
            super(URICountKey.class);
        }

        public int compare(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2) {
            return compareBytes(b1, s1, l1, b2, s2, l2);
        }
    }


My question is: in the compareTo method I have given we are comparing
word1/word2, which makes sense, but what about this byte[] comparison? Is it
right to assume it converts each object's word1/word2/word3 to byte[] and
compares them?
If so, is it done for performance reasons?
Could you please verify.
Thanks
Sai


________________________________
 From: Mahesh Balija <ba...@gmail.com>
To: user@hadoop.apache.org; Sai Sai <sa...@yahoo.in> 
Sent: Saturday, 23 February 2013 5:23 AM
Subject: Re: WordPairCount Mapreduce question.
 

Please check the in-line answers...


On Sat, Feb 23, 2013 at 6:22 PM, Sai Sai <sa...@yahoo.in> wrote:


>
>Hello
>
>
>I have a question about how Mapreduce sorting works internally with multiple columns.
>
>
>Below r my classes using 2 columns in an input file given below.
>
>
>
>1st question: About the method hashCode, we r adding a "31 + ", i am wondering why is this required. what does 31 refer to.
>
This is how usually hashcode is calculated for any String instance (s[0]*31^(n-1) + s[1]*31^(n-2) + ... + s[n-1]) where n stands for length of the String. Since in your case you only have 2 chars then it will be a * 31^0 + b * 31^1.
 


>
>2nd question: what if my input file has 3 columns instead of 2 how would you write a compare method and was wondering if anyone can map this to a real world scenario it will be really helpful.
>
you will extend the same approach for the third column,
 public int compareTo(WordPairCountKey o) {
        int diff = word1.compareTo(o.word1);
        if (diff == 0) {
            diff = word2.compareTo(o.word2);
            if(diff==0){
                 diff = word3.compareTo(o.word3);
            }
        }
        return diff;
    }
    

>
>
>
>    @Override
>    public int compareTo(WordPairCountKey o) {
>        int diff = word1.compareTo(o.word1);
>        if (diff == 0) {
>            diff = word2.compareTo(o.word2);
>        }
>        return diff;
>    }
>    
>    @Override
>    public int hashCode() {
>        return word1.hashCode() + 31 * word2.hashCode();
>    }
>
>
>******************************
>
>Here is my input file wordpair.txt
>
>******************************
>
>a    b
>a    c
>a    b
>a    d
>b    d
>e    f
>b    d
>e    f
>b    d
>
>**********************************
>
>
>Here is my WordPairObject:
>
>*********************************
>
>public class WordPairCountKey implements WritableComparable<WordPairCountKey> {
>
>    private String word1;
>    private String word2;
>
>    @Override
>    public int compareTo(WordPairCountKey o) {
>        int diff = word1.compareTo(o.word1);
>        if (diff == 0) {
>            diff = word2.compareTo(o.word2);
>        }
>        return diff;
>    }
>    
>    @Override
>    public int hashCode() {
>        return word1.hashCode() + 31 * word2.hashCode();
>    }
>
>    
>    public String getWord1() {
>        return word1;
>    }
>
>    public void setWord1(String word1) {
>        this.word1 = word1;
>    }
>
>    public String getWord2() {
>        return word2;
>    }
>
>    public void setWord2(String word2) {
>        this.word2 = word2;
>    }
>
>    @Override
>    public void readFields(DataInput in) throws IOException {
>        word1 = in.readUTF();
>        word2 = in.readUTF();
>    }
>
>    @Override
>    public void
 write(DataOutput out) throws IOException {
>        out.writeUTF(word1);
>        out.writeUTF(word2);
>    }
>
>    
>    @Override
>    public String toString() {
>        return "[word1=" + word1 + ", word2=" + word2 + "]";
>    }
>
>}
>
>******************************
>
>Any help will be really appreciated.
>Thanks
>Sai
>

Re: Trying to copy file to Hadoop file system from a program

Posted by Sai Sai <sa...@yahoo.in>.

Greetings,

Below is the program I am trying to run; it produces this exception:
***************************************

Test Start.....
java.net.UnknownHostException: unknown host: master
    at org.apache.hadoop.ipc.Client$Connection.<init>(Client.java:214)
    at org.apache.hadoop.ipc.Client.getConnection(Client.java:1196)
    at org.apache.hadoop.ipc.Client.call(Client.java:1050)
    at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:225)
    at $Proxy1.getProtocolVersion(Unknown Source)
    at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:396)
    at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:379)
    at org.apache.hadoop.hdfs.DFSClient.createRPCNamenode(DFSClient.java:119)
    at org.apache.hadoop.hdfs.DFSClient.<init>(DFSClient.java:238)
    at org.apache.hadoop.hdfs.DFSClient.<init>(DFSClient.java:203)
    at org.apache.hadoop.hdfs.DistributedFileSystem.initialize(DistributedFileSystem.java:89)
    at kelly.hadoop.hive.test.HadoopTest.main(HadoopTest.java:54)



********************


public class HdpTest {
    
    public static String fsURI = "hdfs://master:9000";

    
    public static void copyFileToDFS(FileSystem fs, String srcFile,
            String dstFile) throws IOException {
        try {
            System.out.println("Initialize copy...");
            URI suri = new URI(srcFile);
            URI duri = new URI(fsURI + "/" + dstFile);
            Path dst = new Path(duri.toString());
            Path src = new Path(suri.toString());
            System.out.println("Start copy...");
            fs.copyFromLocalFile(src, dst);
            System.out.println("End copy...");
        } catch (Exception e) {
            e.printStackTrace();
        }
    }

    public static void main(String[] args) {
        try {
            System.out.println("Test Start.....");
            Configuration conf = new Configuration();
            DistributedFileSystem fs = new DistributedFileSystem();
            URI duri = new URI(fsURI);
            fs.initialize(duri, conf); // Here is where the exception occurs
            long start = 0, end = 0;
            start = System.nanoTime();
            //writing data from local to HDFS
            copyFileToDFS(fs, "/home/kosmos/Work/input/wordpair.txt",
                    "/input/raptor/trade1.txt");
            //Writing data from HDFS to Local
//             copyFileFromDFS(fs, "/input/raptor/trade1.txt", "/home/kosmos/Work/input/wordpair1.txt");
            end = System.nanoTime();
            System.out.println("Total Execution times: " + (end - start));
            fs.close();
        } catch (Throwable t) {
            t.printStackTrace();
        }
    }
}

******************************
I am trying to access this URL in Firefox:

hdfs://master:9000

I get an error message saying Firefox does not know how to display it.

I can successfully access my admin page:

http://localhost:50070/dfshealth.jsp

Just wondering if anyone can give me any suggestions, your help will be really appreciated.
Thanks
Sai
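
A minimal sketch of the more common way to open the filesystem is below (my
sketch, reusing the same URI and paths as above). Note that the
java.net.UnknownHostException means this client machine cannot resolve the
hostname "master", so a DNS or /etc/hosts entry for it is needed whichever
API is used; also, hdfs://master:9000 is an RPC endpoint rather than a web
page, which is why Firefox cannot display it, while the web UI on port 50070
can be browsed.

import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsCopySketch {

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // "master" must resolve from this machine (DNS or /etc/hosts),
        // otherwise the RPC client fails with UnknownHostException.
        FileSystem fs = FileSystem.get(new URI("hdfs://master:9000"), conf);
        fs.copyFromLocalFile(new Path("/home/kosmos/Work/input/wordpair.txt"),
                new Path("/input/raptor/trade1.txt"));
        fs.close();
    }
}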

Re: WordPairCount Mapreduce question.

Posted by Mahesh Balija <ba...@gmail.com>.
Please check the in-line answers...

On Sat, Feb 23, 2013 at 6:22 PM, Sai Sai <sa...@yahoo.in> wrote:

>
> Hello
>
> I have a question about how Mapreduce sorting works internally with
> multiple columns.
>
> Below r my classes using 2 columns in an input file given below.
>
> 1st question: About the method hashCode, we r adding a "31 + ", i am
> wondering why is this required. what does 31 refer to.
>
This is how the hashcode is usually calculated for any String instance
(s[0]*31^(n-1) + s[1]*31^(n-2) + ... + s[n-1]), where n stands for the length
of the String; 31 is just a small odd prime that spreads hash values well.
Since in your case each word is a single character, your composite key works
out to a * 31^0 + b * 31^1, i.e. word1.hashCode() + 31 * word2.hashCode().
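
As a small worked example (mine, just arithmetic, not from your code): for
the pair ("a", "b"), "a".hashCode() is 97 and "b".hashCode() is 98, so the
composite key hashes to 97 + 31 * 98 = 3135.

public class PairHashDemo {
    public static void main(String[] args) {
        // Single-character strings hash to their char code: 'a' is 97, 'b' is 98.
        int hash = "a".hashCode() + 31 * "b".hashCode();
        System.out.println(hash); // prints 3135
    }
}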


>
> 2nd question: what if my input file has 3 columns instead of 2 how would
> you write a compare method and was wondering if anyone can map this to a
> real world scenario it will be really helpful.
>
you will extend the same approach for the third column,
 public int compareTo(WordPairCountKey o) {
        int diff = word1.compareTo(o.word1);
        if (diff == 0) {
            diff = word2.compareTo(o.word2);
            if(diff==0){
                 diff = word3.compareTo(o.word3);
            }
        }
        return diff;
    }


>
>
>     @Override
>     public int compareTo(WordPairCountKey o) {
>         int diff = word1.compareTo(o.word1);
>         if (diff == 0) {
>             diff = word2.compareTo(o.word2);
>         }
>         return diff;
>     }
>
>     @Override
>     public int hashCode() {
>         return word1.hashCode() + 31 * word2.hashCode();
>     }
>
> ******************************
>
> Here is my input file wordpair.txt
>
> ******************************
>
> a    b
> a    c
> a    b
> a    d
> b    d
> e    f
> b    d
> e    f
> b    d
>
> **********************************
>
> Here is my WordPairObject:
>
> *********************************
>
> public class WordPairCountKey implements
> WritableComparable<WordPairCountKey> {
>
>     private String word1;
>     private String word2;
>
>     @Override
>     public int compareTo(WordPairCountKey o) {
>         int diff = word1.compareTo(o.word1);
>         if (diff == 0) {
>             diff = word2.compareTo(o.word2);
>         }
>         return diff;
>     }
>
>     @Override
>     public int hashCode() {
>         return word1.hashCode() + 31 * word2.hashCode();
>     }
>
>
>     public String getWord1() {
>         return word1;
>     }
>
>     public void setWord1(String word1) {
>         this.word1 = word1;
>     }
>
>     public String getWord2() {
>         return word2;
>     }
>
>     public void setWord2(String word2) {
>         this.word2 = word2;
>     }
>
>     @Override
>     public void readFields(DataInput in) throws IOException {
>         word1 = in.readUTF();
>         word2 = in.readUTF();
>     }
>
>     @Override
>     public void write(DataOutput out) throws IOException {
>         out.writeUTF(word1);
>         out.writeUTF(word2);
>     }
>
>
>     @Override
>     public String toString() {
>         return "[word1=" + word1 + ", word2=" + word2 + "]";
>     }
>
> }
>
> ******************************
>
> Any help will be really appreciated.
> Thanks
> Sai
>

Re: WordPairCount Mapreduce question.

Posted by Sai Sai <sa...@yahoo.in>.

Hello

I have a question about how Mapreduce sorting works internally with multiple columns.

Below are my classes using 2 columns in the input file given below.


1st question: In the hashCode method we are combining the two hash codes with a factor of 31; I am wondering why this is required. What does 31 refer to?


2nd question: What if my input file has 3 columns instead of 2? How would you write a compare method? If anyone can map this to a real-world scenario, that would be really helpful.



    @Override
    public int compareTo(WordPairCountKey o) {
        int diff = word1.compareTo(o.word1);
        if (diff == 0) {
            diff = word2.compareTo(o.word2);
        }
        return diff;
    }
    
    @Override
    public int hashCode() {
        return word1.hashCode() + 31 * word2.hashCode();
    }

******************************

Here is my input file wordpair.txt

******************************

a    b
a    c
a    b
a    d
b    d
e    f
b    d
e    f
b    d

**********************************


Here is my WordPairObject:

*********************************

public class WordPairCountKey implements WritableComparable<WordPairCountKey> {

    private String word1;
    private String word2;

    @Override
    public int compareTo(WordPairCountKey o) {
        int diff = word1.compareTo(o.word1);
        if (diff == 0) {
            diff = word2.compareTo(o.word2);
        }
        return diff;
    }
    
    @Override
    public int hashCode() {
        return word1.hashCode() + 31 * word2.hashCode();
    }

    
    public String getWord1() {
        return word1;
    }

    public void setWord1(String word1) {
        this.word1 = word1;
    }

    public String getWord2() {
        return word2;
    }

    public void setWord2(String word2) {
        this.word2 = word2;
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        word1 = in.readUTF();
        word2 = in.readUTF();
    }

    @Override
    public void write(DataOutput out) throws IOException {
        out.writeUTF(word1);
        out.writeUTF(word2);
    }

    
    @Override
    public String toString() {
        return "[word1=" + word1 + ", word2=" + word2 + "]";
    }

}

******************************

Any help will be really appreciated.
Thanks
Sai

Re: mapr videos question

Posted by Ted Dunning <td...@maprtech.com>.
The MapR videos on programming and map-reduce are all general videos.

The videos that cover capabilities like NFS, snapshots and mirrors are all
MapR specific since ordinary Hadoop distributions like Cloudera,
Hortonworks and Apache can't support those capabilities.

The videos that cover MapR administration are also MapR specific.

As a later poster suggested, specific links might help.

You can also ask MapR questions at answers.mapr.com if you want direct and
quick feedback on MapR topics like the videos.

On Sat, Feb 23, 2013 at 1:37 AM, Sai Sai <sa...@yahoo.in> wrote:

>
> Hi
> Could some one please verify if the mapr videos are meant for learning
> hadoop or is it for learning mapr. If we r interested in learning hadoop
> only then will they help. As a starter would like to just understand hadoop
> only and not mapr yet.
> Just wondering if others can share their thoughts and any relevant links.
> Thanks,
> Sai
>
>

Re: mapr videos question

Posted by Marco Shaw <ma...@gmail.com>.
Sorry. Can you provide some specific links?

Marco

On 2013-02-23, at 5:37 AM, Sai Sai <sa...@yahoo.in> wrote:

> 
> Hi
> Could some one please verify if the mapr videos are meant for learning hadoop or is it for learning mapr. If we r interested in learning hadoop only then will they help. As a starter would like to just understand hadoop only and not mapr yet. 
> Just wondering if others can share their thoughts and any relevant links.
> Thanks,
> Sai
> 

Re: mapr videos question

Posted by Nitin Pawar <ni...@gmail.com>.
try this book

http://my.safaribooksonline.com/book/databases/hadoop/9780596521974


On Sat, Feb 23, 2013 at 3:07 PM, Sai Sai <sa...@yahoo.in> wrote:

>
> Hi
> Could some one please verify if the mapr videos are meant for learning
> hadoop or is it for learning mapr. If we r interested in learning hadoop
> only then will they help. As a starter would like to just understand hadoop
> only and not mapr yet.
> Just wondering if others can share their thoughts and any relevant links.
> Thanks,
> Sai
>
>


-- 
Nitin Pawar

Re: mapr videos question

Posted by Sai Sai <sa...@yahoo.in>.
Hi
Could someone please verify whether the MapR videos are meant for learning Hadoop or for learning MapR. If we are interested in learning Hadoop only, will they still help? As a starter I would like to just understand Hadoop, and not MapR yet.

Just wondering if others can share their thoughts and any relevant links.
Thanks,
Sai

Re: map reduce and sync

Posted by Lucas Bernardi <lu...@gmail.com>.
Ok, so I found a workaround for this issue; I'm sharing it here for others.
The key problem is that Hadoop won't update the file size until the file is
closed, so FileInputFormat sees never-closed files as empty and generates no
splits for the map reduce process.

To fix this I changed the way the file length is calculated, overriding the
listStatus method in a new InputFormat implementation which inherits from
FileInputFormat:

    @Override
    protected List<FileStatus> listStatus(JobContext job) throws IOException {
        List<FileStatus> listStatus = super.listStatus(job);
        List<FileStatus> result = Lists.newArrayList();
        DFSClient dfsClient = null;
        try {
            dfsClient = new DFSClient(job.getConfiguration());
            for (FileStatus fileStatus : listStatus) {
                long len = fileStatus.getLen();
                if (len == 0) {
                    // The file is still open: ask the DFS client for the length
                    // that is actually visible to readers after sync().
                    DFSInputStream open = dfsClient.open(fileStatus.getPath().toUri().getPath());
                    long fileLength = open.getFileLength();
                    open.close();
                    // Rebuild the FileStatus with the real length so getSplits() sees the data.
                    FileStatus fileStatus2 = new FileStatus(fileLength, fileStatus.isDir(),
                            fileStatus.getReplication(), fileStatus.getBlockSize(),
                            fileStatus.getModificationTime(), fileStatus.getAccessTime(),
                            fileStatus.getPermission(), fileStatus.getOwner(),
                            fileStatus.getGroup(), fileStatus.getPath());
                    result.add(fileStatus2);
                } else {
                    result.add(fileStatus);
                }
            }
        } finally {
            if (dfsClient != null) {
                dfsClient.close();
            }
        }
        return result;
    }

this worked just fine for me.

What do you think?

Thanks!
Lucas
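
To wire an override like this into a job, the listStatus override needs to live in a concrete input format. A rough usage sketch is below; SyncAwareTextInputFormat is a made-up name for a TextInputFormat subclass carrying the override, and the mapper, reducer and paths are placeholders for the usual WordCount wiring.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountOnOpenFile {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = new Job(conf, "wordcount-on-open-file");
        job.setJarByClass(WordCountOnOpenFile.class);
        job.setInputFormatClass(SyncAwareTextInputFormat.class);  // input format with the listStatus override
        job.setMapperClass(WordCountMapper.class);                // placeholder: the usual WordCount mapper
        job.setReducerClass(WordCountReducer.class);              // placeholder: the usual WordCount reducer
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path("/logs/searches"));        // placeholder input path
        FileOutputFormat.setOutputPath(job, new Path("/out/search-counts"));  // placeholder output path
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}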

On Mon, Feb 25, 2013 at 7:03 PM, Lucas Bernardi <lu...@gmail.com> wrote:

> It looks like getSplits in FileInputFormat is ignoring 0 lenght files....
> That also would explain the weird behavior of tail, which seems to always
> jump to the start since file length is 0.
>
> So, basically, sync doesn't update file length, any code based on file
> size, is unreliable.
>
> Am I right?
>
> How can I get around this?
>
> Lucas
>
>
> On Mon, Feb 25, 2013 at 12:38 PM, Lucas Bernardi <lu...@gmail.com> wrote:
>
>> I didn't notice, thanks for the heads up.
>>
>>
>> On Mon, Feb 25, 2013 at 4:31 AM, Harsh J <ha...@cloudera.com> wrote:
>>
>>> Just an aside (I've not tried to look at the original issue yet), but
>>> Scribe has not been maintained (nor has seen a release) in over a year
>>> now -- looking at the commit history. Same case with both Facebook and
>>> Twitter's fork.
>>>
>>> On Mon, Feb 25, 2013 at 7:16 AM, Lucas Bernardi <lu...@gmail.com>
>>> wrote:
>>> > Yeah I looked at scribe, looks good but sounds like too much for my
>>> problem.
>>> > I'd rather make it work the simple way. Could you pleas post your
>>> code, may
>>> > be I'm doing something wrong on the sync side. Maybe a buffer size,
>>> block
>>> > size or some other  parameter is different...
>>> >
>>> > Thanks!
>>> > Lucas
>>> >
>>> >
>>> > On Sun, Feb 24, 2013 at 10:31 PM, Hemanth Yamijala
>>> > <yh...@thoughtworks.com> wrote:
>>> >>
>>> >> I am using the same version of Hadoop as you.
>>> >>
>>> >> Can you look at something like Scribe, which AFAIK fits the use case
>>> you
>>> >> describe.
>>> >>
>>> >> Thanks
>>> >> Hemanth
>>> >>
>>> >>
>>> >> On Sun, Feb 24, 2013 at 3:33 AM, Lucas Bernardi <lu...@gmail.com>
>>> wrote:
>>> >>>
>>> >>> That is exactly what I did, but in my case, it is like if the file
>>> were
>>> >>> empty, the job counters say no bytes read.
>>> >>> I'm using hadoop 1.0.3 which version did you try?
>>> >>>
>>> >>> What I'm trying to do is just some basic analyitics on a product
>>> search
>>> >>> system. There is a search service, every time a user performs a
>>> search, the
>>> >>> search string, and the results are stored in this file, and the file
>>> is
>>> >>> sync'ed. I'm actually using pig to do some basic counts, it doesn't
>>> work,
>>> >>> like I described, because the file looks empty for the map reduce
>>> >>> components. I thought it was about pig, but I wasn't sure, so I
>>> tried a
>>> >>> simple mr job, and used the word count to test the map reduce
>>> compoinents
>>> >>> actually see the sync'ed bytes.
>>> >>>
>>> >>> Of course if I close the file, everything works perfectly, but I
>>> don't
>>> >>> want to close the file every while, since that means I should create
>>> another
>>> >>> one (since no append support), and that would end up with too many
>>> tiny
>>> >>> files, something we know is bad for mr performance, and I don't want
>>> to add
>>> >>> more parts to this (like a file merging tool). I think unign sync is
>>> a clean
>>> >>> solution, since we don't care about writing performance, so I'd
>>> rather keep
>>> >>> it like this if I can make it work.
>>> >>>
>>> >>> Any idea besides hadoop version?
>>> >>>
>>> >>> Thanks!
>>> >>>
>>> >>> Lucas
>>> >>>
>>> >>>
>>> >>>
>>> >>> On Sat, Feb 23, 2013 at 11:54 AM, Hemanth Yamijala
>>> >>> <yh...@thoughtworks.com> wrote:
>>> >>>>
>>> >>>> Hi Lucas,
>>> >>>>
>>> >>>> I tried something like this but got different results.
>>> >>>>
>>> >>>> I wrote code that opened a file on HDFS, wrote a line and called
>>> sync.
>>> >>>> Without closing the file, I ran a wordcount with that file as
>>> input. It did
>>> >>>> work fine and was able to count the words that were sync'ed (even
>>> though the
>>> >>>> file length seems to come as 0 like you noted in fs -ls)
>>> >>>>
>>> >>>> So, not sure what's happening in your case. In the MR job, do the
>>> job
>>> >>>> counters indicate no bytes were read ?
>>> >>>>
>>> >>>> On a different note though, if you can describe a little more what
>>> you
>>> >>>> are trying to accomplish, we could probably work a better solution.
>>> >>>>
>>> >>>> Thanks
>>> >>>> hemanth
>>> >>>>
>>> >>>>
>>> >>>> On Sat, Feb 23, 2013 at 7:15 PM, Lucas Bernardi <lu...@gmail.com>
>>> >>>> wrote:
>>> >>>>>
>>> >>>>> Helo Hemanth, thanks for answering.
>>> >>>>> The file is open by a separate process not map reduce related at
>>> all.
>>> >>>>> You can think of it as a servlet, receiving requests, and writing
>>> them to
>>> >>>>> this file, every time a request is received it is written and
>>> >>>>> org.apache.hadoop.fs.FSDataOutputStream.sync() is invoked.
>>> >>>>>
>>> >>>>> At the same time, I want to run a map reduce job over this file.
>>> Simply
>>> >>>>> runing the word count example doesn't seem to work, it is like if
>>> the file
>>> >>>>> were empty.
>>> >>>>>
>>> >>>>> hadoop -fs -tail works just fine, and reading the file using
>>> >>>>> org.apache.hadoop.fs.FSDataInputStream also works ok.
>>> >>>>>
>>> >>>>> Last thing, the web interface doesn't see the contents, and command
>>> >>>>> hadoop -fs -ls says the file is empty.
>>> >>>>>
>>> >>>>> What am I doing wrong?
>>> >>>>>
>>> >>>>> Thanks!
>>> >>>>>
>>> >>>>> Lucas
>>> >>>>>
>>> >>>>>
>>> >>>>>
>>> >>>>> On Sat, Feb 23, 2013 at 4:37 AM, Hemanth Yamijala
>>> >>>>> <yh...@thoughtworks.com> wrote:
>>> >>>>>>
>>> >>>>>> Could you please clarify, are you opening the file in your mapper
>>> code
>>> >>>>>> and reading from there ?
>>> >>>>>>
>>> >>>>>> Thanks
>>> >>>>>> Hemanth
>>> >>>>>>
>>> >>>>>> On Friday, February 22, 2013, Lucas Bernardi wrote:
>>> >>>>>>>
>>> >>>>>>> Hello there, I'm trying to use hadoop map reduce to process an
>>> open
>>> >>>>>>> file. The writing process, writes a line to the file and syncs
>>> the file to
>>> >>>>>>> readers.
>>> >>>>>>> (org.apache.hadoop.fs.FSDataOutputStream.sync()).
>>> >>>>>>>
>>> >>>>>>> If I try to read the file from another process, it works fine, at
>>> >>>>>>> least using
>>> >>>>>>> org.apache.hadoop.fs.FSDataInputStream.
>>> >>>>>>>
>>> >>>>>>> hadoop -fs -tail also works just fine
>>> >>>>>>>
>>> >>>>>>> But it looks like map reduce doesn't read any data. I tried
>>> using the
>>> >>>>>>> word count example, same thing, it is like if the file were
>>> empty for the
>>> >>>>>>> map reduce framework.
>>> >>>>>>>
>>> >>>>>>> I'm using hadoop 1.0.3. and pig 0.10.0
>>> >>>>>>>
>>> >>>>>>> I need some help around this.
>>> >>>>>>>
>>> >>>>>>> Thanks!
>>> >>>>>>>
>>> >>>>>>> Lucas
>>> >>>>>
>>> >>>>>
>>> >>>>
>>> >>>
>>> >>
>>> >
>>>
>>>
>>>
>>> --
>>> Harsh J
>>>
>>
>>
>

Re: map reduce and sync

Posted by Lucas Bernardi <lu...@gmail.com>.
It looks like getSplits in FileInputFormat is ignoring 0-length files...
That would also explain the weird behavior of tail, which seems to always
jump to the start, since the file length is 0.

So, basically, sync doesn't update the file length, and any code based on
file size is unreliable.

Am I right?

How can I get around this?

Lucas
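
A small sketch that makes the mismatch visible (illustrative only, not from the thread; the path is a placeholder) is to compare the length FileStatus reports with the bytes that can actually be read back after sync():

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class SyncedLengthCheck {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path p = new Path("/logs/searches.log");       // placeholder path to the open, sync()'d file
        long reported = fs.getFileStatus(p).getLen();  // stays at 0 until the writer closes the file
        long readable = 0;
        FSDataInputStream in = fs.open(p);
        byte[] buf = new byte[4096];
        for (int n = in.read(buf); n > 0; n = in.read(buf)) {
            readable += n;                             // bytes made visible to readers by sync()
        }
        in.close();
        System.out.println("getLen()=" + reported + ", readable=" + readable);
    }
}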

On Mon, Feb 25, 2013 at 12:38 PM, Lucas Bernardi <lu...@gmail.com> wrote:

> I didn't notice, thanks for the heads up.
>
>
> On Mon, Feb 25, 2013 at 4:31 AM, Harsh J <ha...@cloudera.com> wrote:
>
>> Just an aside (I've not tried to look at the original issue yet), but
>> Scribe has not been maintained (nor has seen a release) in over a year
>> now -- looking at the commit history. Same case with both Facebook and
>> Twitter's fork.
>>
>> On Mon, Feb 25, 2013 at 7:16 AM, Lucas Bernardi <lu...@gmail.com> wrote:
>> > Yeah I looked at scribe, looks good but sounds like too much for my
>> problem.
>> > I'd rather make it work the simple way. Could you pleas post your code,
>> may
>> > be I'm doing something wrong on the sync side. Maybe a buffer size,
>> block
>> > size or some other  parameter is different...
>> >
>> > Thanks!
>> > Lucas
>> >
>> >
>> > On Sun, Feb 24, 2013 at 10:31 PM, Hemanth Yamijala
>> > <yh...@thoughtworks.com> wrote:
>> >>
>> >> I am using the same version of Hadoop as you.
>> >>
>> >> Can you look at something like Scribe, which AFAIK fits the use case
>> you
>> >> describe.
>> >>
>> >> Thanks
>> >> Hemanth
>> >>
>> >>
>> >> On Sun, Feb 24, 2013 at 3:33 AM, Lucas Bernardi <lu...@gmail.com>
>> wrote:
>> >>>
>> >>> That is exactly what I did, but in my case, it is like if the file
>> were
>> >>> empty, the job counters say no bytes read.
>> >>> I'm using hadoop 1.0.3 which version did you try?
>> >>>
>> >>> What I'm trying to do is just some basic analyitics on a product
>> search
>> >>> system. There is a search service, every time a user performs a
>> search, the
>> >>> search string, and the results are stored in this file, and the file
>> is
>> >>> sync'ed. I'm actually using pig to do some basic counts, it doesn't
>> work,
>> >>> like I described, because the file looks empty for the map reduce
>> >>> components. I thought it was about pig, but I wasn't sure, so I tried
>> a
>> >>> simple mr job, and used the word count to test the map reduce
>> compoinents
>> >>> actually see the sync'ed bytes.
>> >>>
>> >>> Of course if I close the file, everything works perfectly, but I don't
>> >>> want to close the file every while, since that means I should create
>> another
>> >>> one (since no append support), and that would end up with too many
>> tiny
>> >>> files, something we know is bad for mr performance, and I don't want
>> to add
>> >>> more parts to this (like a file merging tool). I think unign sync is
>> a clean
>> >>> solution, since we don't care about writing performance, so I'd
>> rather keep
>> >>> it like this if I can make it work.
>> >>>
>> >>> Any idea besides hadoop version?
>> >>>
>> >>> Thanks!
>> >>>
>> >>> Lucas
>> >>>
>> >>>
>> >>>
>> >>> On Sat, Feb 23, 2013 at 11:54 AM, Hemanth Yamijala
>> >>> <yh...@thoughtworks.com> wrote:
>> >>>>
>> >>>> Hi Lucas,
>> >>>>
>> >>>> I tried something like this but got different results.
>> >>>>
>> >>>> I wrote code that opened a file on HDFS, wrote a line and called
>> sync.
>> >>>> Without closing the file, I ran a wordcount with that file as input.
>> It did
>> >>>> work fine and was able to count the words that were sync'ed (even
>> though the
>> >>>> file length seems to come as 0 like you noted in fs -ls)
>> >>>>
>> >>>> So, not sure what's happening in your case. In the MR job, do the job
>> >>>> counters indicate no bytes were read ?
>> >>>>
>> >>>> On a different note though, if you can describe a little more what
>> you
>> >>>> are trying to accomplish, we could probably work a better solution.
>> >>>>
>> >>>> Thanks
>> >>>> hemanth
>> >>>>
>> >>>>
>> >>>> On Sat, Feb 23, 2013 at 7:15 PM, Lucas Bernardi <lu...@gmail.com>
>> >>>> wrote:
>> >>>>>
>> >>>>> Helo Hemanth, thanks for answering.
>> >>>>> The file is open by a separate process not map reduce related at
>> all.
>> >>>>> You can think of it as a servlet, receiving requests, and writing
>> them to
>> >>>>> this file, every time a request is received it is written and
>> >>>>> org.apache.hadoop.fs.FSDataOutputStream.sync() is invoked.
>> >>>>>
>> >>>>> At the same time, I want to run a map reduce job over this file.
>> Simply
>> >>>>> runing the word count example doesn't seem to work, it is like if
>> the file
>> >>>>> were empty.
>> >>>>>
>> >>>>> hadoop -fs -tail works just fine, and reading the file using
>> >>>>> org.apache.hadoop.fs.FSDataInputStream also works ok.
>> >>>>>
>> >>>>> Last thing, the web interface doesn't see the contents, and command
>> >>>>> hadoop -fs -ls says the file is empty.
>> >>>>>
>> >>>>> What am I doing wrong?
>> >>>>>
>> >>>>> Thanks!
>> >>>>>
>> >>>>> Lucas
>> >>>>>
>> >>>>>
>> >>>>>
>> >>>>> On Sat, Feb 23, 2013 at 4:37 AM, Hemanth Yamijala
>> >>>>> <yh...@thoughtworks.com> wrote:
>> >>>>>>
>> >>>>>> Could you please clarify, are you opening the file in your mapper
>> code
>> >>>>>> and reading from there ?
>> >>>>>>
>> >>>>>> Thanks
>> >>>>>> Hemanth
>> >>>>>>
>> >>>>>> On Friday, February 22, 2013, Lucas Bernardi wrote:
>> >>>>>>>
>> >>>>>>> Hello there, I'm trying to use hadoop map reduce to process an
>> open
>> >>>>>>> file. The writing process, writes a line to the file and syncs
>> the file to
>> >>>>>>> readers.
>> >>>>>>> (org.apache.hadoop.fs.FSDataOutputStream.sync()).
>> >>>>>>>
>> >>>>>>> If I try to read the file from another process, it works fine, at
>> >>>>>>> least using
>> >>>>>>> org.apache.hadoop.fs.FSDataInputStream.
>> >>>>>>>
>> >>>>>>> hadoop -fs -tail also works just fine
>> >>>>>>>
>> >>>>>>> But it looks like map reduce doesn't read any data. I tried using
>> the
>> >>>>>>> word count example, same thing, it is like if the file were empty
>> for the
>> >>>>>>> map reduce framework.
>> >>>>>>>
>> >>>>>>> I'm using hadoop 1.0.3. and pig 0.10.0
>> >>>>>>>
>> >>>>>>> I need some help around this.
>> >>>>>>>
>> >>>>>>> Thanks!
>> >>>>>>>
>> >>>>>>> Lucas
>> >>>>>
>> >>>>>
>> >>>>
>> >>>
>> >>
>> >
>>
>>
>>
>> --
>> Harsh J
>>
>
>

Re: map reduce and sync

Posted by Lucas Bernardi <lu...@gmail.com>.
It looks like getSplits in FileInputFormat is ignoring 0 lenght files....
That also would explain the weird behavior of tail, which seems to always
jump to the start since file length is 0.

So, basically, sync doesn't update file length, any code based on file
size, is unreliable.

Am I right?

How can I get around this?

Lucas

On Mon, Feb 25, 2013 at 12:38 PM, Lucas Bernardi <lu...@gmail.com> wrote:

> I didn't notice, thanks for the heads up.
>
>
> On Mon, Feb 25, 2013 at 4:31 AM, Harsh J <ha...@cloudera.com> wrote:
>
>> Just an aside (I've not tried to look at the original issue yet), but
>> Scribe has not been maintained (nor has seen a release) in over a year
>> now -- looking at the commit history. Same case with both Facebook and
>> Twitter's fork.
>>
>> On Mon, Feb 25, 2013 at 7:16 AM, Lucas Bernardi <lu...@gmail.com> wrote:
>> > Yeah I looked at scribe, looks good but sounds like too much for my
>> problem.
>> > I'd rather make it work the simple way. Could you pleas post your code,
>> may
>> > be I'm doing something wrong on the sync side. Maybe a buffer size,
>> block
>> > size or some other  parameter is different...
>> >
>> > Thanks!
>> > Lucas
>> >
>> >
>> > On Sun, Feb 24, 2013 at 10:31 PM, Hemanth Yamijala
>> > <yh...@thoughtworks.com> wrote:
>> >>
>> >> I am using the same version of Hadoop as you.
>> >>
>> >> Can you look at something like Scribe, which AFAIK fits the use case
>> you
>> >> describe.
>> >>
>> >> Thanks
>> >> Hemanth
>> >>
>> >>
>> >> On Sun, Feb 24, 2013 at 3:33 AM, Lucas Bernardi <lu...@gmail.com>
>> wrote:
>> >>>
>> >>> That is exactly what I did, but in my case, it is like if the file
>> were
>> >>> empty, the job counters say no bytes read.
>> >>> I'm using hadoop 1.0.3 which version did you try?
>> >>>
>> >>> What I'm trying to do is just some basic analyitics on a product
>> search
>> >>> system. There is a search service, every time a user performs a
>> search, the
>> >>> search string, and the results are stored in this file, and the file
>> is
>> >>> sync'ed. I'm actually using pig to do some basic counts, it doesn't
>> work,
>> >>> like I described, because the file looks empty for the map reduce
>> >>> components. I thought it was about pig, but I wasn't sure, so I tried
>> a
>> >>> simple mr job, and used the word count to test the map reduce
>> compoinents
>> >>> actually see the sync'ed bytes.
>> >>>
>> >>> Of course if I close the file, everything works perfectly, but I don't
>> >>> want to close the file every while, since that means I should create
>> another
>> >>> one (since no append support), and that would end up with too many
>> tiny
>> >>> files, something we know is bad for mr performance, and I don't want
>> to add
>> >>> more parts to this (like a file merging tool). I think unign sync is
>> a clean
>> >>> solution, since we don't care about writing performance, so I'd
>> rather keep
>> >>> it like this if I can make it work.
>> >>>
>> >>> Any idea besides hadoop version?
>> >>>
>> >>> Thanks!
>> >>>
>> >>> Lucas
>> >>>
>> >>>
>> >>>
>> >>> On Sat, Feb 23, 2013 at 11:54 AM, Hemanth Yamijala
>> >>> <yh...@thoughtworks.com> wrote:
>> >>>>
>> >>>> Hi Lucas,
>> >>>>
>> >>>> I tried something like this but got different results.
>> >>>>
>> >>>> I wrote code that opened a file on HDFS, wrote a line and called
>> sync.
>> >>>> Without closing the file, I ran a wordcount with that file as input.
>> It did
>> >>>> work fine and was able to count the words that were sync'ed (even
>> though the
>> >>>> file length seems to come as 0 like you noted in fs -ls)
>> >>>>
>> >>>> So, not sure what's happening in your case. In the MR job, do the job
>> >>>> counters indicate no bytes were read ?
>> >>>>
>> >>>> On a different note though, if you can describe a little more what
>> you
>> >>>> are trying to accomplish, we could probably work a better solution.
>> >>>>
>> >>>> Thanks
>> >>>> hemanth
>> >>>>
>> >>>>
>> >>>> On Sat, Feb 23, 2013 at 7:15 PM, Lucas Bernardi <lu...@gmail.com>
>> >>>> wrote:
>> >>>>>
>> >>>>> Hello Hemanth, thanks for answering.
>> >>>>> The file is open by a separate process not map reduce related at
>> all.
>> >>>>> You can think of it as a servlet, receiving requests, and writing
>> them to
>> >>>>> this file, every time a request is received it is written and
>> >>>>> org.apache.hadoop.fs.FSDataOutputStream.sync() is invoked.
>> >>>>>
>> >>>>> At the same time, I want to run a map reduce job over this file.
>> Simply
>> >>>>> running the word count example doesn't seem to work, it is like if
>> the file
>> >>>>> were empty.
>> >>>>>
>> >>>>> hadoop -fs -tail works just fine, and reading the file using
>> >>>>> org.apache.hadoop.fs.FSDataInputStream also works ok.
>> >>>>>
>> >>>>> Last thing, the web interface doesn't see the contents, and command
>> >>>>> hadoop -fs -ls says the file is empty.
>> >>>>>
>> >>>>> What am I doing wrong?
>> >>>>>
>> >>>>> Thanks!
>> >>>>>
>> >>>>> Lucas
>> >>>>>
>> >>>>>
>> >>>>>
>> >>>>> On Sat, Feb 23, 2013 at 4:37 AM, Hemanth Yamijala
>> >>>>> <yh...@thoughtworks.com> wrote:
>> >>>>>>
>> >>>>>> Could you please clarify, are you opening the file in your mapper
>> code
>> >>>>>> and reading from there ?
>> >>>>>>
>> >>>>>> Thanks
>> >>>>>> Hemanth
>> >>>>>>
>> >>>>>> On Friday, February 22, 2013, Lucas Bernardi wrote:
>> >>>>>>>
>> >>>>>>> Hello there, I'm trying to use hadoop map reduce to process an
>> open
>> >>>>>>> file. The writing process, writes a line to the file and syncs
>> the file to
>> >>>>>>> readers.
>> >>>>>>> (org.apache.hadoop.fs.FSDataOutputStream.sync()).
>> >>>>>>>
>> >>>>>>> If I try to read the file from another process, it works fine, at
>> >>>>>>> least using
>> >>>>>>> org.apache.hadoop.fs.FSDataInputStream.
>> >>>>>>>
>> >>>>>>> hadoop -fs -tail also works just fine
>> >>>>>>>
>> >>>>>>> But it looks like map reduce doesn't read any data. I tried using
>> the
>> >>>>>>> word count example, same thing, it is like if the file were empty
>> for the
>> >>>>>>> map reduce framework.
>> >>>>>>>
>> >>>>>>> I'm using hadoop 1.0.3. and pig 0.10.0
>> >>>>>>>
>> >>>>>>> I need some help around this.
>> >>>>>>>
>> >>>>>>> Thanks!
>> >>>>>>>
>> >>>>>>> Lucas
>> >>>>>
>> >>>>>
>> >>>>
>> >>>
>> >>
>> >
>>
>>
>>
>> --
>> Harsh J
>>
>
>

Re: map reduce and sync

Posted by Lucas Bernardi <lu...@gmail.com>.
I didn't notice, thanks for the heads up.

On Mon, Feb 25, 2013 at 4:31 AM, Harsh J <ha...@cloudera.com> wrote:

> Just an aside (I've not tried to look at the original issue yet), but
> Scribe has not been maintained (nor has seen a release) in over a year
> now -- looking at the commit history. Same case with both Facebook and
> Twitter's fork.
>
> On Mon, Feb 25, 2013 at 7:16 AM, Lucas Bernardi <lu...@gmail.com> wrote:
> > Yeah I looked at scribe, looks good but sounds like too much for my
> problem.
> > I'd rather make it work the simple way. Could you please post your code,
> may
> > be I'm doing something wrong on the sync side. Maybe a buffer size, block
> > size or some other  parameter is different...
> >
> > Thanks!
> > Lucas
> >
> >
> > On Sun, Feb 24, 2013 at 10:31 PM, Hemanth Yamijala
> > <yh...@thoughtworks.com> wrote:
> >>
> >> I am using the same version of Hadoop as you.
> >>
> >> Can you look at something like Scribe, which AFAIK fits the use case you
> >> describe.
> >>
> >> Thanks
> >> Hemanth
> >>
> >>
> >> On Sun, Feb 24, 2013 at 3:33 AM, Lucas Bernardi <lu...@gmail.com>
> wrote:
> >>>
> >>> That is exactly what I did, but in my case, it is like if the file were
> >>> empty, the job counters say no bytes read.
> >>> I'm using hadoop 1.0.3 which version did you try?
> >>>
> >>> What I'm trying to do is just some basic analytics on a product search
> >>> system. There is a search service, every time a user performs a
> search, the
> >>> search string, and the results are stored in this file, and the file is
> >>> sync'ed. I'm actually using pig to do some basic counts, it doesn't
> work,
> >>> like I described, because the file looks empty for the map reduce
> >>> components. I thought it was about pig, but I wasn't sure, so I tried a
> >>> simple mr job, and used the word count to test the map reduce
> components
> >>> actually see the sync'ed bytes.
> >>>
> >>> Of course if I close the file, everything works perfectly, but I don't
> >>> want to close the file every while, since that means I should create
> another
> >>> one (since no append support), and that would end up with too many tiny
> >>> files, something we know is bad for mr performance, and I don't want
> to add
> >>> more parts to this (like a file merging tool). I think using sync is a
> clean
> >>> solution, since we don't care about writing performance, so I'd rather
> keep
> >>> it like this if I can make it work.
> >>>
> >>> Any idea besides hadoop version?
> >>>
> >>> Thanks!
> >>>
> >>> Lucas
> >>>
> >>>
> >>>
> >>> On Sat, Feb 23, 2013 at 11:54 AM, Hemanth Yamijala
> >>> <yh...@thoughtworks.com> wrote:
> >>>>
> >>>> Hi Lucas,
> >>>>
> >>>> I tried something like this but got different results.
> >>>>
> >>>> I wrote code that opened a file on HDFS, wrote a line and called sync.
> >>>> Without closing the file, I ran a wordcount with that file as input.
> It did
> >>>> work fine and was able to count the words that were sync'ed (even
> though the
> >>>> file length seems to come as 0 like you noted in fs -ls)
> >>>>
> >>>> So, not sure what's happening in your case. In the MR job, do the job
> >>>> counters indicate no bytes were read ?
> >>>>
> >>>> On a different note though, if you can describe a little more what you
> >>>> are trying to accomplish, we could probably work a better solution.
> >>>>
> >>>> Thanks
> >>>> hemanth
> >>>>
> >>>>
> >>>> On Sat, Feb 23, 2013 at 7:15 PM, Lucas Bernardi <lu...@gmail.com>
> >>>> wrote:
> >>>>>
> >>>>> Hello Hemanth, thanks for answering.
> >>>>> The file is open by a separate process not map reduce related at all.
> >>>>> You can think of it as a servlet, receiving requests, and writing
> them to
> >>>>> this file, every time a request is received it is written and
> >>>>> org.apache.hadoop.fs.FSDataOutputStream.sync() is invoked.
> >>>>>
> >>>>> At the same time, I want to run a map reduce job over this file.
> Simply
> >>>>> running the word count example doesn't seem to work, it is like if
> the file
> >>>>> were empty.
> >>>>>
> >>>>> hadoop -fs -tail works just fine, and reading the file using
> >>>>> org.apache.hadoop.fs.FSDataInputStream also works ok.
> >>>>>
> >>>>> Last thing, the web interface doesn't see the contents, and command
> >>>>> hadoop -fs -ls says the file is empty.
> >>>>>
> >>>>> What am I doing wrong?
> >>>>>
> >>>>> Thanks!
> >>>>>
> >>>>> Lucas
> >>>>>
> >>>>>
> >>>>>
> >>>>> On Sat, Feb 23, 2013 at 4:37 AM, Hemanth Yamijala
> >>>>> <yh...@thoughtworks.com> wrote:
> >>>>>>
> >>>>>> Could you please clarify, are you opening the file in your mapper
> code
> >>>>>> and reading from there ?
> >>>>>>
> >>>>>> Thanks
> >>>>>> Hemanth
> >>>>>>
> >>>>>> On Friday, February 22, 2013, Lucas Bernardi wrote:
> >>>>>>>
> >>>>>>> Hello there, I'm trying to use hadoop map reduce to process an open
> >>>>>>> file. The writing process, writes a line to the file and syncs the
> file to
> >>>>>>> readers.
> >>>>>>> (org.apache.hadoop.fs.FSDataOutputStream.sync()).
> >>>>>>>
> >>>>>>> If I try to read the file from another process, it works fine, at
> >>>>>>> least using
> >>>>>>> org.apache.hadoop.fs.FSDataInputStream.
> >>>>>>>
> >>>>>>> hadoop -fs -tail also works just fine
> >>>>>>>
> >>>>>>> But it looks like map reduce doesn't read any data. I tried using
> the
> >>>>>>> word count example, same thing, it is like if the file were empty
> for the
> >>>>>>> map reduce framework.
> >>>>>>>
> >>>>>>> I'm using hadoop 1.0.3. and pig 0.10.0
> >>>>>>>
> >>>>>>> I need some help around this.
> >>>>>>>
> >>>>>>> Thanks!
> >>>>>>>
> >>>>>>> Lucas
> >>>>>
> >>>>>
> >>>>
> >>>
> >>
> >
>
>
>
> --
> Harsh J
>

Re: map reduce and sync

Posted by Harsh J <ha...@cloudera.com>.
Just an aside (I've not tried to look at the original issue yet), but
Scribe has not been maintained (nor has seen a release) in over a year
now -- looking at the commit history. Same case with both Facebook and
Twitter's fork.

On Mon, Feb 25, 2013 at 7:16 AM, Lucas Bernardi <lu...@gmail.com> wrote:
> Yeah I looked at scribe, looks good but sounds like too much for my problem.
> I'd rather make it work the simple way. Could you please post your code, may
> be I'm doing something wrong on the sync side. Maybe a buffer size, block
> size or some other  parameter is different...
>
> Thanks!
> Lucas
>
>
> On Sun, Feb 24, 2013 at 10:31 PM, Hemanth Yamijala
> <yh...@thoughtworks.com> wrote:
>>
>> I am using the same version of Hadoop as you.
>>
>> Can you look at something like Scribe, which AFAIK fits the use case you
>> describe.
>>
>> Thanks
>> Hemanth
>>
>>
>> On Sun, Feb 24, 2013 at 3:33 AM, Lucas Bernardi <lu...@gmail.com> wrote:
>>>
>>> That is exactly what I did, but in my case, it is like if the file were
>>> empty, the job counters say no bytes read.
>>> I'm using hadoop 1.0.3 which version did you try?
>>>
>>> What I'm trying to do is just some basic analytics on a product search
>>> system. There is a search service, every time a user performs a search, the
>>> search string, and the results are stored in this file, and the file is
>>> sync'ed. I'm actually using pig to do some basic counts, it doesn't work,
>>> like I described, because the file looks empty for the map reduce
>>> components. I thought it was about pig, but I wasn't sure, so I tried a
>>> simple mr job, and used the word count to test the map reduce components
>>> actually see the sync'ed bytes.
>>>
>>> Of course if I close the file, everything works perfectly, but I don't
>>> want to close the file every while, since that means I should create another
>>> one (since no append support), and that would end up with too many tiny
>>> files, something we know is bad for mr performance, and I don't want to add
>>> more parts to this (like a file merging tool). I think using sync is a clean
>>> solution, since we don't care about writing performance, so I'd rather keep
>>> it like this if I can make it work.
>>>
>>> Any idea besides hadoop version?
>>>
>>> Thanks!
>>>
>>> Lucas
>>>
>>>
>>>
>>> On Sat, Feb 23, 2013 at 11:54 AM, Hemanth Yamijala
>>> <yh...@thoughtworks.com> wrote:
>>>>
>>>> Hi Lucas,
>>>>
>>>> I tried something like this but got different results.
>>>>
>>>> I wrote code that opened a file on HDFS, wrote a line and called sync.
>>>> Without closing the file, I ran a wordcount with that file as input. It did
>>>> work fine and was able to count the words that were sync'ed (even though the
>>>> file length seems to come as 0 like you noted in fs -ls)
>>>>
>>>> So, not sure what's happening in your case. In the MR job, do the job
>>>> counters indicate no bytes were read ?
>>>>
>>>> On a different note though, if you can describe a little more what you
>>>> are trying to accomplish, we could probably work a better solution.
>>>>
>>>> Thanks
>>>> hemanth
>>>>
>>>>
>>>> On Sat, Feb 23, 2013 at 7:15 PM, Lucas Bernardi <lu...@gmail.com>
>>>> wrote:
>>>>>
>>>>> Hello Hemanth, thanks for answering.
>>>>> The file is open by a separate process not map reduce related at all.
>>>>> You can think of it as a servlet, receiving requests, and writing them to
>>>>> this file, every time a request is received it is written and
>>>>> org.apache.hadoop.fs.FSDataOutputStream.sync() is invoked.
>>>>>
>>>>> At the same time, I want to run a map reduce job over this file. Simply
>>>>> running the word count example doesn't seem to work, it is like if the file
>>>>> were empty.
>>>>>
>>>>> hadoop -fs -tail works just fine, and reading the file using
>>>>> org.apache.hadoop.fs.FSDataInputStream also works ok.
>>>>>
>>>>> Last thing, the web interface doesn't see the contents, and command
>>>>> hadoop -fs -ls says the file is empty.
>>>>>
>>>>> What am I doing wrong?
>>>>>
>>>>> Thanks!
>>>>>
>>>>> Lucas
>>>>>
>>>>>
>>>>>
>>>>> On Sat, Feb 23, 2013 at 4:37 AM, Hemanth Yamijala
>>>>> <yh...@thoughtworks.com> wrote:
>>>>>>
>>>>>> Could you please clarify, are you opening the file in your mapper code
>>>>>> and reading from there ?
>>>>>>
>>>>>> Thanks
>>>>>> Hemanth
>>>>>>
>>>>>> On Friday, February 22, 2013, Lucas Bernardi wrote:
>>>>>>>
>>>>>>> Hello there, I'm trying to use hadoop map reduce to process an open
>>>>>>> file. The writing process, writes a line to the file and syncs the file to
>>>>>>> readers.
>>>>>>> (org.apache.hadoop.fs.FSDataOutputStream.sync()).
>>>>>>>
>>>>>>> If I try to read the file from another process, it works fine, at
>>>>>>> least using
>>>>>>> org.apache.hadoop.fs.FSDataInputStream.
>>>>>>>
>>>>>>> hadoop -fs -tail also works just fine
>>>>>>>
>>>>>>> But it looks like map reduce doesn't read any data. I tried using the
>>>>>>> word count example, same thing, it is like if the file were empty for the
>>>>>>> map reduce framework.
>>>>>>>
>>>>>>> I'm using hadoop 1.0.3. and pig 0.10.0
>>>>>>>
>>>>>>> I need some help around this.
>>>>>>>
>>>>>>> Thanks!
>>>>>>>
>>>>>>> Lucas
>>>>>
>>>>>
>>>>
>>>
>>
>



--
Harsh J

Re: map reduce and sync

Posted by Lucas Bernardi <lu...@gmail.com>.
Yeah, I looked at Scribe; it looks good but sounds like too much for my
problem. I'd rather make it work the simple way. Could you please post your
code? Maybe I'm doing something wrong on the sync side. Maybe a buffer
size, block size or some other parameter is different...
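
For reference, the writer side here boils down to something like the sketch
below. It is illustrative only (a made-up path and log line, not the actual
service code), using just the plain Hadoop 1.0.3 FileSystem API:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class SyncingWriter {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);
            // illustrative path; the real service writes one search request per line
            FSDataOutputStream out = fs.create(new Path("/logs/searches.log"));
            out.writeBytes("query=foo results=3\n");
            // push the written bytes to the datanodes so other readers can see them
            out.sync();
            // the stream is intentionally left open; close() is never called here
            Thread.sleep(60000);
        }
    }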

Thanks!
Lucas

On Sun, Feb 24, 2013 at 10:31 PM, Hemanth Yamijala <
yhemanth@thoughtworks.com> wrote:

> I am using the same version of Hadoop as you.
>
> Can you look at something like Scribe, which AFAIK fits the use case you
> describe.
>
> Thanks
> Hemanth
>
>
> On Sun, Feb 24, 2013 at 3:33 AM, Lucas Bernardi <lu...@gmail.com> wrote:
>
>> That is exactly what I did, but in my case, it is like if the file were
>> empty, the job counters say no bytes read.
>> I'm using hadoop 1.0.3 which version did you try?
>>
>> What I'm trying to do is just some basic analyitics on a product search
>> system. There is a search service, every time a user performs a search, the
>> search string, and the results are stored in this file, and the file is
>> sync'ed. I'm actually using pig to do some basic counts, it doesn't work,
>> like I described, because the file looks empty for the map reduce
>> components. I thought it was about pig, but I wasn't sure, so I tried a
>> simple mr job, and used the word count to test the map reduce compoinents
>> actually see the sync'ed bytes.
>>
>> Of course if I close the file, everything works perfectly, but I don't
>> want to close the file every while, since that means I should create
>> another one (since no append support), and that would end up with too many
>> tiny files, something we know is bad for mr performance, and I don't want
>> to add more parts to this (like a file merging tool). I think unign sync is
>> a clean solution, since we don't care about writing performance, so I'd
>> rather keep it like this if I can make it work.
>>
>> Any idea besides hadoop version?
>>
>> Thanks!
>>
>> Lucas
>>
>>
>>
>> On Sat, Feb 23, 2013 at 11:54 AM, Hemanth Yamijala <
>> yhemanth@thoughtworks.com> wrote:
>>
>>> Hi Lucas,
>>>
>>> I tried something like this but got different results.
>>>
>>> I wrote code that opened a file on HDFS, wrote a line and called sync.
>>> Without closing the file, I ran a wordcount with that file as input. It did
>>> work fine and was able to count the words that were sync'ed (even though
>>> the file length seems to come as 0 like you noted in fs -ls)
>>>
>>> So, not sure what's happening in your case. In the MR job, do the job
>>> counters indicate no bytes were read ?
>>>
>>> On a different note though, if you can describe a little more what you
>>> are trying to accomplish, we could probably work a better solution.
>>>
>>> Thanks
>>> hemanth
>>>
>>>
>>> On Sat, Feb 23, 2013 at 7:15 PM, Lucas Bernardi <lu...@gmail.com>wrote:
>>>
>>>> Helo Hemanth, thanks for answering.
>>>> The file is open by a separate process not map reduce related at all.
>>>> You can think of it as a servlet, receiving requests, and writing them to
>>>> this file, every time a request is received it is written and
>>>> org.apache.hadoop.fs.FSDataOutputStream.sync() is invoked.
>>>>
>>>> At the same time, I want to run a map reduce job over this file. Simply
>>>> runing the word count example doesn't seem to work, it is like if the file
>>>> were empty.
>>>>
>>>> hadoop -fs -tail works just fine, and reading the file using
>>>> org.apache.hadoop.fs.FSDataInputStream also works ok.
>>>>
>>>> Last thing, the web interface doesn't see the contents, and command
>>>> hadoop -fs -ls says the file is empty.
>>>>
>>>> What am I doing wrong?
>>>>
>>>> Thanks!
>>>>
>>>> Lucas
>>>>
>>>>
>>>>
>>>> On Sat, Feb 23, 2013 at 4:37 AM, Hemanth Yamijala <
>>>> yhemanth@thoughtworks.com> wrote:
>>>>
>>>>> Could you please clarify, are you opening the file in your mapper code
>>>>> and reading from there ?
>>>>>
>>>>> Thanks
>>>>> Hemanth
>>>>>
>>>>> On Friday, February 22, 2013, Lucas Bernardi wrote:
>>>>>
>>>>>> Hello there, I'm trying to use hadoop map reduce to process an open
>>>>>> file. The writing process, writes a line to the file and syncs the
>>>>>> file to readers.
>>>>>> (org.apache.hadoop.fs.FSDataOutputStream.sync()).
>>>>>>
>>>>>> If I try to read the file from another process, it works fine, at
>>>>>> least using
>>>>>> org.apache.hadoop.fs.FSDataInputStream.
>>>>>>
>>>>>> hadoop -fs -tail also works just fine
>>>>>>
>>>>>> But it looks like map reduce doesn't read any data. I tried using the
>>>>>> word count example, same thing, it is like if the file were empty for the
>>>>>> map reduce framework.
>>>>>>
>>>>>> I'm using hadoop 1.0.3. and pig 0.10.0
>>>>>>
>>>>>> I need some help around this.
>>>>>>
>>>>>> Thanks!
>>>>>>
>>>>>> Lucas
>>>>>>
>>>>>
>>>>
>>>
>>
>

Re: map reduce and sync

Posted by Hemanth Yamijala <yh...@thoughtworks.com>.
I am using the same version of Hadoop as you.

Could you look at something like Scribe? AFAIK it fits the use case you
describe.

Thanks
Hemanth


On Sun, Feb 24, 2013 at 3:33 AM, Lucas Bernardi <lu...@gmail.com> wrote:

> That is exactly what I did, but in my case, it is like if the file were
> empty, the job counters say no bytes read.
> I'm using hadoop 1.0.3 which version did you try?
>
> What I'm trying to do is just some basic analyitics on a product search
> system. There is a search service, every time a user performs a search, the
> search string, and the results are stored in this file, and the file is
> sync'ed. I'm actually using pig to do some basic counts, it doesn't work,
> like I described, because the file looks empty for the map reduce
> components. I thought it was about pig, but I wasn't sure, so I tried a
> simple mr job, and used the word count to test the map reduce compoinents
> actually see the sync'ed bytes.
>
> Of course if I close the file, everything works perfectly, but I don't
> want to close the file every while, since that means I should create
> another one (since no append support), and that would end up with too many
> tiny files, something we know is bad for mr performance, and I don't want
> to add more parts to this (like a file merging tool). I think unign sync is
> a clean solution, since we don't care about writing performance, so I'd
> rather keep it like this if I can make it work.
>
> Any idea besides hadoop version?
>
> Thanks!
>
> Lucas
>
>
>
> On Sat, Feb 23, 2013 at 11:54 AM, Hemanth Yamijala <
> yhemanth@thoughtworks.com> wrote:
>
>> Hi Lucas,
>>
>> I tried something like this but got different results.
>>
>> I wrote code that opened a file on HDFS, wrote a line and called sync.
>> Without closing the file, I ran a wordcount with that file as input. It did
>> work fine and was able to count the words that were sync'ed (even though
>> the file length seems to come as 0 like you noted in fs -ls)
>>
>> So, not sure what's happening in your case. In the MR job, do the job
>> counters indicate no bytes were read ?
>>
>> On a different note though, if you can describe a little more what you
>> are trying to accomplish, we could probably work a better solution.
>>
>> Thanks
>> hemanth
>>
>>
>> On Sat, Feb 23, 2013 at 7:15 PM, Lucas Bernardi <lu...@gmail.com> wrote:
>>
>>> Helo Hemanth, thanks for answering.
>>> The file is open by a separate process not map reduce related at all.
>>> You can think of it as a servlet, receiving requests, and writing them to
>>> this file, every time a request is received it is written and
>>> org.apache.hadoop.fs.FSDataOutputStream.sync() is invoked.
>>>
>>> At the same time, I want to run a map reduce job over this file. Simply
>>> runing the word count example doesn't seem to work, it is like if the file
>>> were empty.
>>>
>>> hadoop -fs -tail works just fine, and reading the file using
>>> org.apache.hadoop.fs.FSDataInputStream also works ok.
>>>
>>> Last thing, the web interface doesn't see the contents, and command
>>> hadoop -fs -ls says the file is empty.
>>>
>>> What am I doing wrong?
>>>
>>> Thanks!
>>>
>>> Lucas
>>>
>>>
>>>
>>> On Sat, Feb 23, 2013 at 4:37 AM, Hemanth Yamijala <
>>> yhemanth@thoughtworks.com> wrote:
>>>
>>>> Could you please clarify, are you opening the file in your mapper code
>>>> and reading from there ?
>>>>
>>>> Thanks
>>>> Hemanth
>>>>
>>>> On Friday, February 22, 2013, Lucas Bernardi wrote:
>>>>
>>>>> Hello there, I'm trying to use hadoop map reduce to process an open
>>>>> file. The writing process, writes a line to the file and syncs the
>>>>> file to readers.
>>>>> (org.apache.hadoop.fs.FSDataOutputStream.sync()).
>>>>>
>>>>> If I try to read the file from another process, it works fine, at
>>>>> least using
>>>>> org.apache.hadoop.fs.FSDataInputStream.
>>>>>
>>>>> hadoop -fs -tail also works just fine
>>>>>
>>>>> But it looks like map reduce doesn't read any data. I tried using the
>>>>> word count example, same thing, it is like if the file were empty for the
>>>>> map reduce framework.
>>>>>
>>>>> I'm using hadoop 1.0.3. and pig 0.10.0
>>>>>
>>>>> I need some help around this.
>>>>>
>>>>> Thanks!
>>>>>
>>>>> Lucas
>>>>>
>>>>
>>>
>>
>

Re: map reduce and sync

Posted by Lucas Bernardi <lu...@gmail.com>.
That is exactly what I did, but in my case it is as if the file were
empty: the job counters say no bytes were read.
I'm using hadoop 1.0.3; which version did you try?

What I'm trying to do is just some basic analytics on a product search
system. There is a search service; every time a user performs a search, the
search string and the results are stored in this file, and the file is
sync'ed. I'm actually using pig to do some basic counts. That doesn't work,
as I described, because the file looks empty to the map reduce
components. I thought it was a pig issue, but I wasn't sure, so I tried a
simple mr job and used the word count example to test whether the map
reduce components actually see the sync'ed bytes.
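
For completeness, this is the kind of direct read that does see the sync'ed
data while the file is still open, even though ls reports a zero length. It
is only a sketch (the path comes from the command line; this is not my exact
code):

    import java.io.BufferedReader;
    import java.io.InputStreamReader;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class TailOpenFile {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);
            Path path = new Path(args[0]);  // e.g. the log file the service keeps open
            // fs -ls and getFileStatus() still report a length of 0 for the open file...
            System.out.println("reported length: " + fs.getFileStatus(path).getLen());
            // ...but reading through FSDataInputStream returns the sync'ed bytes anyway
            FSDataInputStream in = fs.open(path);
            BufferedReader reader = new BufferedReader(new InputStreamReader(in));
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println(line);
            }
            reader.close();
        }
    }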

Of course if I close the file, everything works perfectly, but I don't want
to close the file every so often, since that would mean creating another one
(there is no append support), and that would end up with too many tiny files,
something we know is bad for mr performance, and I don't want to add more
parts to this (like a file merging tool). I think using sync is a clean
solution, since we don't care about writing performance, so I'd rather keep
it like this if I can make it work.

Any idea besides hadoop version?

Thanks!

Lucas



On Sat, Feb 23, 2013 at 11:54 AM, Hemanth Yamijala <
yhemanth@thoughtworks.com> wrote:

> Hi Lucas,
>
> I tried something like this but got different results.
>
> I wrote code that opened a file on HDFS, wrote a line and called sync.
> Without closing the file, I ran a wordcount with that file as input. It did
> work fine and was able to count the words that were sync'ed (even though
> the file length seems to come as 0 like you noted in fs -ls)
>
> So, not sure what's happening in your case. In the MR job, do the job
> counters indicate no bytes were read ?
>
> On a different note though, if you can describe a little more what you are
> trying to accomplish, we could probably work a better solution.
>
> Thanks
> hemanth
>
>
> On Sat, Feb 23, 2013 at 7:15 PM, Lucas Bernardi <lu...@gmail.com> wrote:
>
>> Helo Hemanth, thanks for answering.
>> The file is open by a separate process not map reduce related at all. You
>> can think of it as a servlet, receiving requests, and writing them to this
>> file, every time a request is received it is written and
>> org.apache.hadoop.fs.FSDataOutputStream.sync() is invoked.
>>
>> At the same time, I want to run a map reduce job over this file. Simply
>> runing the word count example doesn't seem to work, it is like if the file
>> were empty.
>>
>> hadoop -fs -tail works just fine, and reading the file using
>> org.apache.hadoop.fs.FSDataInputStream also works ok.
>>
>> Last thing, the web interface doesn't see the contents, and command
>> hadoop -fs -ls says the file is empty.
>>
>> What am I doing wrong?
>>
>> Thanks!
>>
>> Lucas
>>
>>
>>
>> On Sat, Feb 23, 2013 at 4:37 AM, Hemanth Yamijala <
>> yhemanth@thoughtworks.com> wrote:
>>
>>> Could you please clarify, are you opening the file in your mapper code
>>> and reading from there ?
>>>
>>> Thanks
>>> Hemanth
>>>
>>> On Friday, February 22, 2013, Lucas Bernardi wrote:
>>>
>>>> Hello there, I'm trying to use hadoop map reduce to process an open
>>>> file. The writing process, writes a line to the file and syncs the
>>>> file to readers.
>>>> (org.apache.hadoop.fs.FSDataOutputStream.sync()).
>>>>
>>>> If I try to read the file from another process, it works fine, at least
>>>> using
>>>> org.apache.hadoop.fs.FSDataInputStream.
>>>>
>>>> hadoop -fs -tail also works just fine
>>>>
>>>> But it looks like map reduce doesn't read any data. I tried using the
>>>> word count example, same thing, it is like if the file were empty for the
>>>> map reduce framework.
>>>>
>>>> I'm using hadoop 1.0.3. and pig 0.10.0
>>>>
>>>> I need some help around this.
>>>>
>>>> Thanks!
>>>>
>>>> Lucas
>>>>
>>>
>>
>

Re: map reduce and sync

Posted by Hemanth Yamijala <yh...@thoughtworks.com>.
Hi Lucas,

I tried something like this but got different results.

I wrote code that opened a file on HDFS, wrote a line and called sync.
Without closing the file, I ran a wordcount with that file as input. It did
work fine and was able to count the words that were sync'ed (even though
the file length comes up as 0 in fs -ls, like you noted).
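
In outline, the test was along these lines. This is a sketch rather than the
exact code, the paths are made up, and the examples jar name may differ per
install:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class SyncThenWordcount {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);
            FSDataOutputStream out = fs.create(new Path("/tmp/sync-test/input.txt"));
            out.writeBytes("hello hadoop hello sync\n");
            out.sync();  // flush to the datanodes, but do NOT close the stream
            // with this still running, from another shell:
            //   hadoop jar hadoop-examples-1.0.3.jar wordcount /tmp/sync-test /tmp/sync-out
            Thread.sleep(10 * 60 * 1000);  // keep the writer (and the open file) alive
        }
    }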

So, I'm not sure what's happening in your case. In the MR job, do the job
counters indicate no bytes were read?

On a different note though, if you can describe a little more what you are
trying to accomplish, we could probably work a better solution.

Thanks
hemanth


On Sat, Feb 23, 2013 at 7:15 PM, Lucas Bernardi <lu...@gmail.com> wrote:

> Helo Hemanth, thanks for answering.
> The file is open by a separate process not map reduce related at all. You
> can think of it as a servlet, receiving requests, and writing them to this
> file, every time a request is received it is written and
> org.apache.hadoop.fs.FSDataOutputStream.sync() is invoked.
>
> At the same time, I want to run a map reduce job over this file. Simply
> runing the word count example doesn't seem to work, it is like if the file
> were empty.
>
> hadoop -fs -tail works just fine, and reading the file using
> org.apache.hadoop.fs.FSDataInputStream also works ok.
>
> Last thing, the web interface doesn't see the contents, and command hadoop
> -fs -ls says the file is empty.
>
> What am I doing wrong?
>
> Thanks!
>
> Lucas
>
>
>
> On Sat, Feb 23, 2013 at 4:37 AM, Hemanth Yamijala <
> yhemanth@thoughtworks.com> wrote:
>
>> Could you please clarify, are you opening the file in your mapper code
>> and reading from there ?
>>
>> Thanks
>> Hemanth
>>
>> On Friday, February 22, 2013, Lucas Bernardi wrote:
>>
>>> Hello there, I'm trying to use hadoop map reduce to process an open
>>> file. The writing process, writes a line to the file and syncs the file
>>> to readers.
>>> (org.apache.hadoop.fs.FSDataOutputStream.sync()).
>>>
>>> If I try to read the file from another process, it works fine, at least
>>> using
>>> org.apache.hadoop.fs.FSDataInputStream.
>>>
>>> hadoop -fs -tail also works just fine
>>>
>>> But it looks like map reduce doesn't read any data. I tried using the
>>> word count example, same thing, it is like if the file were empty for the
>>> map reduce framework.
>>>
>>> I'm using hadoop 1.0.3. and pig 0.10.0
>>>
>>> I need some help around this.
>>>
>>> Thanks!
>>>
>>> Lucas
>>>
>>
>

Re: map reduce and sync

Posted by Lucas Bernardi <lu...@gmail.com>.
Hello Hemanth, thanks for answering.
The file is opened by a separate process, not related to map reduce at all.
You can think of it as a servlet receiving requests and writing them to this
file: every time a request is received, it is written and
org.apache.hadoop.fs.FSDataOutputStream.sync() is invoked.
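
In rough form, the writer does something like this for each request (the
class and method names here are made up just to illustrate the pattern):

    import java.io.IOException;
    import org.apache.hadoop.fs.FSDataOutputStream;

    public class RequestLogWriter {
      // created once with FileSystem.create(...) and kept open
      private final FSDataOutputStream out;

      public RequestLogWriter(FSDataOutputStream out) {
        this.out = out;
      }

      public synchronized void onRequest(String requestLine) throws IOException {
        out.writeBytes(requestLine + "\n");
        out.sync(); // make the bytes visible to readers without closing the file
      }
    }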

At the same time, I want to run a map reduce job over this file. Simply
running the word count example doesn't seem to work; it is as if the file
were empty.

hadoop fs -tail works just fine, and reading the file using
org.apache.hadoop.fs.FSDataInputStream also works OK.
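
For example, a reader along these lines sees the synced data without any
problem (the path is a placeholder):

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class SyncReadTest {
      public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        // placeholder path
        FSDataInputStream in = fs.open(new Path("/tmp/sync-test.txt"));
        BufferedReader reader = new BufferedReader(new InputStreamReader(in));
        String line;
        while ((line = reader.readLine()) != null) {
          System.out.println(line); // the synced lines do show up here
        }
        reader.close();
      }
    }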

One last thing: the web interface doesn't show the contents, and
hadoop fs -ls says the file is empty.

What am I doing wrong?

Thanks!

Lucas



On Sat, Feb 23, 2013 at 4:37 AM, Hemanth Yamijala <yhemanth@thoughtworks.com
> wrote:

> Could you please clarify, are you opening the file in your mapper code and
> reading from there ?
>
> Thanks
> Hemanth
>
> On Friday, February 22, 2013, Lucas Bernardi wrote:
>
>> Hello there, I'm trying to use hadoop map reduce to process an open file.
>> The writing process, writes a line to the file and syncs the file to
>> readers.
>> (org.apache.hadoop.fs.FSDataOutputStream.sync()).
>>
>> If I try to read the file from another process, it works fine, at least
>> using
>> org.apache.hadoop.fs.FSDataInputStream.
>>
>> hadoop -fs -tail also works just fine
>>
>> But it looks like map reduce doesn't read any data. I tried using the
>> word count example, same thing, it is like if the file were empty for the
>> map reduce framework.
>>
>> I'm using hadoop 1.0.3. and pig 0.10.0
>>
>> I need some help around this.
>>
>> Thanks!
>>
>> Lucas
>>
>

Re: mapr videos question

Posted by Sai Sai <sa...@yahoo.in>.
Hi
Could someone please verify whether the mapr videos are meant for learning hadoop or for learning mapr? If we are interested in learning hadoop only, will they help? As a starter, I would like to understand just hadoop, not mapr yet.

Just wondering if others can share their thoughts and any relevant links.
Thanks,
Sai

Re: map reduce and sync

Posted by Hemanth Yamijala <yh...@thoughtworks.com>.
Could you please clarify: are you opening the file in your mapper code and
reading from there?
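
(By "opening the file in your mapper code" I mean something roughly like
the sketch below, i.e. reading the file through the FileSystem API from
inside the mapper instead of feeding it to the job as input; the class name
and path are made up.)

    import java.io.BufferedReader;
    import java.io.IOException;
    import java.io.InputStreamReader;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    public class SideFileMapper extends Mapper<LongWritable, Text, Text, LongWritable> {
      @Override
      protected void setup(Context context) throws IOException, InterruptedException {
        FileSystem fs = FileSystem.get(context.getConfiguration());
        Path sideFile = new Path("/tmp/sync-test.txt"); // placeholder path
        BufferedReader side = new BufferedReader(new InputStreamReader(fs.open(sideFile)));
        String line;
        while ((line = side.readLine()) != null) {
          // process the side file contents here
        }
        side.close();
      }
    }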

Thanks
Hemanth

On Friday, February 22, 2013, Lucas Bernardi wrote:

> Hello there, I'm trying to use hadoop map reduce to process an open file. The
> writing process, writes a line to the file and syncs the file to readers.
> (org.apache.hadoop.fs.FSDataOutputStream.sync()).
>
> If I try to read the file from another process, it works fine, at least
> using
> org.apache.hadoop.fs.FSDataInputStream.
>
> hadoop -fs -tail also works just fine
>
> But it looks like map reduce doesn't read any data. I tried using the word
> count example, same thing, it is like if the file were empty for the map
> reduce framework.
>
> I'm using hadoop 1.0.3. and pig 0.10.0
>
> I need some help around this.
>
> Thanks!
>
> Lucas
>
