Posted to user@nutch.apache.org by al...@aim.com on 2009/08/26 23:33:10 UTC

content of hadoop-site.xml

Hello,

I have run the merge script to merge two crawl dirs, one 1.6GB and the other 120MB. My MacPro had 50GB of free space, yet the merge crashed with a no-space error, and after that the machine did not start. I have been told that OS X got corrupted.
I looked inside my nutch-1.0/conf/hadoop-site.xml file and it is empty. Can anyone let me know what must be put inside this file so that the merge does not take too much space?

Thanks in advance.
Alex.

how to effectively update index

Posted by al...@aim.com.
Hello,

I have a crawl folder with 2GB of data, and its index is 160MB. Then Nutch indexed another set of domains, and that crawl folder is about 1MB. I wonder whether there is an effective way to make the indexes from both folders available for search without using the merge script, since merging large segments and indexes is resource-consuming.

Thanks.
Alex.
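
[Editorial sketch: one way to search both crawls without merging at all is Nutch's distributed search, which queries several indexes side by side. This is a minimal, hedged sketch assuming Nutch 1.0's DistributedSearch server takes a port and a crawl dir; the port numbers and /data/... paths are hypothetical, and the search webapp's searcher.dir must point at the directory holding search-servers.txt.]

  # start one search server per crawl dir (ports arbitrary, paths hypothetical)
  bin/nutch server 9999 /data/crawl-big &
  bin/nutch server 9998 /data/crawl-small &

  # list the servers, one "host port" pair per line, in the directory
  # that searcher.dir (nutch-site.xml) points to
  printf 'localhost 9999\nlocalhost 9998\n' > /data/search/search-servers.txt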

Re: content of hadoop-site.xml

Posted by al...@aim.com.

As I understand it, you suggest putting the segment files under the segments folder and merging only the indexes. In that case my question is: why do we need to merge segments at all, if we can do without merging them? The only thing I found in the mailing lists was changing settings in hadoop-site.xml, but that file is empty. Could you please provide some links?


Thanks.
Alex.
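
[Editorial sketch: merging only the indexes, as suggested, is a separate and much lighter job than merging segments. A minimal sketch, assuming Nutch 1.0's IndexMerger behind bin/nutch merge; the /data/... paths are hypothetical.]

  # collect the segment dirs side by side under one segments/ folder,
  # instead of running the heavy mergesegs job
  cp -r /data/crawl2/segments/* /data/crawl1/segments/

  # merge only the Lucene indexes, which are far smaller than the segments
  bin/nutch merge /data/merged-index /data/crawl1/indexes /data/crawl2/indexes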

Re: content of hadoop-site.xml

Posted by MilleBii <mi...@gmail.com>.
Not strange; look at the mailing list, there have been lots of discussions on this issue.
You may want to use the compress option.
And/or start using Hadoop in pseudo-distributed mode, so that the reduce starts consuming the map data; in 'local' mode you get the map first and the reduce after, so there can be a lot of data in the tmp directory.

Segment merge uses a LOT of space, so much that I don't use it anymore. I only merge my indexes, which are much smaller in my case.


-- 
-MilleBii-
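
[Editorial sketch: the two suggestions above as hadoop-site.xml overrides, assuming the property names of the Hadoop 0.19 line bundled with Nutch 1.0; verify them against your version.]

  cat > conf/hadoop-site.xml <<'EOF'
  <?xml version="1.0"?>
  <configuration>
    <!-- the compress option: compress intermediate map output -->
    <property>
      <name>mapred.compress.map.output</name>
      <value>true</value>
    </property>
    <!-- pseudo-distributed mode, so reduces pull map data as it is produced -->
    <property>
      <name>fs.default.name</name>
      <value>hdfs://localhost:9000</value>
    </property>
    <property>
      <name>mapred.job.tracker</name>
      <value>localhost:9001</value>
    </property>
  </configuration>
  EOF

[Pseudo-distributed mode also requires formatting HDFS, starting the daemons, and copying the crawl data in, so the compress flag alone is the smaller first step.]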

RE: content of hadoop-site.xml

Posted by Fuad Efendi <fu...@efendi.ca>.
Unfortunately, you can't manage disk space usage via configuration parameters... it is not easy... just keep your eyes on services/processes/RAM/swap (disk swapping happens if RAM is not enough) during the merge, even browse the files/folders and click the 'refresh' button to get an idea... it is strange that 50G was not enough to merge 2G; maybe the problem is somewhere else (OS X specifics, for instance)... try to play with Nutch with smaller segment sizes and study its behaviour on your OS...
-Fuad
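
[Editorial sketch: a minimal terminal version of the 'keep your eyes on it' advice, assuming local mode's default tmp location /tmp/hadoop-$USER (the hadoop.tmp.dir default); check your own config for the actual path.]

  # watch free disk space and the size of Hadoop's tmp dir during the merge
  while true; do
    df -h /
    du -sh /tmp/hadoop-"$USER" 2>/dev/null
    sleep 30
  done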


Re: content of hadoop-site.xml

Posted by al...@aim.com.

Thanks for the response.

How can I check disk swap?
The 50GB was measured before running the merge command; when it crashed, the available space was 1 KB. The RAM in my MacPro is 2GB. I deleted the tmp folders created by Hadoop during the merge, and after that OS X does not start. I plan to run the merge again and need to reduce the disk space used by the merge. I have read on the net that to reduce space we must use hadoop-site.xml, but there is no hadoop-default.xml file and the hadoop-site.xml file is empty.


Thanks.
Alex.
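
[Editorial note on the missing hadoop-default.xml: in Nutch 1.0 the Hadoop defaults ship inside the hadoop jar under lib/, which is why only an empty hadoop-site.xml sits in conf/; overrides go into that file. A minimal sketch that moves Hadoop's tmp dir to a volume with more room; the /Volumes/Big path is an assumption, and this relocates the temporary data rather than shrinking it.]

  cat > conf/hadoop-site.xml <<'EOF'
  <?xml version="1.0"?>
  <configuration>
    <!-- put Hadoop's temporary job data on a disk with more free space -->
    <property>
      <name>hadoop.tmp.dir</name>
      <value>/Volumes/Big/hadoop-tmp</value>
    </property>
  </configuration>
  EOF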


 

RE: content of hadoop-site.xml

Posted by Fuad Efendi <fu...@efendi.ca>.
You can override default settings (nutch-default.xml) in nutch-site.xml, but it won't help with disk space; an empty file is OK.

"merge" may generate temporary files, but 50GB against 2GB looks extremely strange; try to empty the recycle bin, for instance... check disk swap... the OS may report 50G available but you may be out of space... for instance, heavy disk swap during the merge due to low RAM...



-Fuad
http://www.linkedin.com/in/liferay
http://www.tokenizer.org
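
[Editorial sketch for the 'check disk swap' suggestion: on OS X the swap files live under /private/var/vm, so a quick look needs only standard OS X commands, no Nutch assumptions.]

  # summary of current swap usage
  sysctl vm.swapusage

  # the swap files themselves; heavy swapping grows these by gigabytes
  ls -lh /private/var/vm/

  # paging activity counters since boot
  vm_stat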

