Posted to user@nutch.apache.org by Shawn Gervais <pr...@project10.net> on 2006/04/11 07:13:52 UTC

Does hadoop not reclaim blocks when files are deleted?

Greetings list,

This is my DFS report:

Total raw bytes: 709344133120 (660.62 Gb)
Used raw bytes: 302794461922 (281.99 Gb)
% used: 42.68%

Total effective bytes: 11826067632 (11.01 Gb)
Effective replication multiplier: 25.6039853097637

These numbers seem to me to be completely insane -- a 25 times 
replication of blocks. I have my replication factor set to 3.

"Used raw bytes" goes up when I run jobs, and if I delete files those 
jobs produce within DFS (e.g. a segment for a failed fetch), it doesn't 
appear that hadoop immediately reclaims the space used by the deleted 
files' blocks.

Am I right? Is this a bug?

-Shawn

Re: Does hadoop not reclaim blocks when files are deleted?

Posted by Andrzej Bialecki <ab...@getopt.org>.
Shawn Gervais wrote:
> Andrzej Bialecki wrote:
>> Shawn Gervais wrote:
>>> Greetings list,
>>>
>>> This is my DFS report:
>>>
>>> Total raw bytes: 709344133120 (660.62 Gb)
>>> Used raw bytes: 302794461922 (281.99 Gb)
>>> % used: 42.68%
>>>
>>> Total effective bytes: 11826067632 (11.01 Gb)
>>> Effective replication multiplier: 25.6039853097637
>>>
>>> These numbers seem to me to be completely insane -- a 25 times 
>>> replication of blocks. I have my replication factor set to 3.
>>>
>>> "Used raw bytes" goes up when I run jobs, and if I delete files 
>>> those jobs produce within DFS (e.g. a segment for a failed fetch), 
>>> it doesn't appear that hadoop immediately reclaims the space used by 
>>> the deleted files' blocks.
>>>
>>> Am I right? Is this a bug?
>>
>> What does 'hadoop fsck /' say?
>
> Status: HEALTHY
>  Total size:    5265553979 B
>  Total blocks:  1952 (avg. block size 2697517 B)
>  Total dirs:    333
>  Total files:   1868
>  Over-replicated blocks:        0 (0.0 %)
>  Under-replicated blocks:       0 (0.0 %)
>  Target replication factor:     3
>  Real replication factor:       3.0
>
> For reference, the head of the "dfs -report" output taken at the same 
> time -- my dfs has changed since the above mail.
>
> Total raw bytes: 709344133120 (660.62 Gb)
> Used raw bytes: 68715927635 (63.99 Gb)
> % used: 9.68%
>
> Total effective bytes: 5265553979 (4.90 Gb)
> Effective replication multiplier: 13.050085120967667
>
> I am still confused by the wildly different replication factors seen 
> here.
>
> It occurs to me that Hadoop might be counting as "used raw bytes" 
> other files that are not in the DFS filestore, but are on the same 
> partition. For example, my Nutch installation and some other files are 
> on the same partition as the DFS filestore.

I'm not sure what is happening here... I suggest escalating this issue 
on the Hadoop list. All I know is that the number from fsck comes from 
a "manual" count of the blocks in each file and of their replica 
locations as reported by the datanodes - which seems to me like a more 
reliable way of calculating this value. However, you cannot learn from 
it how much free space remains available.

'dfs -report', on the other hand, uses the native df(1) utility to read 
the total/available/used disk space of the partition, regardless of 
whether that space is allocated to DFS blocks or not. It doesn't use 
du(1), which works at the per-directory level, but df(1), which works 
at the per-mount level - so anything else stored on the same partition 
is counted in "used raw bytes" as well.
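For what it's worth, a quick way to see this on a datanode is to compare
the partition-level and directory-level numbers yourself (the paths below
are only examples - substitute your own dfs.data.dir):

bin/hadoop dfs -report         # partition-level view, same numbers as df
bin/hadoop fsck /              # block-level view of what is actually in DFS
df -h /data                    # raw partition usage, everything included
du -sh /data/hadoop/dfs/data   # space taken by the DFS block files alone

If du stays flat while df keeps climbing, the extra "used raw bytes" are
coming from files outside the DFS block store.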

-- 
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: Does hadoop not reclaim blocks when files are deleted?

Posted by Shawn Gervais <pr...@project10.net>.
Andrzej Bialecki wrote:
> Shawn Gervais wrote:
>> Greetings list,
>>
>> This is my DFS report:
>>
>> Total raw bytes: 709344133120 (660.62 Gb)
>> Used raw bytes: 302794461922 (281.99 Gb)
>> % used: 42.68%
>>
>> Total effective bytes: 11826067632 (11.01 Gb)
>> Effective replication multiplier: 25.6039853097637
>>
>> These numbers seem to me to be completely insane -- a 25 times 
>> replication of blocks. I have my replication factor set to 3.
>>
>> "Used raw bytes" goes up when I run jobs, and if I delete files those 
>> jobs produce within DFS (e.g. a segment for a failed fetch), it 
>> doesn't appear that hadoop immediately reclaims the space used by the 
>> deleted files' blocks.
>>
>> Am I right? Is this a bug?
> 
> What does 'hadoop fsck /' say?

Status: HEALTHY
  Total size:    5265553979 B
  Total blocks:  1952 (avg. block size 2697517 B)
  Total dirs:    333
  Total files:   1868
  Over-replicated blocks:        0 (0.0 %)
  Under-replicated blocks:       0 (0.0 %)
  Target replication factor:     3
  Real replication factor:       3.0

For reference, the head of the "dfs -report" output taken at the same 
time -- my dfs has changed since the above mail.

Total raw bytes: 709344133120 (660.62 Gb)
Used raw bytes: 68715927635 (63.99 Gb)
% used: 9.68%

Total effective bytes: 5265553979 (4.90 Gb)
Effective replication multiplier: 13.050085120967667

I am still confused by the wildly different replication factors seen here.

It occurs to me that Hadoop might be counting as "used raw bytes" other 
files that are not in the DFS filestore, but are on the same partition. 
For example, my Nutch installation and some other files are on the same 
partition as the DFS filestore.

Regards,
-Shawn

RE: Small dev question

Posted by Gal Nitzan <gn...@usa.net>.
Thank you very much for your prompt reply.

I see what you mean.

Regards,

Gal.


-----Original Message-----
From: Andrzej Bialecki [mailto:ab@getopt.org] 
Sent: Tuesday, April 11, 2006 12:17 PM
To: nutch-user@lucene.apache.org
Subject: Re: Small dev question

Gal Nitzan wrote:
> Hi Andrzej,
>
> I have two questions in regards to ParseOutputFormat.java:
>
> 1. On line 102 a String[] is used. Do you think it might be better to use a
> ListArray? It would save a few cycles down the road -- it would spare you
> the "validCount" counter and the "if" on line 121. I can make a patch if
> you think I'm correct on this.
>   

I doubt it would save anything, and even if it did, the savings would be 
negligible. Creating a new entry in a ListArray and hooking it up to the 
list has some cost, too.

> 2. If I understand the functionality correctly, on line 87 a new CrawlDatum
> is created for the fetched page. The interval is set to 0.0. Could you
> please explain why it is set to 0.0?
>   
That's only a special additional CrawlDatum, which serves as a signature
container. You see, if we don't parse at the same time as we fetch then we
can't put the signature in the same CrawlDatum (see the logic in
Fetcher.FetcherThread.output()), so we need another instance, to pick up the
signature when running updatedb.

-- 
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com





Re: Enabling different file types

Posted by Rajesh Munavalli <fi...@gmail.com>.
Follow these steps for nutch-0.7.2:

(1) Modify nutch-default.xml for the following property.
For example, if you want to include the "doc" file type, change the <value>
node to "parse-(text|html|doc)" as shown below.

<property>
  <name>plugin.includes</name>

<value>nutch-extensionpoints|protocol-http|urlfilter-regex|parse-(text|html|doc)|index-basic|query-(basic|site|url)</value>
  <description>Regular expression naming plugin directory names to
  include.  Any plugin not matching this expression is excluded.
  In any case you need at least include the nutch-extensionpoints plugin. By
  default Nutch includes crawling just HTML and plain text via HTTP,
  and basic indexing and search plugins.
  </description>
</property>

(2) The next step is to develop the appropriate plugin for the particular
file type. The parser needs to implement the "Parser" interface
(org.apache.nutch.parse) in Nutch.

More details can be found in the following link
http://wiki.apache.org/nutch/WritingPluginExample
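To give a rough idea of the shape, here is a skeleton parser. The class and
package names are made up, and the exact constructor and method signatures
changed between Nutch releases, so treat this only as a sketch and check it
against the 0.7.2 source:

// Hypothetical skeleton only - names and signatures are assumed from the
// 0.7-era API; verify them against your Nutch source tree.
package com.example.nutch.parse.xhtml;

import java.util.Properties;

import org.apache.nutch.parse.Outlink;
import org.apache.nutch.parse.Parse;
import org.apache.nutch.parse.ParseData;
import org.apache.nutch.parse.ParseException;
import org.apache.nutch.parse.ParseImpl;
import org.apache.nutch.parse.Parser;
import org.apache.nutch.protocol.Content;

public class XhtmlParser implements Parser {

  public Parse getParse(Content content) throws ParseException {
    // Turn the raw bytes into text and a title with whatever library
    // suits the content type (NekoHTML, TagSoup, POI, ...).
    String text = new String(content.getContent());  // placeholder extraction
    String title = "";                                // placeholder title
    Outlink[] outlinks = new Outlink[0];              // no outlink extraction here

    ParseData data = new ParseData(title, outlinks, new Properties());
    return new ParseImpl(text, data);
  }
}

The plugin.xml in step (3) is what tells Nutch to route documents with the
matching contentType to this class.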

(3) Modify the plugin.xml. The link above describes everything in detail.
Here is an example plugin.xml I wrote for an XHTML parser. Observe the
"contentType" attribute, which matches the file type you are trying to parse.

<?xml version="1.0" encoding="UTF-8"?>
<plugin id="parse-xhtml" name="Xhtml Parse Plug-in" version="1.0.0"
provider-name="dessci.com">

    <runtime>
      <library name="parse-xhtml.jar">
         <export name="*"/>
      </library>
      <library name="nekohtml-0.9.4.jar"/>
      <library name="tagsoup-1.0rc3.jar"/>
   </runtime>

   <extension id="com.dessci.search.nutch.parse.xhtml"
              name="XhtmlParse"
              point="org.apache.nutch.parse.Parser">

      <implementation id="com.dessci.search.nutch.parse.xhtml.XhtmlParser"
                      class="com.dessci.search.nutch.parse.xhtml.XhtmlParser
"
                      contentType="application/xhtml+xml"
                      pathSuffix=""/>

   </extension>

</plugin>



Hope this helps,

--Rajesh Munavalli
On 4/11/06, bob knob <an...@yahoo.com> wrote:
>
> Hi, it's me again,
>
> If I'm going to use Nutch, I need xls, ppt, & doc file
> types to be searchable if at all possible. The wiki
> says most file types are disabled by default, but they
> can be turned on by changing conf/nutch-site.xml.
> Unfortunately there is no documentation that I can
> find for this file... any ideas how to do it, or
> sample xml that somebody could send over?
>
> Thanks,
> Bob
>

Re: Enabling different file types

Posted by Rajesh Munavalli <fi...@gmail.com>.
Have a look at http://jakarta.apache.org/poi/

On 4/11/06, bob knob <an...@yahoo.com> wrote:
>
> Okay but it sounds like I need parser plugins for
> word, excel and powerpoint - plugins only has a
> parser-msword directory. Has anyone created plugins
> for excel & powerpoint?
>
> --- Jérôme Charron <je...@gmail.com>
> wrote:
>
> > > types to be searchable if at all possible. The
> > wiki
> > > says most file types are disabled by default, but
> > they
> > > can be turned on by changing conf/nutch-site.xml.
> > > Unfortunately there is no documentation that I can
> > > find for this file... any ideas how to do it, or
> > > sample xml that somebody could send over?
> >
> > Simply add the plugin name in the plugin.includes
> > property.
> > For instance, to activate word, powerpoint and excel
> > parsing, just add in
> > this property :
> > ... |parse-msexcel|parse-mspowerpoint|parse-msword|
> > ...
> > or in a shorter syntax :
> > ... |parse-ms(excel|powerpoint|word)| ...
> >
> > This is described on the Wiki in the page :
> > http://wiki.apache.org/nutch/WritingPluginExample
> > Section "Getting Nutch to Use Your Plugin"
> >
> >
> > Regards
> >
> > Jérôme
> >
> > --
> > http://motrech.free.fr/
> > http://www.frutch.org/
> >
>
>

Re: Enabling different file types

Posted by Jérôme Charron <je...@gmail.com>.
> Okay but it sounds like I need parser plugins for
> word, excel and powerpoint - plugins only has a
> parser-msword directory. Has anyone created plugins
> for excel & powerpoint?

They are available in the trunk version, not in the 0.7.x releases.

Jérôme

--
http://motrech.free.fr/
http://www.frutch.org/

Re: Enabling different file types

Posted by bob knob <an...@yahoo.com>.
Okay but it sounds like I need parser plugins for
word, excel and powerpoint - plugins only has a
parser-msword directory. Has anyone created plugins
for excel & powerpoint? 

--- Jérôme Charron <je...@gmail.com>
wrote:

> > types to be searchable if at all possible. The
> wiki
> > says most file types are disabled by default, but
> they
> > can be turned on by changing conf/nutch-site.xml.
> > Unfortunately there is no documentation that I can
> > find for this file... any ideas how to do it, or
> > sample xml that somebody could send over?
> 
> Simply add the plugin name in the plugin.includes
> property.
> For instance, to activate word, powerpoint and excel
> parsing, just add in
> this property :
> ... |parse-msexcel|parse-mspowerpoint|parse-msword|
> ...
> or in a shorter syntax :
> ... |parse-ms(excel|powerpoint|word)| ...
> 
> This is described on the Wiki in the page :
> http://wiki.apache.org/nutch/WritingPluginExample
> Section "Getting Nutch to Use Your Plugin"
> 
> 
> Regards
> 
> Jérôme
> 
> --
> http://motrech.free.fr/
> http://www.frutch.org/
> 



Re: Enabling different file types

Posted by Jérôme Charron <je...@gmail.com>.
> types to be searchable if at all possible. The wiki
> says most file types are disabled by default, but they
> can be turned on by changing conf/nutch-site.xml.
> Unfortunately there is no documentation that I can
> find for this file... any ideas how to do it, or
> sample xml that somebody could send over?

Simply add the plugin name to the plugin.includes property.
For instance, to activate Word, PowerPoint and Excel parsing, just add to
this property:
... |parse-msexcel|parse-mspowerpoint|parse-msword| ...
or, in a shorter syntax:
... |parse-ms(excel|powerpoint|word)| ...
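If you prefer not to touch nutch-default.xml, the same property can be
overridden in conf/nutch-site.xml. A rough sketch (the surrounding plugin
list is only an example - keep whatever you already have in your value,
inside the file's usual root element):

<property>
  <name>plugin.includes</name>
  <value>nutch-extensionpoints|protocol-http|urlfilter-regex|parse-(text|html)|parse-ms(excel|powerpoint|word)|index-basic|query-(basic|site|url)</value>
</property>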

This is described on the Wiki in the page:
http://wiki.apache.org/nutch/WritingPluginExample
Section "Getting Nutch to Use Your Plugin"


Regards

Jérôme

--
http://motrech.free.fr/
http://www.frutch.org/

Enabling different file types

Posted by bob knob <an...@yahoo.com>.
Hi, it's me again,

If I'm going to use Nutch, I need xls, ppt, & doc file
types to be searchable if at all possible. The wiki
says most file types are disabled by default, but they
can be turned on by changing conf/nutch-site.xml.
Unfortunately there is no documentation that I can
find for this file... any ideas how to do it, or
sample xml that somebody could send over?

Thanks,
Bob


Auto-crawling & re-crawling the web site

Posted by bob knob <an...@yahoo.com>.
Hi,

I am currently evaluating Nutch for use on an intranet
site search engine. I am by no means an expert in this
field although I am trying to learn more about it.

1. I was reading one of the articles referenced on the
nutch site:

http://today.java.net/pub/a/today/2006/02/16/introduction-to-nutch-2.html

-and I was a little bit concerned about its warning
concerning "re-crawling" the site. I understand that
there are several steps of crawling, building the
index, etc., but it sounded to me like new pages on my
web site would be ignored until I restarted the Nutch
server even after I've re-crawled. Am I correct about
this? How do most people deal with it?

2. It seems like I would want to re-crawl or re-index
the site on a nightly basis. All of this seems to be
done with shell scripts, and I wonder what options are
available to someone working on a Windows platform. I
could run cygrunsrv/cron on Windows I guess. Is there
some reason more of this scripting couldn't be redone
as a Java program? Also, has anybody considered
creating a Windows service to manage indexing/crawling
like the one that manages the Tomcat web server?

Thanks,
Bob


Re: Small dev question

Posted by Andrzej Bialecki <ab...@getopt.org>.
Gal Nitzan wrote:
> Hi Andrzej,
>
> I have two questions in regards to ParseOutputFormat.java:
>
> 1. On line 102 a String[] is used. Do you think it might be better to use a
> ListArray? It would save a few cycles down the road -- it would spare you
> the "validCount" counter and the "if" on line 121. I can make a patch if
> you think I'm correct on this.
>   

I doubt it would save anything, and even if it did, the savings would be 
negligible. Creating a new entry in a ListArray and hooking it up to the 
list has some cost, too.

> 2. If I understand the functionality correctly, on line 87 a new CrawlDatum
> is created for the fetched page. The interval is set to 0.0. Could you
> please explain why it is set to 0.0?
>   
That's only a special additional CrawlDatum, which serves as a signature 
container. You see, if we don't parse at the same time as we fetch then we 
can't put the signature in the same CrawlDatum (see the logic in 
Fetcher.FetcherThread.output()), so we need another instance, to pick up the 
signature when running updatedb.
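In code, the idea is roughly the following. The STATUS_SIGNATURE constant
and the (status, fetchInterval) constructor are recalled from memory and
have changed between versions, so treat this as a sketch rather than the
actual ParseOutputFormat source:

// Sketch only - not the actual ParseOutputFormat code.
import org.apache.nutch.crawl.CrawlDatum;

class SignatureDatumSketch {
  // Build the extra CrawlDatum whose only job is to ferry the page
  // signature to updatedb when parsing runs separately from fetching.
  static CrawlDatum signatureDatum(byte[] signature) {
    CrawlDatum datum = new CrawlDatum(CrawlDatum.STATUS_SIGNATURE, 0.0f);
    datum.setSignature(signature);   // the signature computed at parse time
    return datum;                    // interval 0.0 is a don't-care value
  }
}

The datum is then written out next to the real CrawlDatum, keyed by the page
URL, which is where updatedb picks the signature up.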

-- 
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Small dev question

Posted by Gal Nitzan <gn...@usa.net>.
Hi Andrzej,

I have two questions in regards to ParseOutputFormat.java:

1. On line 102 a String[] is used. Do you think it might be better to use a
ListArray? It would save a few cycles down the road -- it would spare you
the "validCount" counter and the "if" on line 121. I can make a patch if
you think I'm correct on this.

2. If I understand the functionality correctly, on line 87 a new CrawlDatum is
created for the fetched page. The interval is set to 0.0. Could you please
explain why it is set to 0.0?

Thanks in advance,

Gal.




Re: Does hadoop not reclaim blocks when files are deleted?

Posted by Andrzej Bialecki <ab...@getopt.org>.
Shawn Gervais wrote:
> Greetings list,
>
> This is my DFS report:
>
> Total raw bytes: 709344133120 (660.62 Gb)
> Used raw bytes: 302794461922 (281.99 Gb)
> % used: 42.68%
>
> Total effective bytes: 11826067632 (11.01 Gb)
> Effective replication multiplier: 25.6039853097637
>
> These numbers seem to me to be completely insane -- a 25 times 
> replication of blocks. I have my replication factor set to 3.
>
> "Used raw bytes" goes up when I run jobs, and if I delete files those 
> jobs produce within DFS (e.g. a segment for a failed fetch), it 
> doesn't appear that hadoop immediately reclaims the space used by the 
> deleted files' blocks.
>
> Am I right? Is this a bug?

What does 'hadoop fsck /' say?

-- 
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com