Posted to user@nutch.apache.org by Matteo Simoncini <si...@gmail.com> on 2012/09/19 10:07:30 UTC
tmp folder problem
Hi,
I'm running Nutch 1.5.1 on a virtual machine to crawl a large number of URLs.
I gave enough space to the "crawl" folder, the one where the linkdb and
crawldb go, and to the Solr folder.
It worked fine up to about 200,000 URLs, but now I get an IOException that says
there isn't enough memory.
Looking at the "crawl" folder and the Solr folder, everything is fine. The
exception was thrown because the temp folder (actually the temp/hadoop-root
folder) has grown to 14GB.
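The size can be checked with something like the following (the exact path is an
assumption; by default Hadoop keeps its local temp data under /tmp/hadoop-<user>):
# show how much space each job left behind in Hadoop's local temp folder
du -sh /tmp/hadoop-root/*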
The solutions I can think of are:
1) Delete some tmp files. But which ones, and when?
2) Make Nutch generate its tmp files in another directory (maybe
<nutch_folder>/tmp).
How can I do that? Is there a third, better solution?
Here is a copy of my script.
#!/bin/bash
# inject the initial seed list into the crawldb
bin/nutch inject test/crawldb urls
# initialization of the variables
counter=1
error=0
# loop until there are no more URLs to fetch
while [ $error -ne 1 ]
do
  # generate a new segment with at most 10000 top-scoring URLs
  echo "[ Script ] Starting generating phase"
  bin/nutch generate test/crawldb test/segments -topN 10000
  if [ $? -ne 0 ]
  then
    echo "[ Script ] Stopping: No more URLs to fetch."
    error=1
    break
  fi
  # pick the newest segment directory
  segment=$(ls -d test/segments/2* | tail -1)
  # fetching phase
  echo "[ Script ] Starting fetching phase"
  bin/nutch fetch "$segment" -threads 20
  if [ $? -ne 0 ]
  then
    echo "[ Script ] Fetch $segment failed. Deleting it."
    rm -rf "$segment"
    continue
  fi
  # parsing phase
  echo "[ Script ] Starting parsing phase"
  bin/nutch parse "$segment"
  # updatedb phase
  echo "[ Script ] Starting updateDB phase"
  bin/nutch updatedb test/crawldb "$segment"
  # indexing with Solr
  bin/nutch invertlinks test/linkdb -dir test/segments
  bin/nutch solrindex http://127.0.0.1:8983/solr/ test/crawldb -linkdb test/linkdb test/segments/*
done
Thanks for your help.
Matteo
Re: tmp folder problem
Posted by Matteo Simoncini <si...@gmail.com>.
Thanks, you really helped a lot.
Matteo
Re: tmp folder problem
Posted by Sebastian Nagel <wa...@googlemail.com>.
Hi Matteo,
have a look at the property hadoop.tmp.dir, which allows you to direct
the temp folder to another volume with more space on it.
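For a local crawl the property can be overridden in conf/nutch-site.xml.
A minimal sketch, assuming /mnt/bigdisk/nutch-tmp is just an example path
on a volume with enough free space:
<!-- inside the <configuration> element of conf/nutch-site.xml -->
<property>
  <name>hadoop.tmp.dir</name>
  <!-- example path: any writable directory on a volume with enough free space -->
  <value>/mnt/bigdisk/nutch-tmp</value>
</property>
Make sure the directory exists and is writable by the user running Nutch.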
For "local" crawls:
- do not share this folder for two simultaneously running Nutch jobs
- you have to clean-up the temp folder, esp. after failed jobs
(if no job is currently running with this folder defined as hadoop.tmp.dir
a clean-up is save)
Successful jobs do not leave any data in temp except for empty directories.
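A minimal clean-up sketch for local mode, assuming hadoop.tmp.dir points at
/mnt/bigdisk/nutch-tmp (example path) and that local Nutch jobs show up as
Java processes with org.apache.nutch on their command line:
#!/bin/bash
# remove leftovers of failed jobs; only safe when no Nutch job is running
TMP_DIR=/mnt/bigdisk/nutch-tmp   # example path, must match hadoop.tmp.dir
if pgrep -f org.apache.nutch > /dev/null
then
  echo "A Nutch job is still running, not touching $TMP_DIR"
else
  rm -rf "$TMP_DIR"/*
fi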
Sebastian
P.S.:
Search for nutch + hadoop.tmp.dir; there is plenty of information on the wiki
and the mailing lists.
Re: tmp folder problem
Posted by Matteo Simoncini <si...@gmail.com>.
Any advice?
Matteo