Posted to user@nutch.apache.org by Matteo Simoncini <si...@gmail.com> on 2012/09/19 10:07:30 UTC
tmp folder problem
Hi,
I'm running Nutch 1.5.1 on a virtual machine to crawl a large number of URLs.
I gave enough space to the "crawl" folder, the one where the linkdb and
crawldb go, and to the Solr folder.
It worked fine up to about 200,000 URLs, but now I get an IOException that says
there isn't enough memory.
Looking at the "crawl" folder and the Solr folder, everything is fine. The
exception was thrown because the temp folder (actually the temp/hadoop-root
folder) has grown to 14GB.
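The size can be checked with something like the following (the exact path is an
assumption; by default Hadoop keeps its local temp data under /tmp/hadoop-<user>):
# show how much space each job left behind in Hadoop's local temp folder
du -sh /tmp/hadoop-root/*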
The solutions I can think of are:
1) Delete some tmp files. But which ones, and when?
2) Make Nutch generate its tmp files in another directory (maybe
<nutch_folder>/tmp).
How can I do that? Is there a third, better solution?
Here is a copy of my script.
#!/bin/bash
# inject the initial seed list into the crawldb
bin/nutch inject test/crawldb urls
# initialization of the variables
counter=1
error=0
# loop until there are no more URLs to fetch
while [ $error -ne 1 ]
do
  # generate a new segment with at most 10000 top-scoring URLs
  echo "[ Script ] Starting generating phase"
  bin/nutch generate test/crawldb test/segments -topN 10000
  if [ $? -ne 0 ]
  then
    echo "[ Script ] Stopping: No more URLs to fetch."
    error=1
    break
  fi
  # pick the newest segment directory
  segment=$(ls -d test/segments/2* | tail -1)
  # fetching phase
  echo "[ Script ] Starting fetching phase"
  bin/nutch fetch "$segment" -threads 20
  if [ $? -ne 0 ]
  then
    echo "[ Script ] Fetch $segment failed. Deleting it."
    rm -rf "$segment"
    continue
  fi
  # parsing phase
  echo "[ Script ] Starting parsing phase"
  bin/nutch parse "$segment"
  # updatedb phase
  echo "[ Script ] Starting updateDB phase"
  bin/nutch updatedb test/crawldb "$segment"
  # indexing with Solr
  bin/nutch invertlinks test/linkdb -dir test/segments
  bin/nutch solrindex http://127.0.0.1:8983/solr/ test/crawldb -linkdb test/linkdb test/segments/*
done
Thanks for your help.
Matteo
Re: tmp folder problem
Posted by Matteo Simoncini <si...@gmail.com>.
Thanks, you really helped a lot.
Matteo
Re: tmp folder problem
Posted by Sebastian Nagel <wa...@googlemail.com>.
Hi Matteo,
have a look at the property hadoop.tmp.dir, which allows you to direct
the temp folder to another volume with more space on it.
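For a local crawl the property can be overridden in conf/nutch-site.xml.
A minimal sketch, assuming /mnt/bigdisk/nutch-tmp is just an example path
on a volume with enough free space:
<!-- inside the <configuration> element of conf/nutch-site.xml -->
<property>
  <name>hadoop.tmp.dir</name>
  <!-- example path: any writable directory on a volume with enough free space -->
  <value>/mnt/bigdisk/nutch-tmp</value>
</property>
Make sure the directory exists and is writable by the user running Nutch.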
For "local" crawls:
- do not share this folder for two simultaneously running Nutch jobs
- you have to clean-up the temp folder, esp. after failed jobs
(if no job is currently running with this folder defined as hadoop.tmp.dir
a clean-up is save)
Successful jobs do not leave any data in temp except for empty directories.
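A minimal clean-up sketch for local mode, assuming hadoop.tmp.dir points at
/mnt/bigdisk/nutch-tmp (example path) and that local Nutch jobs show up as
Java processes with org.apache.nutch on their command line:
#!/bin/bash
# remove leftovers of failed jobs; only safe when no Nutch job is running
TMP_DIR=/mnt/bigdisk/nutch-tmp   # example path, must match hadoop.tmp.dir
if pgrep -f org.apache.nutch > /dev/null
then
  echo "A Nutch job is still running, not touching $TMP_DIR"
else
  rm -rf "$TMP_DIR"/*
fi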
Sebastian
P.S.:
Search for nutch + hadoop.tmp.dir; there is plenty of information on the wiki
and the mailing lists.
Re: tmp folder problem
Posted by Matteo Simoncini <si...@gmail.com>.
Any advice?
Matteo