Posted to user@nutch.apache.org by George <ad...@proservice.ge> on 2012/03/17 07:40:58 UTC
Fetching/Indexing process is taking a lot of time
Hello
I'm using nutch 9.0, default installation, on a single machine:
2x 2.5 GHz quad core
16 GB RAM
6 x 1 TB SATA, RAID 1
Network: 1 Gbps.
Not using any distributed file system.
Of course I have it configured:
all headers,
threads: 100.
Trying to crawl 30000 URLs with generate per site -1.
Fetching with:
--Script-------------------------------------------------------------------------------
#!/bin/bash
# runbot script to run the Nutch bot for crawling and re-crawling.
# Usage: bin/runbot [safe]
# If executed in 'safe' mode, it doesn't delete the temporary
# directories generated during crawl. This might be helpful for
# analysis and recovery in case a crawl fails.
#
# Author: Susam Pal
# LOCAL VARIABLES
cd /usr/local/nutch
export JAVA_HOME=/usr/local/java
export NUTCH_JAVA_HOME=/usr/local/java
export NUTCH_HEAPSIZE=2048
NUTCH_HOME=/usr/local/nutch
# if [ -e $NUTCH_HOME/nutch.tmp ]
# then
# echo "Index process found..."
# else
# date >> $NUTCH_HOME/nutch.tmp
depth=1
threads=100
adddays=30
topN=1000000 #Comment this statement if you don't want to set topN value
# Arguments for rm and mv
RMARGS="-rf"
MVARGS="-v"
# Parse arguments
if [ "$1" == "safe" ]
then
safe=yes
fi
if [ -z "$NUTCH_HOME" ]
then
NUTCH_HOME=.
echo runbot: $0 could not find environment variable NUTCH_HOME
echo runbot: NUTCH_HOME=$NUTCH_HOME has been set by the script
else
echo runbot: $0 found environment variable NUTCH_HOME=$NUTCH_HOME
fi
if [ -n "$topN" ]
then
topN="-topN $topN"
else
topN=""
fi
steps=8
echo "----- Inject (Step 1 of $steps) -----"
/bin/bash $NUTCH_HOME/bin/nutch inject /home/crawl/crawldb urls
echo "----- Generate, Fetch, Parse, Update (Step 2 of $steps) -----"
for ((i=0; i < depth; i++))
do
echo "--- Beginning crawl at depth `expr $i + 1` of $depth ---"
/bin/bash $NUTCH_HOME/bin/nutch generate /home/crawl/crawldb \
  /home/crawl/segments $topN -adddays $adddays
if [ $? -ne 0 ]
then
echo "runbot: Stopping at depth $depth. No more URLs to fetch."
break
fi
segment=`ls -d /home/crawl/segments/* | tail -1`
/bin/bash $NUTCH_HOME/bin/nutch fetch $segment -threads $threads
if [ $? -ne 0 ]
then
echo "runbot: fetch $segment at depth `expr $i + 1` failed."
echo "runbot: Deleting segment $segment."
rm $RMARGS $segment
continue
fi
/bin/bash $NUTCH_HOME/bin/nutch updatedb /home/crawl/crawldb $segment
done
echo "----- Merge Segments (Step 3 of $steps) -----"
#/bin/bash $NUTCH_HOME/bin/nutch mergesegs /home/crawl/MERGEDsegments /home/crawl/segments/*
#if [ "$safe" != "yes" ]
#then
# rm $RMARGS /home/crawl/segments
#else
# rm $RMARGS /home/crawl/BACKUPsegments
# mv $MVARGS /home/crawl/segments /home/crawl/BACKUPsegments
#fi
#mv $MVARGS /home/crawl/MERGEDsegments /home/crawl/segments
echo "----- Invert Links (Step 4 of $steps) -----"
/bin/bash $NUTCH_HOME/bin/nutch invertlinks /home/crawl/linkdb \
  /home/crawl/segments/*
echo "----- Index (Step 5 of $steps) -----"
/bin/bash $NUTCH_HOME/bin/nutch index /home/crawl/NEWindexes \
  /home/crawl/crawldb /home/crawl/linkdb \
  /home/crawl/segments/*
echo "----- Dedup (Step 6 of $steps) -----"
/bin/bash $NUTCH_HOME/bin/nutch dedup /home/crawl/NEWindexes
echo "----- Merge Indexes (Step 7 of $steps) -----"
/bin/bash $NUTCH_HOME/bin/nutch merge /home/crawl/NEWindex \
  /home/crawl/NEWindexes
echo "----- Loading New Index (Step 8 of $steps) -----"
if [ "$safe" != "yes" ]
then
rm $RMARGS /home/crawl/NEWindexes
rm $RMARGS /home/crawl/index
else
rm $RMARGS /home/crawl/BACKUPindexes
rm $RMARGS /home/crawl/BACKUPindex
mv $MVARGS /home/crawl/NEWindexes /home/crawl/BACKUPindexes
mv $MVARGS /home/crawl/index /home/crawl/BACKUPindex
fi
mv $MVARGS /home/crawl/NEWindex /home/crawl/index
#rm -f ${NUTCH_HOME}/nutch.tmp
/bin/bash $NUTCH_HOME/bin/nutch readdb /home/crawl/crawldb -stats 1
/bin/bash $NUTCH_HOME/bin/search.server stop
/bin/bash $NUTCH_HOME/bin/search.server start
echo "runbot: FINISHED: Crawl completed!"
echo ""
-----Script-----------------------------------------------------------------------------
All data is fetched to the Hadoop temporary directory "hadoop-root" at
/home/crawl/hadoop-root,
and after this step the data is moved from /home/crawl/hadoop-root to
/home/crawl/segments/xxxxxxxxxxx.
This step takes a lot of time; depending on the size it can take a week or
more.
During this step the data moves at a very low speed, about 500 KB/s (sorry, I
don't know what it is doing in this step; I'm just a user and have no Java
programming or Hadoop experience).
Is there any way to make this step faster?
Thanks
--
View this message in context: http://lucene.472066.n3.nabble.com/Fetching-Indexing-process-is-taking-a-lot-of-time-tp3834059p3834059.html
Sent from the Nutch - User mailing list archive at Nabble.com.
Re: Fetching/Indexing process is taking a lot of time
Posted by George <ad...@proservice.ge>.
fetching http://xxxxxxx.xxx/1-/color-%E1%83%97%E1%83%94%E1%83%97%E1%83%A0%E1%83%98/make-Caterham/1/listings.html
fetch of http://xxxxxxx.xxx/index.php?cat=27&currentpage=5 failed with:
java.net.SocketTimeoutException: connect timed out
fetch of http://xxxxxxx.xxx/gwdict/index.php?a=term&d=1&t=179856 failed
with: java.net.SocketException: Connection reset
fetch of http://wxxxxxxx.xxx/gwdict/index.php?a=term&d=1&t=72596 failed
with: java.net.SocketException: Connection reset
Error parsing:
http://xxxxxxx.xxx/pt/phpThumb.php?src=../movies/screens/mov_63_59472.jpg&w=125&h=90&zc=1:
failed(2,200): org.apache.nutch.parse.ParseException: parser not found for
contentType=image/jpeg
url=http://xxxxxxx.xxx/pt/phpThumb.php?src=../movies/screens/mov_63_59472.jpg&w=125&h=90&zc=1
Error parsing: http://xxxxxxx.xxx/albumimage-159: failed(2,200):
org.apache.nutch.parse.ParseException: parser not found for
contentType=image/png url=http://xxxxxxx.xxx/albumimage-159
Error parsing:
http://xxxxxxx.xxx/show_image_trnsArchive.php?filename=/2010/09/source1238.jpg&cat=1&pid=26373&cache=true:
failed(2,200): org.apache.nutch.parse.ParseException: parser not found for
contentType=image/jpeg
url=http://xxxxxxx.xxx/show_image_trnsArchive.php?filename=/2010/09/source1238.jpg&cat=1&pid=26373&cache=true
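Parse errors like these just mean image responses reached the parser. For reference, Nutch's stock conf/regex-urlfilter.txt skips common image suffixes with a rule of roughly this shape; note that query-string URLs like the phpThumb ones above do not end in an image suffix, so the second rule below is a hypothetical addition, not part of the default file:

```text
# Default-style suffix filter (abbreviated from Nutch's regex-urlfilter.txt).
-\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|jpeg|JPEG|bmp|BMP)$
# Hypothetical extra rule: also skip URLs whose query string references an image.
-(?i)\.(jpe?g|png|gif)(&|$)
```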
*<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<< this step takes a long time,
about 4-5 days (this is 90 GB of fetched data)*
Fetcher: done
CrawlDb update: starting
CrawlDb update: db: /home/crawl/crawldb
CrawlDb update: segments: [/home/crawl/segments/20120321210937]
CrawlDb update: additions allowed: true
CrawlDb update: URL normalizing: false
CrawlDb update: URL filtering: false
CrawlDb update: Merging segment data into db.
CrawlDb update: done
----- Merge Segments (Step 3 of 8) -----
----- Invert Links (Step 4 of 8) -----
LinkDb: starting
LinkDb: linkdb: /home/crawl/linkdb
LinkDb: URL normalize: true
LinkDb: URL filter: true
LinkDb: adding segment: /home/crawl/segments/20120316115027
<<<<< 99 MB
LinkDb: adding segment: /home/crawl/segments/20120316121103
<<<<< 3 GB
LinkDb: adding segment: /home/crawl/segments/20120316193047
<<<<< 20 GB
LinkDb: adding segment: /home/crawl/segments/20120317162508
<<<<< 20 GB
LinkDb: adding segment: /home/crawl/segments/20120321210937
<<<<< 90 GB
LinkDb: merging with existing linkdb: /home/crawl/linkdb <<<<< *and this step also takes a long time*
Re: Fetching/Indexing process is taking a lot of time
Posted by Mathijs Homminga <ma...@kalooga.com>.
Hi George,
Just to be sure:
Your crawl cycle includes a 'generate', 'fetch' and 'update' step. Is it indeed within the 'fetch' step that this issue occurs?
So, _after_ the Fetcher logs the message "Fetcher: starting" and _before_ the Fetcher logs the message "Fetcher: done"?
If so, it indeed looks like Hadoop is moving your temporary data at very low rates.
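One way to see which phase is slow is to watch how fast the directory being written actually grows. A minimal sketch, assuming the paths from this thread (adjust them for your setup):

```shell
# Approximate write rate (KB/s) of a directory over a sampling interval.
# du -sk reports the tree size in KB; two samples a few seconds apart
# give a rough throughput figure for whatever job is filling it.
rate_kb_per_s() {
    dir=$1
    interval=$2
    before=$(du -sk "$dir" | cut -f1)
    sleep "$interval"
    after=$(du -sk "$dir" | cut -f1)
    echo $(( (after - before) / interval ))
}

# Example: sample the Hadoop temp dir and the newest segment once a minute.
# rate_kb_per_s /home/crawl/hadoop-root 60
# rate_kb_per_s "$(ls -d /home/crawl/segments/* | tail -1)" 60
```

Run it during a fetch: if hadoop-root grows quickly while the segment dir crawls along at ~500 KB/s, the bottleneck is the copy in the reduce phase rather than the fetching itself.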
Mathijs
On 19 Mar 2012, at 03:20, George wrote:
Re: Fetching/Indexing process is taking a lot of time
Posted by George <ad...@proservice.ge>.
- The Fetch job has a mapper which does the fetching, and a reducer which
copies the fetched data to the segment dir. Is it this step where you see the
problem?
yes
Re: Fetching/Indexing process is taking a lot of time
Posted by Mathijs Homminga <ma...@kalooga.com>.
Which version of Hadoop are you using?
In your script, I see that you have a section called "---- Generate, Fetch, Parse, Update (Step 2 of $steps) -----"
At which of these sub steps do you see your problem?
For example: (from the top of my head)
- The Fetch job has a mapper which does the fetching, and a reducer which copies the fetched data to the segment dir. Is it this step where you see the problem?
- The Update job creates a new crawldb and then moves it to the final destination.
Mathijs
On Mar 19, 2012, at 3:20, George wrote:
Re: Fetching/Indexing process is taking a lot of time
Posted by George <ad...@proservice.ge>.
You are right, I'm using Nutch 0.9.
Thank you for the suggestion, but I need help with this version.
Yes, as I said, I have hardware RAID 1 (with BBU + 256 MB cache) from 6 SATA
7200 disks.
Copy speed on the same disk is pretty high, about 130-140 MB/s.
There is no hardware problem.
Maybe I have not configured something, or my fetching script is doing this (I
have not found such a function in it), I don't know.
I just need to know why fetched data goes to a temporary directory and
is then moved to the segment at very low speed.
My hadoop-site.xml:
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/home/crawl/hadoop-${user.name}</value>
    <description>Hadoop temp directory</description>
  </property>
</configuration>
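With this configuration, the Hadoop temp directory and the segments live on the same volume. Purely as a sketch, one variant worth trying is pointing hadoop.tmp.dir at a different physical disk so the copy in the reduce phase reads and writes on separate devices (the /mnt/disk2 path below is hypothetical, not from this thread):

```text
<configuration>
  <property>
    <name>hadoop.tmp.dir</name>
    <!-- Hypothetical path: a disk separate from /home/crawl -->
    <value>/mnt/disk2/hadoop-${user.name}</value>
    <description>Hadoop temp directory on a separate disk</description>
  </property>
</configuration>
```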
Thanks
Re: Fetching/Indexing process is taking a lot of time
Posted by Mathijs Homminga <ma...@kalooga.com>.
Hmmm...
First, you say that you use Nutch 9.0, you probably mean Nutch 0.9. That version is almost 5 years old. I really suggest that you update to 1.4.
What if you manually move such amounts of data on your disks? Same low speed? (btw, do you really have raid 1 (mirroring) on 6 disks?)
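As a concrete way to run that manual test, a crude sequential-write benchmark (TARGET defaults to /tmp here; point it at the RAID array, e.g. /home/crawl, to test the disks from this thread):

```shell
# Crude sequential-write benchmark: write 256 MB of zeros and let dd
# report the throughput. conv=fdatasync flushes to disk before dd exits,
# so the figure includes the final sync, not just page-cache writes.
TARGET=${TARGET:-/tmp}
dd if=/dev/zero of="$TARGET/ddtest.bin" bs=1M count=256 conv=fdatasync
rm -f "$TARGET/ddtest.bin"
```

If this reports something near the 130-140 MB/s George mentions while the Nutch copy step still runs at 500 KB/s, the disks themselves are not the bottleneck.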
Cheers,
Mathijs
On 17 Mar 2012, at 20:59, George wrote:
Re: Fetching/Indexing process is taking a lot of time
Posted by George <ad...@proservice.ge>.
No.
For example, if I run depth 3:
it fetches data to the Hadoop temporary directory, then moves the data to a
new segment,
and does this cycle 3 times.
All data is fetched to hadoop-root (the temporary Hadoop directory),
and then Nutch moves this data to the segment dir in the segments folder.
For example, moving the data takes:
first fetch is about 3 GB, moving takes 0.5-2 hours;
second becomes 10-15 GB and moving takes 10-12 hours;
third cycle becomes 20-25 GB and moving takes 5-7 days, maybe more at deeper
depths.
Re: Fetching/Indexing process is taking a lot of time
Posted by Mathijs Homminga <ma...@kalooga.com>.
Hi,
Your hardware looks okay.
Moving data from 30,000 URLs takes a week at 500 KB/s?
That would mean ~10 MB per URL. Could that be right?
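The arithmetic behind that estimate, spelled out (numbers taken from the thread):

```shell
# A week of copying at 500 KB/s, spread over 30,000 URLs.
kb_per_s=500
seconds=$((7 * 24 * 3600))   # one week
urls=30000
total_kb=$((kb_per_s * seconds))
echo "total: $((total_kb / 1024 / 1024)) GB"   # → total: 288 GB
echo "per url: $((total_kb / urls)) KB"        # → per url: 10080 KB, ~10 MB
```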
Anyway, can you tell us at what stage your crawl script is when this kicks in?
Mathijs
On 17 Mar 2012, at 07:40, George wrote: