Posted to user@nutch.apache.org by George <ad...@proservice.ge> on 2012/03/17 07:40:58 UTC

Fetching/Indexing process is taking a lot of time

Hello

I'm using nutch 9.0, default installation, on a single machine:
2x2.5 quad core
16 GB ram
6 x 1TB sata raid 1
Network 1 gbps.
Not using any distributed file system.

Of course I have it configured:
All headers
Threads: 100

Trying to crawl 30,000 URLs with generate per site set to -1
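(For reference, here is roughly what I mean by those settings; a quick way to check what is actually in effect. I am assuming the usual property names from nutch-default.xml, so treat this only as a sketch:)

grep -A1 -E "http.agent.name|fetcher.threads.fetch|generate.max.per.host" \
    /usr/local/nutch/conf/nutch-site.xml
# http.agent.* headers      -> the "all headers" part
# fetcher.threads.fetch     -> 100 threads
# generate.max.per.host     -> -1 (no per-site limit); my reading of "generate per site -1"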

fetching with :

--Script-------------------------------------------------------------------------------

#!/bin/bash

# runbot script to run the Nutch bot for crawling and re-crawling.
# Usage: bin/runbot [safe]
#        If executed in 'safe' mode, it doesn't delete the temporary
#        directories generated during crawl. This might be helpful for
#        analysis and recovery in case a crawl fails.
#
# Author: Susam Pal

# LOCAL VARIABLES
cd /usr/local/nutch

    export JAVA_HOME=/usr/local/java
    export NUTCH_JAVA_HOME=/usr/local/java

    export NUTCH_HEAPSIZE=2048

    NUTCH_HOME=/usr/local/nutch

#    if [ -e $NUTCH_HOME/nutch.tmp ]
#        then
#        echo "Index process found..."
#    else
#        date >> $NUTCH_HOME/nutch.tmp


depth=1
threads=100
adddays=30
topN=1000000 #Comment this statement if you don't want to set topN value

# Arguments for rm and mv
RMARGS="-rf"
MVARGS="-v"

# Parse arguments
if [ "$1" == "safe" ]
then
  safe=yes
fi

if [ -z "$NUTCH_HOME" ]
then
  NUTCH_HOME=.
  echo runbot: $0 could not find environment variable NUTCH_HOME
  echo runbot: NUTCH_HOME=$NUTCH_HOME has been set by the script
else
  echo runbot: $0 found environment variable NUTCH_HOME=$NUTCH_HOME
fi

if [ -n "$topN" ]
then
  topN="-topN $topN"
else
  topN=""
fi

steps=8
echo "----- Inject (Step 1 of $steps) -----"
/bin/bash $NUTCH_HOME/bin/nutch inject /home/crawl/crawldb urls

echo "----- Generate, Fetch, Parse, Update (Step 2 of $steps) -----"

for ((i=0; i <= depth ; i++))
do
  echo "--- Beginning crawl at depth `expr $i + 1` of $depth ---"
  /bin/bash $NUTCH_HOME/bin/nutch generate /home/crawl/crawldb \
      /home/crawl/segments $topN \
      -adddays $adddays
  if [ $? -ne 0 ]
  then
    echo "runbot: Stopping at depth $depth. No more URLs to fetch."
    break
  fi
  segment=`ls -d /home/crawl/segments/* | tail -1`

  /bin/bash $NUTCH_HOME/bin/nutch fetch $segment -threads $threads
  if [ $? -ne 0 ]
  then
    echo "runbot: fetch $segment at depth `expr $i + 1` failed."
    echo "runbot: Deleting segment $segment."
    rm $RMARGS $segment
    continue
  fi

  /bin/bash $NUTCH_HOME/bin/nutch updatedb /home/crawl/crawldb $segment
done

echo "----- Merge Segments (Step 3 of $steps) -----"
#/bin/bash $NUTCH_HOME/bin/nutch mergesegs /home/crawl/MERGEDsegments /home/crawl/segments/*
#if [ "$safe" != "yes" ]
#then
#  rm $RMARGS /home/crawl/segments
#else
#  rm $RMARGS /home/crawl/BACKUPsegments
#  mv $MVARGS /home/crawl/segments /home/crawl/BACKUPsegments
#fi

#mv $MVARGS /home/crawl/MERGEDsegments /home/crawl/segments

echo "----- Invert Links (Step 4 of $steps) -----"
/bin/bash $NUTCH_HOME/bin/nutch invertlinks /home/crawl/linkdb \
    /home/crawl/segments/*

echo "----- Index (Step 5 of $steps) -----"
/bin/bash $NUTCH_HOME/bin/nutch index /home/crawl/NEWindexes \
    /home/crawl/crawldb /home/crawl/linkdb \
    /home/crawl/segments/*

echo "----- Dedup (Step 6 of $steps) -----"
/bin/bash $NUTCH_HOME/bin/nutch dedup /home/crawl/NEWindexes

echo "----- Merge Indexes (Step 7 of $steps) -----"
/bin/bash $NUTCH_HOME/bin/nutch merge /home/crawl/NEWindex \
    /home/crawl/NEWindexes

echo "----- Loading New Index (Step 8 of $steps) -----"

if [ "$safe" != "yes" ]
then
  rm $RMARGS /home/crawl/NEWindexes
  rm $RMARGS /home/crawl/index
else
  rm $RMARGS /home/crawl/BACKUPindexes
  rm $RMARGS /home/crawl/BACKUPindex
  mv $MVARGS /home/crawl/NEWindexes /home/crawl/BACKUPindexes
  mv $MVARGS /home/crawl/index /home/crawl/BACKUPindex
fi

mv $MVARGS /home/crawl/NEWindex /home/crawl/index

#rm -f ${NUTCH_HOME}/nutch.tmp

    /bin/bash $NUTCH_HOME/bin/nutch readdb /home/crawl/crawldb -stats 1

    /bin/bash $NUTCH_HOME/bin/search.server stop
    /bin/bash $NUTCH_HOME/bin/search.server start

echo "runbot: FINISHED: Crawl completed!"
echo ""

-----Script-----------------------------------------------------------------------------

All data is fetched into the Hadoop temporary directory "hadoop-root" at
/home/crawl/hadoop-root,
and after this step the data is moved from /home/crawl/hadoop-root to
/home/crawl/segments/xxxxxxxxxxx.
This step takes a lot of time; depending on the size it can take a week or
more.
During this step the data moves at a very low speed, about 500 KB/s (sorry, I don't know
what it is doing in this step; I'm just a user and have no Java programming or
Hadoop experience).

Is there any way to make this step faster?

Thanks






Re: Fetching/Indexing process is taking a lot of time

Posted by George <ad...@proservice.ge>.
fetching
http://xxxxxxx.xxx/1-/color-%E1%83%97%E1%83%94%E1%83%97%E1%83%A0%E1%83%98/make-Caterham/1/listings.html
fetch of http://xxxxxxx.xxx/index.php?cat=27&currentpage=5 failed with:
java.net.SocketTimeoutException: connect timed out
fetch of http://xxxxxxx.xxx/gwdict/index.php?a=term&d=1&t=179856 failed
with: java.net.SocketException: Connection reset
fetch of http://wxxxxxxx.xxx/gwdict/index.php?a=term&d=1&t=72596 failed
with: java.net.SocketException: Connection reset
Error parsing:
http://xxxxxxx.xxx/pt/phpThumb.php?src=../movies/screens/mov_63_59472.jpg&w=125&h=90&zc=1:
failed(2,200): org.apache.nutch.parse.ParseException: parser not found for
contentType=image/jpeg
url=http://xxxxxxx.xxx/pt/phpThumb.php?src=../movies/screens/mov_63_59472.jpg&w=125&h=90&zc=1
Error parsing: http://xxxxxxx.xxx/albumimage-159: failed(2,200):
org.apache.nutch.parse.ParseException: parser not found for
contentType=image/png url=http://xxxxxxx.xxx/albumimage-159
Error parsing:
http://xxxxxxx.xxx/show_image_trnsArchive.php?filename=/2010/09/source1238.jpg&cat=1&pid=26373&cache=true:
failed(2,200): org.apache.nutch.parse.ParseException: parser not found for
contentType=image/jpeg
url=http://xxxxxxx.xxx/show_image_trnsArchive.php?filename=/2010/09/source1238.jpg&cat=1&pid=26373&cache=true

<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<< this step takes a long time,
about 4-5 days (this is 90 GB of fetched data; rough numbers below the log)

Fetcher: done
CrawlDb update: starting
CrawlDb update: db: /home/crawl/crawldb
CrawlDb update: segments: [/home/crawl/segments/20120321210937]
CrawlDb update: additions allowed: true
CrawlDb update: URL normalizing: false
CrawlDb update: URL filtering: false
CrawlDb update: Merging segment data into db.
CrawlDb update: done
----- Merge Segments (Step 3 of 8) -----
----- Invert Links (Step 4 of 8) -----
LinkDb: starting
LinkDb: linkdb: /home/crawl/linkdb
LinkDb: URL normalize: true
LinkDb: URL filter: true
LinkDb: adding segment: /home/crawl/segments/20120316115027    <<<<< 99 MB
LinkDb: adding segment: /home/crawl/segments/20120316121103    <<<<< 3 GB
LinkDb: adding segment: /home/crawl/segments/20120316193047    <<<<< 20 GB
LinkDb: adding segment: /home/crawl/segments/20120317162508    <<<<< 20 GB
LinkDb: adding segment: /home/crawl/segments/20120321210937    <<<<< 90 GB
LinkDb: merging with existing linkdb: /home/crawl/linkdb       <<<<< and this step also takes a long time
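(Rough arithmetic on the 90 GB / 4-5 days fetch figure above; these are my own back-of-the-envelope numbers, not anything Nutch reports:)

echo "90*1024*1024/(4.5*24*3600)" | bc -l      # ~242 KB/s average fetch throughput
echo "90*1024*1024/(4.5*24*3600)/100" | bc -l  # ~2.4 KB/s per fetcher thread, averaged over 100 threads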



Re: Fetching/Indexing process is taking a lot of time

Posted by Mathijs Homminga <ma...@kalooga.com>.
Hi George,

Just to be sure: 
Your crawl cycle includes a 'generate', 'fetch' and 'update' step. Is it indeed within the 'fetch' step that this issue occurs? 
So, _after_ the Fetcher logs the message "Fetcher: starting" and _before_ the Fetcher logs the message "Fetcher: done"?

If so, it indeed looks like Hadoop is moving your temporary data at very low rates.
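While that phase is running, a few quick checks might help narrow it down (just a sketch, using the paths from your script; I have not run this against your setup):

# where is data actually growing right now: the temp dir or the segment dir?
watch -n 10 'du -sh /home/crawl/hadoop-root /home/crawl/segments/*'

# are both directories on the same filesystem? (a plain rename would then be cheap)
df -h /home/crawl/hadoop-root /home/crawl/segments

# is the java process CPU-bound, or is the disk saturated, during the slow phase?
top -b -n 1 | head -20
iostat -x 10 3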

Mathijs




Re: Fetching/Indexing process is taking a lot of time

Posted by George <ad...@proservice.ge>.
- The Fetch job has a mapper which does the fetching, and a reducer which
copies the fetched data to the segment dir. Is it this step where you see the
problem?

yes


Re: Fetching/Indexing process is taking a lot of time

Posted by Mathijs Homminga <ma...@kalooga.com>.
Which version of Hadoop are you using?

In your script, I see that you have a section called "---- Generate, Fetch, Parse, Update (Step 2 of $steps) -----"
At which of these sub-steps do you see your problem?

For example (off the top of my head):
- The Fetch job has a mapper which does the fetching, and a reducer which copies the fetched data to the segment dir. Is it in this step that you see the problem?
- The Update job creates a new crawldb and then moves it to the final destination.
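A rough way to tell those apart while it is running (a sketch only; I am assuming the default local-mode log location under the Nutch install dir):

# local-mode jobs log their current phase here
tail -f /usr/local/nutch/logs/hadoop.log

# fetch reducer: the newest segment directory keeps growing
watch -n 30 'du -sh /home/crawl/segments/* | tail -1'

# update job: a temporary sub-directory shows up inside the crawldb before it is swapped in
ls -l /home/crawl/crawldb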
 
Mathijs



Re: Fetching/Indexing process is taking a lot of time

Posted by George <ad...@proservice.ge>.
You are right, I'm using Nutch 0.9.
Thank you for the suggestion, but I need help with this version.
Yes, as I said, I have a hardware RAID 1 (with BBU + 256 MB cache) built from 6 SATA 7200 RPM
disks.
Copy speed on the same disk is pretty high, about 130-140 MB/s.
There is no hardware problem.

Maybe I have not configured something, or my fetching script is doing it (I
have not found such a function in it), I don't know.
I just need to know why fetched data goes to a temporary directory first and
is then moved to the segment at very low speed.

My hadoop-site.xml:

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<configuration>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/home/crawl/hadoop-${user.name}</value>
    <description>Hadoop temp directory</description>
  </property>
</configuration>

Thanks

 


Re: Fetching/Indexing process is taking a lot of time

Posted by Mathijs Homminga <ma...@kalooga.com>.
Hmmm...
First, you say that you use Nutch 9.0; you probably mean Nutch 0.9. That version is almost 5 years old. I really suggest that you upgrade to 1.4.
What if you manually move such amounts of data on your disks? Same low speed? (By the way, do you really have RAID 1 (mirroring) on 6 disks?)
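For example, something along these lines would give a baseline (only a sketch; use any existing multi-GB segment, the name below is just a placeholder, and the test file path is arbitrary):

# raw sequential write speed on the crawl volume (writes a 4 GB test file)
dd if=/dev/zero of=/home/crawl/ddtest bs=1M count=4096 oflag=direct
rm /home/crawl/ddtest

# timed copy of an existing segment within the same volume
time cp -r /home/crawl/segments/20120316193047 /home/crawl/copytest
rm -rf /home/crawl/copytest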

Cheers,
Mathijs



Re: Fetching/Indexing process is taking a lot of time

Posted by George <ad...@proservice.ge>.
No.

For example, if I run depth 3:

it fetches data to the Hadoop temporary directory, then moves the data to a new
segment,
and does this cycle 3 times.

All data is fetched to hadoop-root (the temporary Hadoop directory),
and then Nutch moves this data to the segment dir in the segments folder.
For example, moving the data takes:

first fetch is about 3 GB and moves in 0.5-2 hours
second becomes 10-15 GB and moving takes 10-12 hours
third cycle becomes 20-25 GB and moving takes 5-7 days, maybe more at greater
depths.
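(Rough effective rates from those numbers; this is just my own arithmetic on the figures above:)

echo "3*1024/(2*3600)" | bc -l        # ~0.43 MB/s for the first move (3 GB in ~2 hours)
echo "25*1024/(6*24*3600)" | bc -l    # ~0.05 MB/s for the third (25 GB in ~6 days)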


Re: Fetching/Indexing process is taking a lot of time

Posted by Mathijs Homminga <ma...@kalooga.com>.
Hi,

Your hardware looks okay.

Moving data from 30,000 URLs takes a week at 500 KB/s?
That would mean ~10 MB per URL. Could that be right?
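(For what it is worth, the arithmetic behind that estimate, assuming a full week of continuous copying at 500 KB/s:)

echo "500*1024*3600*24*7/10^9" | bc        # ~309 GB moved in one week
echo "500*1024*3600*24*7/30000/10^6" | bc  # ~10 MB per URL on average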

Anyway, can you tell us at what stage your crawl script is when this kicks in?

Mathijs

