Posted to user@nutch.apache.org by "Eggebrecht, Thomas (GfK Marktforschung)" <th...@gfk.com> on 2011/07/08 14:53:56 UTC

Partitioning selected urls for politeness and scoring

Hi list,

My seed list contains URLs from about 20 different domains. In the first fetch cycles everything is fine and URLs are selected fairly evenly across all domains. But after about 10-15 cycles one domain starts to prevail, and URLs from all other domains are no longer selected. It seems that URLs from that particular domain have the highest scores, so URLs from the other domains no longer stand a chance. Is this assumption correct?

This is not what I want, because I would like to fetch URLs from all domains in each cycle. What would you do in this case?

Best regards, and thanks for any answers
Thomas

(Using nutch-1.2)


GfK SE, Nuremberg, Germany, commercial register Nuremberg HRB 25014; Management Board: Professor Dr. Klaus L. Wübbenhorst (CEO), Pamela Knapp (CFO), Dr. Gerhard Hausruckinger, Petra Heinlein, Debra A. Pruent, Wilhelm R. Wessels; Chairman of the Supervisory Board: Dr. Arno Mahlert
This email and any attachments may contain confidential or privileged information. Please note that unauthorized copying, disclosure or distribution of the material in this email is not permitted.

Re: Partitioning selected urls for politeness and scoring

Posted by Thomas Eggebrecht <th...@googlemail.com>.
Original fetch interval: What do you mean? The script starts once a week (of
course only if it is not already running). A fetch cycle takes 1-3 days
depending on -topN and -depth. If you mean the "next fetch time" attribute
of each URL, I didn't change anything - I think it is 30 days by default.

The high scoring was just an assumption on my part; I haven't looked at the
actual score values yet.
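If I wanted to check, a rough sketch (assuming the crawldb path from my
script below; the output directory name is just an example) would be to
dump the top-scored crawldb entries with readdb and see which domains
they come from:

# dump the 20 highest-scored URLs from the crawldb into a text file
bin/nutch readdb crawl/crawldb -topN 20 crawldb-top20
# in local mode the plain-text result ends up in a single part file
cat crawldb-top20/part-00000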

All my target sites are quite big and should contain more than 1 million
URLs (e.g. www.motor-talk.de). So I tried a big -topN (20,000) and a big
-depth (20), but still only one domain was ever selected, and the runtime
increased to up to five days.

I blocked that predominant domain (BTW this is www.carmondo.de) in
regex-urlfilter.txt, but another domain moved up and became predominant (
www.motor-talk.de).
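The block itself was just an exclusion rule in conf/regex-urlfilter.txt,
roughly like this (a sketch, not necessarily the exact line I used; a
leading "-" excludes matching URLs, "+" includes them):

# exclude the predominant domain
-^http://(www\.)?carmondo\.de/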

Since I'm using Nutch-1.2 I implemented NutchBean for searching. Each search
returns results well distributed across all domains, so I think the index
is OK.

Since everything else seems to be OK, I'll now run the script with a big
-topN and -depth and hope that this behaviour changes at some point - maybe
after all URLs from the predominant domain have been fetched. I'll let you
know.

2011/7/11 lewis john mcgibbney <le...@gmail.com>

> What was the original fetch interval between successive crawls?
>
> Your script looks fine, which also suggests that the crawl itself is
> not the problem. You mentioned that the domain which is being fetched
> more than the others seems to receive a higher score than the other
> sites; how did you ascertain this? I know that this is a simple
> suggestion, but could it possibly be the case that -topN = 500 exceeds
> the number of pages in the domains which are not being fetched at
> subsequent recrawls?
> [...]
>

Re: Partitioning selected urls for politeness and scoring

Posted by lewis john mcgibbney <le...@gmail.com>.
What was the original fetch interval between successive crawls?

Your script looks fine, which also suggests that the crawl itself is
not the problem. You mentioned that the domain which is being fetched
more than the others seems to receive a higher score than the other
sites; how did you ascertain this? I know that this is a simple
suggestion, but could it possibly be the case that -topN = 500 exceeds
the number of pages in the domains which are not being fetched at
subsequent recrawls?

On Mon, Jul 11, 2011 at 2:14 PM, Thomas Eggebrecht <
thomas.eggebrecht@googlemail.com> wrote:

> Hi Lewis,
> No, I don't use the crawl command. I use a step-by-step script adapted
> from the Wiki, and Nutch runs locally on a single server.
> [...]



-- 
*Lewis*

Re: Partitioning selected urls for politeness and scoring

Posted by Thomas Eggebrecht <th...@googlemail.com>.
Hi Lewis,
No, I don't use the crawl command. I use a step-by-step script adapted
from the Wiki, and Nutch runs locally on a single server. The script
below omits merging and indexing, which are separate steps in my
workflow. My (fetch-)workflow is:
- inject
- generate
- fetch
- updatedb

Please see my complete (fetch-)script:
#!/bin/bash
# bash (not plain sh) is required for the (( )) arithmetic for-loop below
RUN_HOME=/home/tsegge
crawl=$RUN_HOME/crawl
nutch=$RUN_HOME/nutch-1.2/bin/nutch
urls=$RUN_HOME/urls/seed.txt

depth=6
topN=500
threads=10
adddays=30

echo "----- Inject (Step 1) -----"
$nutch inject $crawl/crawldb $urls
echo "----- Generate, Fetch, Parse, Update (Step 2) -----"
for((i=0; i < $depth; i++))
do
  echo "--- Beginning crawl at depth `expr $i + 1` of $depth ---"
  $nutch generate $crawl/crawldb $crawl/segments -topN $topN -adddays $adddays
  if [ $? -ne 0 ]
  then
    echo "runbot: Stopping at depth $depth. No more URLs to fetch."
    break
  fi
  segment=`ls -d $crawl/segments/* | tail -1`
  echo "--- fetch at depth `expr $i + 1` of $depth ---"
  $nutch fetch $segment -threads $threads
  if [ $? -ne 0 ]
  then
    echo "runbot: fetch $segment at depth $depth failed. Deleting it."
    rm -rf $segment
    continue
  fi
  echo "--- updatedb at depth `expr $i + 1` of $depth ---"
  $nutch updatedb $crawl/crawldb $segment
done


Kind regards
Thomas Eggebrecht


2011/7/8 lewis john mcgibbney <le...@gmail.com>

> [...]
> Can you explain more about your crawling operation? Are you executing a
> crawl command? If so, what arguments are you passing?
>
> If not, can you give more detail about the job you are running?
> [...]

--
> *Lewis*
>

Re: Partitioning selected urls for politeness and scoring

Posted by lewis john mcgibbney <le...@gmail.com>.
Yes, this would limit the number of URLs from any one domain, but it would
not explain why one domain gets fetched more and more over recursive
fetches of a given seed set.

Can you explain more about your crawling operation? Are you executing a
crawl command? If so, what arguments are you passing?

If not, can you give more detail about the job you are running?

Thank you

On Fri, Jul 8, 2011 at 2:50 PM, Hannes Carl Meyer <hannescarl@googlemail.com
> wrote:

> Hi,
>
> you could set generate.max.per.host to a reasonable value to prevent
> this! In the default configuration it is set to -1, which means unlimited.
>
> BR
>
> Hannes
> [...]



-- 
*Lewis*

Re: Partitioning selected urls for politeness and scoring

Posted by Hannes Carl Meyer <ha...@googlemail.com>.
Hi,

you could set generate.max.per.host to a reasonable value to prevent this!
In the default configuration it is set to -1, which means unlimited.
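For example, a minimal sketch for conf/nutch-site.xml (the property name
is real; the value 100 is arbitrary - tune it to your crawl):

<property>
  <name>generate.max.per.host</name>
  <value>100</value>
  <description>Maximum number of URLs per host in a single fetchlist;
  -1 means unlimited.</description>
</property>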

BR

Hannes

---
Hannes Carl Meyer
www.informera.de

On Fri, Jul 8, 2011 at 2:53 PM, Eggebrecht, Thomas (GfK Marktforschung) <
thomas.eggebrecht@gfk.com> wrote:

> Hi list,
>
> My seed list contains URLs from about 20 different domains. [...]