Posted to user@nutch.apache.org by Tolga <to...@ozses.net> on 2012/05/24 09:17:39 UTC

Large website not fully crawled

Hi,

I am crawling a large website, which is our university's. From the logs 
and some grep'ing, I see that some pdf files were not crawled. Why could 
this happen? I'm crawling with -depth 100 -topN 5.

Regards,

Re: Large website not fully crawled

Posted by Piet van Remortel <pi...@gmail.com>.
that could be it indeed

I googled it for you, first hit searching for "nutch crawl query pages"

http://stackoverflow.com/questions/7045716/nutch-1-2-why-wont-nutch-crawl-url-with-query-strings
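
The short version of that answer: Nutch's default URL filter drops anything
that looks like a query. As a rough sketch (untested here -- recent 1.x
releases read conf/regex-urlfilter.txt, while the old one-shot crawl command
read conf/crawl-urlfilter.txt, so check which file your setup actually uses),
the line responsible looks like this:

    # skip URLs containing certain characters as probable queries, etc.
    -[?*!@=]

Either comment that line out, or keep it and put an accept rule for your
site's query URLs above it, since the first matching pattern wins:

    +^http://www\.sabanciuniv\.edu/eng/\?
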


On Thu, May 24, 2012 at 1:52 PM, Tolga <to...@ozses.net> wrote:

> I might have figured out why. Our website has a lot of query strings in
> addresses. One example is
> http://www.sabanciuniv.edu/eng/?genel_bilgi/yonetim/yonetim_kapak/yonetim_kapak.html.
> Could this be why? If that's the case, how do I crawl it?
>
> Regards,
>
>
> On 5/24/12 11:28 AM, Piet van Remortel wrote:
>
>> I googled for you:
>>
>> "Typically one starts testing one’s configuration by crawling at shallow
>> depths, sharply limiting the number of pages fetched at each level
>> (-topN),
>> and watching the output to check that desired pages are fetched and
>> undesirable pages are not. Once one is confident of the configuration,
>> then
>> an appropriate depth for a full crawl is around 10. The number of pages
>> per
>> level (-topN) for a full crawl can be from tens of thousands to millions,
>> depending on your resources."
>>
>> Also, as the nutch documentation shows, the topN parameter is optional.
>>
>> Can I respectfully suggest that you go through the basic information that
>> is available online to get familiar with Nutch.  Copying the online
>> information into this mailing list is not helping anybody.
>>
>>
>> On Thu, May 24, 2012 at 10:19 AM, Tolga<to...@ozses.net>  wrote:
>>
>>
>>> On 5/24/12 11:00 AM, Piet van Remortel wrote:
>>>
>>>  On Thu, May 24, 2012 at 9:35 AM, Tolga<to...@ozses.net>   wrote:
>>>>
>>>>  - I don't fully understand the use of topN parameter. Should I increase
>>>>
>>>>> it?
>>>>>
>>>>>  yes
>>>>>
>>>> What would a sensible topN value be a for a large university website?
>>>
>>>
>>>>  - You mean parse-pdf thing? I've got that in my nutch-default.xml.
>>>>
>>>>>  good, should work then
>>>>>
>>>>
>>>>  - I looked for the link, it was there. Besides, that was for another
>>>>
>>>>> website I was experimenting on.
>>>>> - How do I check segments?
>>>>>
>>>>>  e.g. with segmentreader, a hadoop access command built in nutch
>>>>>
>>>>
>>>>  - I didn't check filenames, but I've tried searching for a word in that
>>>>
>>>>> PDF file.
>>>>>
>>>>>  then the reason could also be indexing
>>>>>
>>>>
>>>>  - I've got more than 50gb free.
>>>>
>>>>> - I'm not sure about the webserver kicking me off; I'll have to check that
>>>>> with the sysadmin.
>>>>>
>>>>>  should be visible as something like timeouts or a similar message in
>>>>> the
>>>>>
>>>> hadoop logs
>>>>
>>>>
>>>>  Regards,
>>>>
>>>>>
>>>>> On 5/24/12 10:25 AM, Piet van Remortel wrote:
>>>>>
>>>>>  - your topN parameter limited the crawl : see the info at
>>>>>
>>>>>> http://wiki.apache.org/nutch/NutchTutorial
>>>>>>
>>>>>> or :
>>>>>>
>>>>>> - file filters
>>>>>> - there is no link to the files (as you suggested yourself already)
>>>>>> - did you check the correct/all segments ?
>>>>>> - did you check the fully correct filenames ? wildcards don't work on
>>>>>> all
>>>>>> segmentreader approaches
>>>>>> - size limits of the crawler (see previous discussion)
>>>>>> - did you check file presence in the segment, or parse result ?  i.e.
>>>>>> parsing could have failed (cfr the previous discussion of the last few
>>>>>> days)
>>>>>> - your disk got full and crawling stopped
>>>>>> - the webserver(s) kicked you off
>>>>>> - your hadoop logs have overrun the local disk on which the crawler
>>>>>> was
>>>>>> running (i.e. disk full)
>>>>>>
>>>>>> Piet
>>>>>>
>>>>>>
>>>>>> On Thu, May 24, 2012 at 9:17 AM, Tolga<to...@ozses.net>    wrote:
>>>>>>
>>>>>>  Hi,
>>>>>>
>>>>>>  I am crawling a large website, which is our university's. From the
>>>>>>> logs
>>>>>>> and some grep'ing, I see that some pdf files were not crawled. Why
>>>>>>> could
>>>>>>> this happen? I'm crawling with -depth 100 -topN 5.
>>>>>>>
>>>>>>> Regards,
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>

Re: Large website not fully crawled

Posted by Tolga <to...@ozses.net>.
I might have figured out why. Our website has a lot of query strings in 
addresses. One example is 
http://www.sabanciuniv.edu/eng/?genel_bilgi/yonetim/yonetim_kapak/yonetim_kapak.html. 
Could this be why? If that's the case, how do I crawl it?

Regards,

On 5/24/12 11:28 AM, Piet van Remortel wrote:
> I googled for you:
>
> "Typically one starts testing one’s configuration by crawling at shallow
> depths, sharply limiting the number of pages fetched at each level (-topN),
> and watching the output to check that desired pages are fetched and
> undesirable pages are not. Once one is confident of the configuration, then
> an appropriate depth for a full crawl is around 10. The number of pages per
> level (-topN) for a full crawl can be from tens of thousands to millions,
> depending on your resources."
>
> Also, as the nutch documentation shows, the topN parameter is optional.
>
> Can I respectfully suggest that you go through the basic information that
> is available online to get familiar with Nutch.  Copying the online
> information into this mailing list is not helping anybody.
>
>
> On Thu, May 24, 2012 at 10:19 AM, Tolga<to...@ozses.net>  wrote:
>
>>
>> On 5/24/12 11:00 AM, Piet van Remortel wrote:
>>
>>> On Thu, May 24, 2012 at 9:35 AM, Tolga<to...@ozses.net>   wrote:
>>>
>>>   - I don't fully understand the use of topN parameter. Should I increase
>>>> it?
>>>>
>>>>   yes
>> What would a sensible topN value be for a large university website?
>>
>>>
>>>   - You mean parse-pdf thing? I've got that in my nutch-default.xml.
>>>>   good, should work then
>>>
>>>   - I looked for the link, it was there. Besides, that was for another
>>>> website I was experimenting on.
>>>> - How do I check segments?
>>>>
>>>>   e.g. with segmentreader, a hadoop access command built in nutch
>>>
>>>   - I didn't check filenames, but I've tried searching for a word in that
>>>> PDF file.
>>>>
>>>>   then the reason could also be indexing
>>>
>>>   - I've got more than 50gb free.
>>>> - I'm not sure about the webserver kicking me off; I'll have to check that
>>>> with the sysadmin.
>>>>
>>>>   should be visible as something like timeouts or a similar message in the
>>> hadoop logs
>>>
>>>
>>>   Regards,
>>>>
>>>> On 5/24/12 10:25 AM, Piet van Remortel wrote:
>>>>
>>>>   - your topN parameter limited the crawl : see the info at
>>>>> http://wiki.apache.org/nutch/NutchTutorial
>>>>>
>>>>> or :
>>>>>
>>>>> - file filters
>>>>> - there is no link to the files (as you suggested yourself already)
>>>>> - did you check the correct/all segments ?
>>>>> - did you check the fully correct filenames ? wildcards don't work on
>>>>> all
>>>>> segmentreader approaches
>>>>> - size limits of the crawler (see previous discussion)
>>>>> - did you check file presence in the segment, or parse result ?  i.e.
>>>>> parsing could have failed (cfr the previous discussion of the last few
>>>>> days)
>>>>> - your disk got full and crawling stopped
>>>>> - the webserver(s) kicked you off
>>>>> - your hadoop logs have overrun the local disk on which the crawler was
>>>>> running (i.e. disk full)
>>>>>
>>>>> Piet
>>>>>
>>>>>
>>>>> On Thu, May 24, 2012 at 9:17 AM, Tolga<to...@ozses.net>    wrote:
>>>>>
>>>>>   Hi,
>>>>>
>>>>>> I am crawling a large website, which is our university's. From the logs
>>>>>> and some grep'ing, I see that some pdf files were not crawled. Why
>>>>>> could
>>>>>> this happen? I'm crawling with -depth 100 -topN 5.
>>>>>>
>>>>>> Regards,
>>>>>>
>>>>>>
>>>>>>

Re: Large website not fully crawled

Posted by Piet van Remortel <pi...@gmail.com>.
I googled for you:

"Typically one starts testing one’s configuration by crawling at shallow
depths, sharply limiting the number of pages fetched at each level (-topN),
and watching the output to check that desired pages are fetched and
undesirable pages are not. Once one is confident of the configuration, then
an appropriate depth for a full crawl is around 10. The number of pages per
level (-topN) for a full crawl can be from tens of thousands to millions,
depending on your resources."

Also, as the nutch documentation shows, the topN parameter is optional.
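
To make that concrete, a full-site run usually looks closer to something in
this ballpark than to -depth 100 -topN 5 (a sketch only -- adjust the url dir,
crawl dir, thread count and topN to your own install and politeness settings):

    bin/nutch crawl urls -dir crawl -depth 10 -topN 50000 -threads 10
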

Can I respectfully suggest that you go through the basic information that
is available online to get familiar with Nutch.  Copying the online
information into this mailing list is not helping anybody.


On Thu, May 24, 2012 at 10:19 AM, Tolga <to...@ozses.net> wrote:

>
>
> On 5/24/12 11:00 AM, Piet van Remortel wrote:
>
>> On Thu, May 24, 2012 at 9:35 AM, Tolga<to...@ozses.net>  wrote:
>>
>>  - I don't fully understand the use of topN parameter. Should I increase
>>> it?
>>>
>>>  yes
>>
> What would a sensible topN value be for a large university website?
>
>>
>>
>>  - You mean parse-pdf thing? I've got that in my nutch-default.xml.
>>>
>>>  good, should work then
>>
>>
>>  - I looked for the link, it was there. Besides, that was for another
>>> website I was experimenting on.
>>> - How do I check segments?
>>>
>>>  e.g. with segmentreader, a hadoop access command built in nutch
>>
>>
>>  - I didn't check filenames, but I've tried searching for a word in that
>>> PDF file.
>>>
>>>  then the reason could also be indexing
>>
>>
>>  - I've got more than 50gb free.
>>> - I'm not sure about the webserver kicking me off; I'll have to check that
>>> with the sysadmin.
>>>
>>>  should be visible as something like timeouts or a similar message in the
>> hadoop logs
>>
>>
>>  Regards,
>>>
>>>
>>> On 5/24/12 10:25 AM, Piet van Remortel wrote:
>>>
>>>  - your topN parameter limited the crawl : see the info at
>>>> http://wiki.apache.org/nutch/NutchTutorial
>>>>
>>>> or :
>>>>
>>>> - file filters
>>>> - there is no link to the files (as you suggested yourself already)
>>>> - did you check the correct/all segments ?
>>>> - did you check the fully correct filenames ? wildcards don't work on
>>>> all
>>>> segmentreader approaches
>>>> - size limits of the crawler (see previous discussion)
>>>> - did you check file presence in the segment, or parse result ?  i.e.
>>>> parsing could have failed (cfr the previous discussion of the last few
>>>> days)
>>>> - your disk got full and crawling stopped
>>>> - the webserver(s) kicked you off
>>>> - your hadoop logs have overrun the local disk on which the crawler was
>>>> running (i.e. disk full)
>>>>
>>>> Piet
>>>>
>>>>
>>>> On Thu, May 24, 2012 at 9:17 AM, Tolga<to...@ozses.net>   wrote:
>>>>
>>>>  Hi,
>>>>
>>>>> I am crawling a large website, which is our university's. From the logs
>>>>> and some grep'ing, I see that some pdf files were not crawled. Why
>>>>> could
>>>>> this happen? I'm crawling with -depth 100 -topN 5.
>>>>>
>>>>> Regards,
>>>>>
>>>>>
>>>>>

Re: Large website not fully crawled

Posted by Tolga <to...@ozses.net>.

On 5/24/12 11:00 AM, Piet van Remortel wrote:
> On Thu, May 24, 2012 at 9:35 AM, Tolga<to...@ozses.net>  wrote:
>
>> - I don't fully understand the use of topN parameter. Should I increase it?
>>
> yes
What would a sensible topN value be for a large university website?
>
>
>> - You mean parse-pdf thing? I've got that in my nutch-default.xml.
>>
> good, should work then
>
>
>> - I looked for the link, it was there. Besides, that was for another
>> website I was experimenting on.
>> - How do I check segments?
>>
> e.g. with segmentreader, a hadoop access command built in nutch
>
>
>> - I didn't check filenames, but I've tried searching for a word in that
>> PDF file.
>>
> then the reason could also be indexing
>
>
>> - I've got more than 50gb free.
>> - I'm not sure about the webserver kicking me off; I'll have to check that
>> with the sysadmin.
>>
> should be visible as something like timeouts or a similar message in the
> hadoop logs
>
>
>> Regards,
>>
>>
>> On 5/24/12 10:25 AM, Piet van Remortel wrote:
>>
>>> - your topN parameter limited the crawl : see the info at
>>> http://wiki.apache.org/nutch/NutchTutorial
>>>
>>> or :
>>>
>>> - file filters
>>> - there is no link to the files (as you suggested yourself already)
>>> - did you check the correct/all segments ?
>>> - did you check the fully correct filenames ? wildcards don't work on all
>>> segmentreader approaches
>>> - size limits of the crawler (see previous discussion)
>>> - did you check file presence in the segment, or parse result ?  i.e.
>>> parsing could have failed (cfr the previous discussion of the last few
>>> days)
>>> - your disk got full and crawling stopped
>>> - the webserver(s) kicked you off
>>> - your hadoop logs have overrun the local disk on which the crawler was
>>> running (i.e. disk full)
>>>
>>> Piet
>>>
>>>
>>> On Thu, May 24, 2012 at 9:17 AM, Tolga<to...@ozses.net>   wrote:
>>>
>>>   Hi,
>>>> I am crawling a large website, which is our university's. From the logs
>>>> and some grep'ing, I see that some pdf files were not crawled. Why could
>>>> this happen? I'm crawling with -depth 100 -topN 5.
>>>>
>>>> Regards,
>>>>
>>>>

Re: Large website not fully crawled

Posted by Piet van Remortel <pi...@gmail.com>.
On Thu, May 24, 2012 at 9:35 AM, Tolga <to...@ozses.net> wrote:

> - I don't fully understand the use of topN parameter. Should I increase it?
>

yes



> - You mean parse-pdf thing? I've got that in my nutch-default.xml.
>

good, should work then
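
One thing to double-check, though: nutch-default.xml only documents the
shipped defaults, and local overrides normally go into conf/nutch-site.xml.
Worth verifying that the plugin list you are actually running with includes a
PDF-capable parser -- parse-tika in recent 1.x releases, parse-pdf in older
ones. A rough sketch, not the exact default for every version:

    <property>
      <name>plugin.includes</name>
      <value>protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
    </property>
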


> - I looked for the link, it was there. Besides, that was for another
> website I was experimenting on.
> - How do I check segments?
>

e.g. with segmentreader, a hadoop access command built in nutch
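
For example (the segment path below is made up -- point it at one of your
crawl/segments/* directories, and run bin/nutch readseg without arguments to
see the exact options your version supports):

    # list what a segment contains
    bin/nutch readseg -list crawl/segments/20120524091739

    # dump the fetch/parse records as text and grep for PDFs
    bin/nutch readseg -dump crawl/segments/20120524091739 seg_dump -nocontent
    grep -ri '\.pdf' seg_dump
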


> - I didn't check filenames, but I've tried searching for a word in that
> PDF file.
>

then the reason could also be indexing


> - I've got more than 50gb free.
> - I'm not sure about the webserver kicking me off; I'll have to check that
> with the sysadmin.
>

should be visible as something like timeouts or a similar message in the
hadoop logs


>
> Regards,
>
>
> On 5/24/12 10:25 AM, Piet van Remortel wrote:
>
>> - your topN parameter limited the crawl : see the info at
>> http://wiki.apache.org/nutch/NutchTutorial
>>
>> or :
>>
>> - file filters
>> - there is no link to the files (as you suggested yourself already)
>> - did you check the correct/all segments ?
>> - did you check the fully correct filenames ? wildcards don't work on all
>> segmentreader approaches
>> - size limits of the crawler (see previous discussion)
>> - did you check file presence in the segment, or parse result ?  i.e.
>> parsing could have failed (cfr the previous discussion of the last few
>> days)
>> - your disk got full and crawling stopped
>> - the webserver(s) kicked you off
>> - your hadoop logs have overrun the local disk on which the crawler was
>> running (i.e. disk full)
>>
>> Piet
>>
>>
>> On Thu, May 24, 2012 at 9:17 AM, Tolga<to...@ozses.net>  wrote:
>>
>>  Hi,
>>>
>>> I am crawling a large website, which is our university's. From the logs
>>> and some grep'ing, I see that some pdf files were not crawled. Why could
>>> this happen? I'm crawling with -depth 100 -topN 5.
>>>
>>> Regards,
>>>
>>>

Re: Large website not fully crawled

Posted by Tolga <to...@ozses.net>.
- I don't fully understand the use of topN parameter. Should I increase it?
- You mean parse-pdf thing? I've got that in my nutch-default.xml.
- I looked for the link, it was there. Besides, that was for another 
website I was experimenting on.
- How do I check segments?
- I didn't check filenames, but I've tried searching for a word in that 
PDF file.
- I've got more than 50gb free.
- I'm not sure about the webserver kicking me off; I'll have to check that
with the sysadmin.

Regards,

On 5/24/12 10:25 AM, Piet van Remortel wrote:
> - your topN parameter limited the crawl : see the info at
> http://wiki.apache.org/nutch/NutchTutorial
>
> or :
>
> - file filters
> - there is no link to the files (as you suggested yourself already)
> - did you check the correct/all segments ?
> - did you check the fully correct filenames ? wildcards don't work on all
> segmentreader approaches
> - size limits of the crawler (see previous discussion)
> - did you check file presence in the segment, or parse result ?  i.e.
> parsing could have failed (cfr the previous discussion of the last few days)
> - your disk got full and crawling stopped
> - the webserver(s) kicked you off
> - your hadoop logs have overrun the local disk on which the crawler was
> running (i.e. disk full)
>
> Piet
>
>
> On Thu, May 24, 2012 at 9:17 AM, Tolga<to...@ozses.net>  wrote:
>
>> Hi,
>>
>> I am crawling a large website, which is our university's. From the logs
>> and some grep'ing, I see that some pdf files were not crawled. Why could
>> this happen? I'm crawling with -depth 100 -topN 5.
>>
>> Regards,
>>

Re: Large website not fully crawled

Posted by Piet van Remortel <pi...@gmail.com>.
- your topN parameter limited the crawl : see the info at
http://wiki.apache.org/nutch/NutchTutorial

or :

- file filters
- there is no link to the files (as you suggested yourself already)
- did you check the correct/all segments ?
- did you check the fully correct filenames ? wildcards don't work on all
segmentreader approaches
- size limits of the crawler (see previous discussion)
- did you check file presence in the segment, or parse result ?  i.e.
parsing could have failed (cfr the previous discussion of the last few days)
- your disk got full and crawling stopped
- the webserver(s) kicked you off
- your hadoop logs have overrun the local disk on which the crawler was
running (i.e. disk full)
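
To narrow that list down quickly, a couple of checks along these lines help
(the paths and URL are placeholders -- substitute your own crawl directory and
one of the PDF URLs you expected to see):

    # overall crawldb status: counts of fetched / unfetched / gone URLs
    bin/nutch readdb crawl/crawldb -stats

    # status of one specific URL
    bin/nutch readdb crawl/crawldb -url http://www.sabanciuniv.edu/path/to/file.pdf
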

Piet


On Thu, May 24, 2012 at 9:17 AM, Tolga <to...@ozses.net> wrote:

> Hi,
>
> I am crawling a large website, which is our university's. From the logs
> and some grep'ing, I see that some pdf files were not crawled. Why could
> this happen? I'm crawling with -depth 100 -topN 5.
>
> Regards,
>