Posted to dev@manifoldcf.apache.org by Karl Wright <da...@gmail.com> on 2014/09/22 12:31:36 UTC

[RESULT] [VOTE] Release Apache ManifoldCF 1.7.1, RC1

Three +1's, >72 hours.  Vote passes!

Karl

On Mon, Sep 22, 2014 at 5:05 AM, Erlend Garåsen <e....@usit.uio.no>
wrote:

>
> I'm able to fetch documents from www.duo.uio.no using file-based
> synchronization, so there are no network problems.
>
> Anyway, I'll continue to test RC2. Even though I'm not able to use
> Zookeeper-based synchronization on that host, I may find other
> bugs/problems.
>
> Erlend
>
>
> On 22.09.14 10:39, Erlend Garåsen wrote:
>
>>
>> I can verify a possible network problem by using file-based
>> synchronization instead.
>>
>> I'll do that right away and test RC2 as well, even though you already
>> have three +1's.
>>
>> The three other jobs I started before I left my office on Thursday did
>> all complete successfully.
>>
>> Erlend
>>
>> On 19.09.14 12:27, Karl Wright wrote:
>>
>>> Well, it's crawled fine overnight, with no issues whatsoever.  I'm
>>> using a Zookeeper setup, with MCF 1.7.1 RC1.
>>>
>>> I still maintain you've got something broken with the network in your
>>> production machine.
>>>
>>> Karl
>>>
>>> On Thu, Sep 18, 2014 at 5:31 PM, Karl Wright <da...@gmail.com> wrote:
>>>
>>>  Well, FWIW it is still crawling perfectly.  I'll let it run until done.
>>>>
>>>> Karl
>>>>
>>>>
>>>> On Thu, Sep 18, 2014 at 5:29 PM, Erlend Fedt Garåsen <
>>>> e.f.garasen@usit.uio.no> wrote:
>>>>
>>>>>  I know. I spent a lot of time creating the rules, which seem to index
>>>>>> what we really want. Your observation is correct. Crawling Dspace
>>>>>> repositories is very difficult. There are a lot of nonsense pages we
>>>>>> need to filter out.
>>>>>
>>>>> We have crawled this host the last two years using file based synch.
>>>>>
>>>>> I'm planning a new approach, i.e. using a connector etc.
>>>>>
>>>>> E
>>>>>
>>>>> Sent from my iPhone
>>>>>
>>>>>  On 18. sep. 2014, at 22:35, "Karl Wright" <da...@gmail.com> wrote:
>>>>>>
>>>>>> Ok, I started this crawl.  It fetched and processed robots.txt
>>>>>> perfectly.
>>>>>>
>>>>>> And then I saw the following: lots of fetches of fairly good-sized
>>>>>> documents, with very few ingestions.  The documents that did not
>>>>>> ingest
>>>>>> look like this:
>>>>>>
>>>>>>
>>>>>> https://www.duo.uio.no/handle/10852/163/discover?order=DESC&
>>>>>> r...pp=100&sort_by=dc.date.issued_dt
>>>>>
>>>>>
>>>>>>
>>>>>> I think your index inclusion rules may be excluding most of the
>>>>>> content.
>>>>>>
>>>>>> Karl
>>>>>>
>>>>>>
>>>>>>
>>>>>>  On Thu, Sep 18, 2014 at 8:48 AM, Karl Wright <da...@gmail.com>
>>>>>>>
>>>>>> wrote:
>>>>>
>>>>>>
>>>>>>> Thanks -- I will probably not be able to get to this further until
>>>>>>> tonight anyhow.
>>>>>>>
>>>>>>> Karl
>>>>>>>
>>>>>>> On Thu, Sep 18, 2014 at 8:16 AM, Erlend Garåsen <
>>>>>>>
>>>>>> e.f.garasen@usit.uio.no>
>>>>>
>>>>>> wrote:
>>>>>>>
>>>>>>>
>>>>>>>> I tried to fetch documents by using curl from our prod server just
>>>>>>>> in case a webmaster had blocked access. No problem. Maybe I should
>>>>>>>> ask the webmaster of that host anyway, just to be sure.
>>>>>>>>
>>>>>>>> The interrupted message may have been caused by an abort of that
>>>>>>>> job.
>>>>>>>>
>>>>>>>> I think I should just stop the problematic job and start all the
>>>>>>>> other three remaining jobs instead. I bet they will all complete.
>>>>>>>> Ideally we shouldn't crawl www.duo.uio.no at all since it's a Dspace
>>>>>>>> resource. I have just contacted someone who is indexing Dspace
>>>>>>>> resources. I guess a Dspace connector is a better approach.
>>>>>>>>
>>>>>>>> Below you'll find some parameters.
>>>>>>>>
>>>>>>>> REPOSITORY CONNECTION
>>>>>>>> ---------------------
>>>>>>>> Throttling -> max connections: 30
>>>>>>>> Throttling -> Max fetches/min: 100
>>>>>>>> Bandwidth -> max connections: 25
>>>>>>>> Bandwidth -> max kbytes/sec: 8000
>>>>>>>> Bandwidth -> max fetches/min: 20
>>>>>>>>
>>>>>>>> JOB SETTINGS
>>>>>>>> ------------
>>>>>>>>
>>>>>>>> Hop filters: Keep forever
>>>>>>>>
>>>>>>>> Seeds: https://www.duo.uio.no/
>>>>>>>>
>>>>>>>> Exclude from crawl:
>>>>>>>> # Exclude some file types:
>>>>>>>> \.gif$
>>>>>>>> \.GIF$
>>>>>>>> \.jpeg$
>>>>>>>> \.JPEG$
>>>>>>>> \.jpg$
>>>>>>>> \.JPG$
>>>>>>>> \.png$
>>>>>>>> \.PNG$
>>>>>>>> \.mpg$
>>>>>>>> \.MPG$
>>>>>>>> \.mpeg$
>>>>>>>> \.MPEG$
>>>>>>>> \.exe$
>>>>>>>> \.bmp$
>>>>>>>> \.BMP$
>>>>>>>> \.mov$
>>>>>>>> \.MOV$
>>>>>>>> \.wmf$
>>>>>>>> \.css$
>>>>>>>> \.ico$
>>>>>>>> \.ICO$
>>>>>>>> \.mp2$
>>>>>>>> \.mp3$
>>>>>>>> \.mp4$
>>>>>>>> \.wmv$
>>>>>>>> \.tif$
>>>>>>>> \.tiff$
>>>>>>>> \.avi$
>>>>>>>> \.ogg$
>>>>>>>> \.ogv$
>>>>>>>> \.zip$
>>>>>>>> \.gz$
>>>>>>>> \.psd$
>>>>>>>>
>>>>>>>> # TIKA-1011
>>>>>>>> \.mhtml$
>>>>>>>>
>>>>>>>> # Exclude log files:
>>>>>>>> \.log$
>>>>>>>> \.logfile$
>>>>>>>>
>>>>>>>> # In general, do not allow indexing of DUO search results:
>>>>>>>> https?://www\.duo\.uio\.no/sok/search.*
>>>>>>>>
>>>>>>>> # Other DUO elements to exclude:
>>>>>>>> https://www\.duo\.uio\.no.*open-search/description\.xml$
>>>>>>>> https://www\.duo\.uio\.no/(inn|login|feed|search|
>>>>>>>> advanced-search|community-list|browse|password-login|
>>>>>>>> inn|discover).*
>>>>>>>>
>>>>>>>> # Skip locale settings - makes duplicates:
>>>>>>>> https://www\.duo\.uio\.no/.*\?locale-attribute=\w{2}$
>>>>>>>>
>>>>>>>> # Temporarily skip PDFs since we are indexing abstracts:
>>>>>>>> https://www\.duo\.uio\.no/bitstream/handle/.+
>>>>>>>>
>>>>>>>> # skip full item record:
>>>>>>>> https://www\.duo\.uio\.no/handle/\d{9}/\d+\?show=full$
>>>>>>>> # new URL structure:
>>>>>>>> https://www\.duo\.uio\.no/handle/.*\?show=full$
>>>>>>>>
>>>>>>>> # Skip all navigations but "start with letter":
>>>>>>>> https://www\.duo\.uio\.no/.*type=(author|dateissued)$
>>>>>>>>
>>>>>>>> # Skip search:
>>>>>>>> #https://www\.duo\.uio\.no/handle/.*/discover\?.*
>>>>>>>> https://www\.duo\.uio\.no/handle/.*search-filter\?.*
>>>>>>>> # new URL structure:
>>>>>>>> https://www\.duo\.uio\.no/discover\?.*
>>>>>>>> https://www\.duo\.uio\.no/search-filter\?.*
>>>>>>>>
>>>>>>>> # Skip statistics:
>>>>>>>> https://www\.duo\.uio\.no/handle/.*/statistics$
>>>>>>>>
>>>>>>>> Exclude from index:
>>>>>>>> # Exclude front page - no valuable info and we have QL:
>>>>>>>> https?://www\.duo\.uio\.no/$
>>>>>>>>
>>>>>>>> # Do not index navigation, but follow:
>>>>>>>> https://www\.duo\.uio\.no/handle/\d{9}/\d+/.+
>>>>>>>> # new URL structure:
>>>>>>>> https://www\.duo\.uio\.no/handle/\d+/\d+/.+
>>>>>>>>
>>>>>>>> # Exclude ids of four digits or fewer, probably category listings:
>>>>>>>> https://www\.duo\.uio\.no/handle/\d{9}/\d{1,4}$
>>>>>>>> # new URL structure:
>>>>>>>> https://www\.duo\.uio\.no/handle/\d+/\d{1,3}$
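As a sanity check, exclusion rules like the quoted ones can be exercised with a short script. This is a sketch, assuming the patterns are applied as unanchored regex matches against each discovered URL; the sample URLs below are made up for illustration:

```python
import re

# A handful of the "Exclude from crawl" patterns quoted above.
EXCLUDE = [re.compile(p) for p in (
    r"\.gif$",
    r"https?://www\.duo\.uio\.no/sok/search.*",
    r"https://www\.duo\.uio\.no/handle/.*\?show=full$",
    r"https://www\.duo\.uio\.no/.*\?locale-attribute=\w{2}$",
)]

def excluded(url: str) -> bool:
    """True if any exclusion pattern matches somewhere in the URL."""
    return any(rx.search(url) for rx in EXCLUDE)

for url in (
    "https://www.duo.uio.no/handle/10852/163",            # item page: kept
    "https://www.duo.uio.no/handle/10852/163?show=full",  # full record: skipped
    "https://www.duo.uio.no/logo.gif",                    # image: skipped
):
    print(url, "->", "EXCLUDE" if excluded(url) else "keep")
```

Patterns anchored with `$` still behave as intended under `search()`, since the anchor binds to the end of the URL.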
>>>>>>>>
>>>>>>>> Thanks for looking at this!
>>>>>>>>
>>>>>>>> BTW: Within an hour, I will be away from my computer and cannot test
>>>>>>>> anymore until Monday. I'm leaving Oslo for some days, but I will
>>>>>>>> still be able to read and answer emails.
>>>>>>>>
>>>>>>>> Erlend
>>>>>>>>
>>>>>>>>
>>>>>>>>  On 18.09.14 13:43, Karl Wright wrote:
>>>>>>>>>
>>>>>>>>> Hi Erlend,
>>>>>>>>>
>>>>>>>>> The "Interrupted: null" message with a -104 code means only that
>>>>>>>>> the fetch was interrupted by something.  Unfortunately, the message
>>>>>>>>> is not clear about what the cause of the interruption is.  This is
>>>>>>>>> unrelated to Zookeeper; but I agree that it is suspicious that many
>>>>>>>>> such interruptions appear right after robots is parsed.
>>>>>>>>>
>>>>>>>>> One cause of a -104 is when the target server forcibly drops the
>>>>>>>>> connection, so an InterruptedIOException is thrown.  Having a look
>>>>>>>>> at the timestamps for the fetch messages, it looks believable that
>>>>>>>>> you might have exceeded some predetermined limit on that machine.
>>>>>>>>> They're all within a few milliseconds of each other.  When a robots
>>>>>>>>> file needs to be read, ManifoldCF creates an event for that, and the
>>>>>>>>> urls blocked by that event will all be 'fetchable' as soon as the
>>>>>>>>> event is released.  Perhaps your throttling needs to be adjusted now
>>>>>>>>> that the rate limit bug has been fixed?
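The burst described above, where all URLs blocked on a robots.txt event become fetchable the moment the event is released, is exactly what a per-host rate limiter has to absorb. A minimal sketch of such a limiter, illustrative only and not ManifoldCF's actual throttling code:

```python
import threading
import time

class FetchThrottle:
    """Space fetches at most max_fetches_per_min apart for one host.
    Illustrative sketch, not ManifoldCF's implementation."""

    def __init__(self, max_fetches_per_min: int):
        self.interval = 60.0 / max_fetches_per_min  # seconds between slots
        self.lock = threading.Lock()
        self.next_allowed = 0.0

    def acquire(self) -> None:
        # Reserve the next free slot under the lock, then sleep outside it,
        # so a burst of simultaneous callers gets spread out evenly.
        with self.lock:
            now = time.monotonic()
            wait = self.next_allowed - now
            self.next_allowed = max(now, self.next_allowed) + self.interval
        if wait > 0:
            time.sleep(wait)

throttle = FetchThrottle(max_fetches_per_min=120)  # one slot every 0.5 s
start = time.monotonic()
for _ in range(3):
    throttle.acquire()
elapsed = time.monotonic() - start
print(f"3 acquires took {elapsed:.2f}s")  # roughly 1.0 s at 120 fetches/min
```

Each `acquire()` reserves the next free slot, so requests released together are paced at the configured rate instead of hitting the server within a few milliseconds of each other.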
>>>>>>>>>
>>>>>>>>> I won't be able to work with this without at least your crawling
>>>>>>>>> parameters for the server in question.  I can ping that server, so
>>>>>>>>> if you would like I can try crawling that server from here.
>>>>>>>>>
>>>>>>>>> For zookeeper, I would still try to either increase your tick count
>>>>>>>>> to maybe 10000, or better yet, find out why you periodically lose
>>>>>>>>> the ability to transmit pings from MCF to your zookeeper process.
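If the tick count here refers to ZooKeeper's own `tickTime` (milliseconds per tick, with session timeouts negotiated as multiples of it), the change would go in `zoo.cfg`. An illustrative fragment; the data directory and port are assumptions, not values from this thread:

```
# zoo.cfg -- illustrative values only
tickTime=10000
dataDir=/var/lib/zookeeper
clientPort=2181
```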
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>> Karl
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Thu, Sep 18, 2014 at 7:15 AM, Erlend Garåsen <
>>>>>>>>>
>>>>>>>> e.f.garasen@usit.uio.no
>>>>>
>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>  On 18.09.14 13:00, Karl Wright wrote:
>>>>>>>>>>
>>>>>>>>>> Hi Erlend,
>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> please can you also add the manifoldcf log as well?
>>>>>>>>>>>
>>>>>>>>>> Yes, I will, but it includes entries from RC0 as well.
>>>>>>>>>>
>>>>>>>>>> MCF works perfectly using the other jobs for the other hosts.
>>>>>>>>>> Take a look at the following once again. MCF is being interrupted:
>>>>>>>>>> INFO 2014-09-18 11:13:42,824 (Worker thread '19') - WEB: FETCH URL|
>>>>>>>>>> https://www.duo.uio.no/|1411030940209+682605|-104|4096|
>>>>>>>>>> org.apache.manifoldcf.core.interfaces.ManifoldCFException|
>>>>>>>>>> Interrupted: Interrupted: null
>>>>>>>>>>
>>>>>>>>>> You can find this entry near the one regarding the robots.txt file:
>>>>>>>>>> http://folk.uio.no/erlendfg/manifoldcf/manifoldcf.log
>>>>>>>>>>
>>>>>>>>>> Erlend
>>>>>>>>>>
>>>>>>>>>
>>>>>>>
>>>>>
>>>>
>>>>
>>>
>>
>

Re: [RESULT] [VOTE] Release Apache ManifoldCF 1.7.1, RC1

Posted by Karl Wright <da...@gmail.com>.
Oops, sorry, wrong thread. RC1 did NOT pass.  Will close the RC2 thread in
a minute.
Karl

On Mon, Sep 22, 2014 at 6:31 AM, Karl Wright <da...@gmail.com> wrote:

> Three +1's, >72 hours.  Vote passes!
>
> Karl