Posted to dev@manifoldcf.apache.org by Karl Wright <da...@gmail.com> on 2014/09/19 01:37:46 UTC

[CANCEL] [VOTE] Release Apache ManifoldCF 1.7.1, RC1

CONNECTORS-1041.
Karl

On Thu, Sep 18, 2014 at 5:31 PM, Karl Wright <da...@gmail.com> wrote:

> Well, FWIW it is still crawling perfectly.  I'll let it run until done.
>
> Karl
>
>
> On Thu, Sep 18, 2014 at 5:29 PM, Erlend Fedt Garåsen <
> e.f.garasen@usit.uio.no> wrote:
>
>> I know. I spent a lot of time creating the rules, which seem to index
>> what we really want. Your observation is correct. Crawling DSpace
>> repositories is very difficult. There are a lot of nonsense pages we
>> need to filter out.
>>
>> We have crawled this host for the last two years using file-based synch.
>>
>> I'm planning a new approach, i.e. using a connector etc.
>>
>> E
>>
>> Sent from my iPhone
>>
>> > On 18. sep. 2014, at 22:35, "Karl Wright" <da...@gmail.com> wrote:
>> >
>> > Ok, I started this crawl.  It fetched and processed robots.txt
>> perfectly.
>> > And then I saw the following: lots of fetches of fairly good-sized
>> > documents, with very few ingestions.  The documents that did not ingest
>> > look like this:
>> >
>> >
>> https://www.duo.uio.no/handle/10852/163/discover?order=DESC&r...pp=100&sort_by=dc.date.issued_dt
>> >
>> >
>> > I think your index inclusion rules may be excluding most of the content.
>> >
>> > Karl
>> >
>> >
>> >
>> >> On Thu, Sep 18, 2014 at 8:48 AM, Karl Wright <da...@gmail.com>
>> wrote:
>> >>
>> >> Thanks -- I will probably not be able to get to this further until
>> tonight
>> >> anyhow.
>> >>
>> >> Karl
>> >>
>> >> On Thu, Sep 18, 2014 at 8:16 AM, Erlend Garåsen <
>> e.f.garasen@usit.uio.no>
>> >> wrote:
>> >>
>> >>>
>> >>> I tried to fetch documents by using curl from our prod server just in
>> >>> case a webmaster had blocked access. No problem. Maybe I should ask
>> the
>> >>> webmaster of that host anyway, just to be sure.
>> >>>
>> >>> The interrupted message may have been caused by an abort of that job.
>> >>>
>> >>> I think I should just stop the problematic job and start all the other
>> >>> three remaining jobs instead. I bet they will all complete. Ideally we
>> >>> shouldn't crawl www.duo.uio.no at all since it's a Dspace resource. I
>> >>> have just contacted someone who is indexing Dspace resources. I guess
>> a
>> >>> Dspace connector is a better approach.
>> >>>
>> >>> Below you'll find some parameters.
>> >>>
>> >>> REPOSITORY CONNECTION
>> >>> ---------------------
>> >>> Throttling -> max connections: 30
>> >>> Throttling -> Max fetches/min: 100
>> >>> Bandwidth -> max connections: 25
>> >>> Bandwidth -> max kbytes/sec: 8000
>> >>> Bandwidth -> max fetches/min: 20
>> >>>
>> >>> JOB SETTINGS
>> >>> ------------
>> >>>
>> >>> Hop filters: Keep forever
>> >>>
>> >>> Seeds: https://www.duo.uio.no/
>> >>>
>> >>> Exclude from crawl:
>> >>> # Exclude some file types:
>> >>> \.gif$
>> >>> \.GIF$
>> >>> \.jpeg$
>> >>> \.JPEG$
>> >>> \.jpg$
>> >>> \.JPG$
>> >>> \.png$
>> >>> \.PNG$
>> >>> \.mpg$
>> >>> \.MPG$
>> >>> \.mpeg$
>> >>> \.MPEG$
>> >>> \.exe$
>> >>> \.bmp$
>> >>> \.BMP$
>> >>> \.mov$
>> >>> \.MOV$
>> >>> \.wmf$
>> >>> \.css$
>> >>> \.ico$
>> >>> \.ICO$
>> >>> \.mp2$
>> >>> \.mp3$
>> >>> \.mp4$
>> >>> \.wmv$
>> >>> \.tif$
>> >>> \.tiff$
>> >>> \.avi$
>> >>> \.ogg$
>> >>> \.ogv$
>> >>> \.zip$
>> >>> \.gz$
>> >>> \.psd$
>> >>>
>> >>> # TIKA-1011
>> >>> \.mhtml$
>> >>>
>> >>> # Exclude log files:
>> >>> \.log$
>> >>> \.logfile$
>> >>>
>> >>> # In general, do not allow indexing of DUO search results:
>> >>> https?://www\.duo\.uio\.no/sok/search.*
>> >>>
>> >>> # Other elements in DUO that should be excluded:
>> >>> https://www\.duo\.uio\.no.*open-search/description\.xml$
>> >>> https://www\.duo\.uio\.no/(inn|login|feed|search|advanced-search|community-list|browse|password-login|inn|discover).*
>> >>>
>> >>> # Skip locale settings - makes duplicates:
>> >>> https://www\.duo\.uio\.no/.*\?locale-attribute=\w{2}$
>> >>>
>> >>> # Temporarily skip PDFs since we are indexing abstracts:
>> >>> https://www\.duo\.uio\.no/bitstream/handle/.+
>> >>>
>> >>> # skip full item record:
>> >>> https://www\.duo\.uio\.no/handle/\d{9}/\d+\?show=full$
>> >>> # new URL structure:
>> >>> https://www\.duo\.uio\.no/handle/.*\?show=full$
>> >>>
>> >>> # Skip all navigations but "start with letter":
>> >>> https://www\.duo\.uio\.no/.*type=(author|dateissued)$
>> >>>
>> >>> # Skip search:
>> >>> #https://www\.duo\.uio\.no/handle/.*/discover\?.*
>> >>> https://www\.duo\.uio\.no/handle/.*search-filter\?.*
>> >>> # new URL structure:
>> >>> https://www\.duo\.uio\.no/discover\?.*
>> >>> https://www\.duo\.uio\.no/search-filter\?.*
>> >>>
>> >>> # Skip statistics:
>> >>> https://www\.duo\.uio\.no/handle/.*/statistics$
>> >>>
>> >>> Exclude from index:
>> >>> # Exclude front page - no valuable info and we have QL:
>> >>> https?://www\.duo\.uio\.no/$
>> >>>
>> >>> # Do not index navigation, but follow:
>> >>> https://www\.duo\.uio\.no/handle/\d{9}/\d+/.+
>> >>> # new URL structure:
>> >>> https://www\.duo\.uio\.no/handle/\d+/\d+/.+
>> >>>
>> >>> # Exclude IDs lower than four, probably category listings:
>> >>> https://www\.duo\.uio\.no/handle/\d{9}/\d{1,4}$
>> >>> # new URL structure:
>> >>> https://www\.duo\.uio\.no/handle/\d+/\d{1,3}$
>> >>>
>> >>> Thanks for looking at this!
>> >>>
>> >>> BTW: Within an hour, I will be away from my computer and cannot test
>> >>> anymore until Monday. I'm leaving Oslo for some days, but I will
>> still be
>> >>> able to read and answer emails.
>> >>>
>> >>> Erlend
>> >>>
>> >>>
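One way to sanity-check rules like the ones quoted above is to run a URL against both lists offline. The following is only a minimal sketch: it copies a subset of the rules from this thread, and it assumes the patterns are applied with "find anywhere in the URL" semantics and that Python's re module treats these particular patterns the same way as the Java regex engine ManifoldCF uses. The helper names and the second and third test URLs are hypothetical; the first URL is just the visible prefix of the one Karl reported as not ingesting.

```python
import re

# Subset of the "Exclude from crawl" rules quoted in this thread.
CRAWL_EXCLUDES = [
    r"\.gif$",
    r"\.zip$",
    r"https?://www\.duo\.uio\.no/sok/search.*",
    r"https://www\.duo\.uio\.no/(inn|login|feed|search|advanced-search|"
    r"community-list|browse|password-login|inn|discover).*",
    r"https://www\.duo\.uio\.no/bitstream/handle/.+",
    r"https://www\.duo\.uio\.no/handle/.*\?show=full$",
    r"https://www\.duo\.uio\.no/discover\?.*",
    r"https://www\.duo\.uio\.no/handle/.*/statistics$",
]

# The "Exclude from index" rules quoted in this thread.
INDEX_EXCLUDES = [
    r"https?://www\.duo\.uio\.no/$",
    r"https://www\.duo\.uio\.no/handle/\d{9}/\d+/.+",
    r"https://www\.duo\.uio\.no/handle/\d+/\d+/.+",
    r"https://www\.duo\.uio\.no/handle/\d{9}/\d{1,4}$",
    r"https://www\.duo\.uio\.no/handle/\d+/\d{1,3}$",
]


def first_match(url, patterns):
    """Return the first pattern found anywhere in url, or None."""
    for pattern in patterns:
        if re.search(pattern, url):  # assumption: search, not full match
            return pattern
    return None


def classify(url):
    """Hypothetical helper: say what the two rule lists would do with url."""
    if first_match(url, CRAWL_EXCLUDES) is not None:
        return "not fetched"
    if first_match(url, INDEX_EXCLUDES) is not None:
        return "fetched, not indexed"
    return "fetched and indexed"


if __name__ == "__main__":
    for url in (
        "https://www.duo.uio.no/handle/10852/163/discover?order=DESC",
        "https://www.duo.uio.no/handle/10852/12345",  # hypothetical item URL
        "https://www.duo.uio.no/discover?query=x",    # hypothetical search URL
    ):
        print(url, "->", classify(url))
```

Under these assumptions, a handle-scoped discover URL is not caught by any crawl exclusion (the rule that would have caught it is commented out in the job), but it does match the "do not index navigation, but follow" rule, which would be consistent with many fetches and few ingestions.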
>> >>>> On 18.09.14 13:43, Karl Wright wrote:
>> >>>>
>> >>>> Hi Erlend,
>> >>>>
>> >>>> The "Interrupted: null" message with a -104 code means only that the
>> >>>> fetch
>> >>>> was interrupted by something.  Unfortunately, the message is not
>> clear
>> >>>> about what the cause of the interruption is.  This is unrelated to
>> >>>> Zookeeper; but I agree that it is suspicious that many such
>> interruptions
>> >>>> appear right after robots is parsed.
>> >>>>
>> >>>> One cause of a -104 is when the target server forcibly drops the
>> >>>> connection, so an InterruptedIOException is thrown.  Having a look
>> at the
>> >>>> timestamps for the fetch messages, it looks believable that you might
>> >>>> have
>> >>>> exceeded some predetermined limit on that machine.  They're all
>> within a
>> >>>> few milliseconds of each other.  When a robots file needs to be read,
>> >>>> ManifoldCF creates an event for that, and the urls blocked by that
>> event
>> >>>> will all be 'fetchable' as soon as the event is released.  Perhaps
>> your
>> >>>> throttling needs to be adjusted now that the rate limit bug has been
>> >>>> fixed?
>> >>>>
>> >>>> I won't be able to work with this without at least your crawling
>> >>>> parameters
>> >>>> for the server in question.  I can ping that server so if you would
>> like
>> >>>> I
>> >>>> can try crawling that server from here.
>> >>>>
>> >>>> For zookeeper, I would still try to either increase your tick count
>> to
>> >>>> maybe 10000, or better yet, find out why you periodically lose the
>> >>>> ability
>> >>>> to transmit pings from MCF to your zookeeper process.
>> >>>>
>> >>>> Thanks,
>> >>>> Karl
>> >>>>
>> >>>>
>> >>>>
>> >>>>
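To check whether the -104 entries really cluster within a few milliseconds of each other, the log can be scanned with a small script. This is a rough sketch only: it assumes each "WEB: FETCH URL" entry sits on one physical line in manifoldcf.log, that the fields after the marker are pipe-separated with the URL first and the result code third (as in the entry quoted in this thread), and the function names are hypothetical. The second sample line is invented for illustration.

```python
import re
from datetime import datetime, timedelta

# Log timestamp, then anything up to the fetch marker, then the
# pipe-separated fields (assumed layout, based on the quoted entry).
LINE_RE = re.compile(
    r"^\s*\w+\s+(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}) "
    r".*?WEB: FETCH URL\|(?P<rest>.*)$"
)


def parse_fetch_line(line):
    """Return timestamp, URL and result code for a fetch line, else None."""
    match = LINE_RE.match(line)
    if match is None:
        return None
    fields = match.group("rest").split("|")
    if len(fields) < 3:
        return None
    try:
        code = int(fields[2])
    except ValueError:
        return None
    stamp = datetime.strptime(match.group("ts"), "%Y-%m-%d %H:%M:%S,%f")
    return {"timestamp": stamp, "url": fields[0], "code": code}


def interrupted_spread_ms(lines, code=-104):
    """Count entries with the given code and the time span they cover in ms."""
    stamps = []
    for line in lines:
        entry = parse_fetch_line(line)
        if entry is not None and entry["code"] == code:
            stamps.append(entry["timestamp"])
    if not stamps:
        return 0, 0
    spread = (max(stamps) - min(stamps)) // timedelta(milliseconds=1)
    return len(stamps), spread


if __name__ == "__main__":
    sample = [
        # Entry quoted in this thread, joined onto one line.
        "INFO 2014-09-18 11:13:42,824 (Worker thread '19') - WEB: FETCH URL|"
        "https://www.duo.uio.no/|1411030940209+682605|-104|4096|"
        "org.apache.manifoldcf.core.interfaces.ManifoldCFException|"
        " Interrupted: Interrupted: null",
        # Invented second entry, for illustration only.
        "INFO 2014-09-18 11:13:42,829 (Worker thread '7') - WEB: FETCH URL|"
        "https://www.duo.uio.no/example|1411030940209+682610|-104|0|"
        "org.apache.manifoldcf.core.interfaces.ManifoldCFException|"
        " Interrupted: Interrupted: null",
    ]
    print(interrupted_spread_ms(sample))
```

A large count with a very small spread would support the idea that many blocked URLs were released at once after the robots fetch, which is the case where the throttling settings are worth revisiting.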
>> >>>> On Thu, Sep 18, 2014 at 7:15 AM, Erlend Garåsen <
>> e.f.garasen@usit.uio.no>
>> >>>> wrote:
>> >>>>
>> >>>>> On 18.09.14 13:00, Karl Wright wrote:
>> >>>>>
>> >>>>>> Hi Erlend,
>> >>>>>>
>> >>>>>> please can you also add the manifoldcf log as well?
>> >>>>>
>> >>>>> Yes, I will, but it includes entries from RC0 as well.
>> >>>>>
>> >>>>> MCF works perfectly using the other jobs for the other hosts. Take a
>> >>>>> look
>> >>>>> at the following once again. MCF is being interrupted:
>> >>>>> INFO 2014-09-18 11:13:42,824 (Worker thread '19') - WEB: FETCH URL|
>> >>>>> https://www.duo.uio.no/|1411030940209+682605|-104|
>> >>>>> 4096|org.apache.manifoldcf.core.interfaces.ManifoldCFException|
>> >>>>> Interrupted: Interrupted: null
>> >>>>>
>> >>>>> You can find this entry near the other regarding the robots.txt
>> file:
>> >>>>> http://folk.uio.no/erlendfg/manifoldcf/manifoldcf.log
>> >>>>>
>> >>>>> Erlend
>> >>
>>
>
>