
Posted to dev@manifoldcf.apache.org by Erlend Garåsen <e....@usit.uio.no> on 2011/06/20 16:53:32 UTC

Excluding html files and following links

I just realized that if I exclude html files for a job, links in these 
files will not be followed. Is this a desirable behaviour? Should links 
be followed regardless of the exclude filter?

I discovered this issue when I tried to crawl only PDFs and realized 
that the job ended without finding any documents at all. I think I had 
something like this in my include list:
http://foreninger.uio.no/.*\.pdf$
http://folk.uio.no/.*\.pdf$

Erlend

-- 
Erlend Garåsen
Center for Information Technology Services
University of Oslo
P.O. Box 1086 Blindern, N-0317 OSLO, Norway
Ph: (+47) 22840193, Fax: (+47) 22852970, Mobile: (+47) 91380968, VIP: 31050
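
A minimal sketch of why the job came up empty, assuming the include list is checked against every URL before it is fetched (as the reply in this thread describes). It is written in Python purely for illustration; ManifoldCF itself is Java, and whether its web connector uses search-style or full-match regex semantics is an assumption here. The two sample URLs are hypothetical.

```python
import re

# Include patterns quoted in the post above.
INCLUDE_PATTERNS = [
    re.compile(r"http://foreninger.uio.no/.*\.pdf$"),
    re.compile(r"http://folk.uio.no/.*\.pdf$"),
]


def is_included(url: str) -> bool:
    """Return True if any include pattern is found in the URL."""
    return any(p.search(url) for p in INCLUDE_PATTERNS)


# Hypothetical URLs, for illustration only.
seed_page = "http://folk.uio.no/index.html"
linked_pdf = "http://folk.uio.no/someuser/thesis.pdf"

print(is_included(seed_page))   # the HTML page that would carry the links
print(is_included(linked_pdf))  # a PDF reachable only through that page
```

The PDF matches the include list, but the HTML page that links to it does not, so under this assumption the crawler never gets far enough to discover the PDF.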

Re: Excluding html files and following links

Posted by Erlend Garåsen <e....@usit.uio.no>.

Unfortunately, I have only created a local Jira ticket (our own Jira at 
the uni) where this problem is reported. Since we are in a hurry with 
our search project at the moment, the issue is still open.

The host with the large documents is now excluded from our crawl, mainly 
because it has been decided that we don't want to index it at all. The 
host contains private web pages, mostly published by students.

I will keep you informed when we have made a decision about what to do 
with large documents. I guess the new parameter will do the trick. 
Thanks for working on this issue!

Erlend

On 05.07.11 12.04, Karl Wright wrote:
> Have you had a look at the feature added, and does it work for you?
> I'd also still be interested in knowing where you are seeing
> out-of-memory situations.
>
> Karl



Re: Excluding html files and following links

Posted by Karl Wright <da...@gmail.com>.

Have you had a look at the feature added, and does it work for you?
I'd also still be interested in knowing where you are seeing
out-of-memory situations.

Karl

On Thu, Jun 23, 2011 at 8:03 AM, Karl Wright <da...@gmail.com> wrote:
> Hi Erlend,
>
> I hope you are not seeing memory issues on large files with ManifoldCF
> itself.  That should not happen, and if it does we need to figure out
> why.
>
> Solr memory issues, on the other hand, I can believe.  If that is the
> problem, then I agree we should try to do something about it.
> Since it is a Solr limitation, the right thing to do is probably to
> add a configuration parameter to the Solr connector that specifies
> the maximum size of a file the connection will accept.  Files larger
> than that should return a 400 if indexing is attempted, etc.
>
> Perhaps we should also consider adding a new method to the
> IOutputConnector interface that returns a maximum file size value, and
> expose that in IVersionActivity and IProcessActivity.  That would
> allow connectors to make output-based decisions as to whether they
> should fetch large files in the first place.
>
> Karl

Re: Excluding html files and following links

Posted by Karl Wright <da...@gmail.com>.

Hi Erlend,

I hope you are not seeing memory issues on large files with ManifoldCF
itself.  That should not happen, and if it does we need to figure out
why.

Solr memory issues, on the other hand, I can believe.  If that is the
problem, then I agree we should try to do something about it.
Since it is a Solr limitation, the right thing to do is probably to
add a configuration parameter to the Solr connector that specifies
the maximum size of a file the connection will accept.  Files larger
than that should return a 400 if indexing is attempted, etc.

Perhaps we should also consider adding a new method to the
IOutputConnector interface that returns a maximum file size value, and
expose that in IVersionActivity and IProcessActivity.  That would
allow connectors to make output-based decisions as to whether they
should fetch large files in the first place.

Karl
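
The idea of letting the output side advertise a size limit, so that a crawling connector can skip oversized files before downloading them, could be sketched roughly as follows. This is not ManifoldCF's API: the class, field, and function names are hypothetical, the 50 MB cap is an arbitrary example value, and Python is used only to illustrate the logic.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class OutputConnection:
    """Hypothetical stand-in for an output connection such as Solr."""

    max_document_bytes: Optional[int] = None  # None means no limit configured

    def accepts(self, length_bytes: int) -> bool:
        """Return True if a document of this size would be accepted."""
        return (
            self.max_document_bytes is None
            or length_bytes <= self.max_document_bytes
        )


def should_fetch(output: OutputConnection, content_length: Optional[int]) -> bool:
    """Decide before downloading whether a document is worth fetching.

    content_length is the size reported by the server, if any.
    """
    if content_length is None:
        return True  # size unknown: fetch and let the output side decide
    return output.accepts(content_length)


# Assumed example: an output connection capped at 50 MB.
solr_like = OutputConnection(max_document_bytes=50 * 1024 * 1024)

print(should_fetch(solr_like, 500 * 1024 * 1024))  # a 500 MB PDF
print(should_fetch(solr_like, 1024))               # a small document
```

Checking the limit before the fetch avoids transferring a file that the index would refuse anyway; the fallback for an unknown size keeps the check from silently dropping documents.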


On Thu, Jun 23, 2011 at 7:32 AM, Erlend Garåsen <e....@usit.uio.no> wrote:
>
> I will create a ticket today. Post filtering sounds like a good idea.
>
> Another thing. We are facing memory problems with huge documents. Maybe we
> should add another feature in order to cope with such documents, for instance
> skipping documents which exceed a preset size. We have discovered PDFs of 500
> MB. What do you think? Do we need such a feature as well?
>
> Erlend

Re: Excluding html files and following links

Posted by Erlend Garåsen <e....@usit.uio.no>.

I will create a ticket today. Post filtering sounds like a good idea.

Another thing. We are facing memory problems with huge documents. Maybe 
we should add another feature in order to cope with such documents, for 
instance skipping documents which exceed a preset size. We have discovered 
PDFs of 500 MB. What do you think? Do we need such a feature as well?

Erlend

On 23.06.11 12.08, Karl Wright wrote:
> Have there been any further developments on this thread?
> Karl



Re: Excluding html files and following links

Posted by Karl Wright <da...@gmail.com>.

Have there been any further developments on this thread?
Karl

On Tue, Jun 21, 2011 at 6:08 AM, Karl Wright <da...@gmail.com> wrote:
> Sure.  But you've already convinced me we need a new feature. ;-)
>
> Karl

Re: Excluding html files and following links

Posted by Karl Wright <da...@gmail.com>.

Sure.  But you've already convinced me we need a new feature. ;-)

Karl

On Tue, Jun 21, 2011 at 3:50 AM, Erlend Garåsen <e....@usit.uio.no> wrote:
>
> Sure, I can create a ticket. But first I want to discuss this issue with the
> two search consultants we have hired.
>
> I decided to post to the dev list in order to get some feedback on this
> issue.
>
> Erlend

Re: Excluding html files and following links

Posted by Erlend Garåsen <e....@usit.uio.no>.

Sure, I can create a ticket. But first I want to discuss this issue with 
the two search consultants we have hired.

I decided to post to the dev list in order to get some feedback on this 
issue.

Erlend

On 20.06.11 18.00, Karl Wright wrote:
> Hi Erlend,
>
> The inclusions and exclusions are based solely on URL, and block the
> connector from fetching the file.  Otherwise you would easily wind up
> fetching the entire web.
>
> However, this raises an interesting issue as to whether there's a way
> in the web connector to do what you are trying to do, which is to
> filter based on URL after links have been extracted.  The current
> inclusions/exclusions work fine for any URLs without links but do not
> allow for the case you are looking for.
>
> Can you create a ticket?  The suggestion would be to introduce
> post-extraction inclusions and exclusions into the connector.
>
> Karl



Re: Excluding html files and following links

Posted by Karl Wright <da...@gmail.com>.

Hi Erlend,

The inclusions and exclusions are based solely on URL, and block the
connector from fetching the file.  Otherwise you would easily wind up
fetching the entire web.

However, this raises an interesting issue as to whether there's a way
in the web connector to do what you are trying to do, which is to
filter based on URL after links have been extracted.  The current
inclusions/exclusions work fine for any URLs without links but do not
allow for the case you are looking for.

Can you create a ticket?  The suggestion would be to introduce
post-extraction inclusions and exclusions into the connector.

Karl
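
The difference between filtering before the fetch and filtering after link extraction can be illustrated with a toy crawler. This is only a sketch of one possible reading of the suggestion, not the web connector's implementation: the in-memory site, the URLs, and the function and parameter names are all hypothetical, and Python is used purely for illustration.

```python
import re
from collections import deque

# Hypothetical in-memory "web": URL -> list of outgoing links.
SITE = {
    "http://folk.uio.no/": [
        "http://folk.uio.no/a.html",
        "http://folk.uio.no/top.pdf",
    ],
    "http://folk.uio.no/a.html": ["http://folk.uio.no/docs/paper.pdf"],
    "http://folk.uio.no/top.pdf": [],
    "http://folk.uio.no/docs/paper.pdf": [],
}

PDF_ONLY = re.compile(r"http://folk.uio.no/.*\.pdf$")
SAME_HOST = re.compile(r"^http://folk\.uio\.no/")


def crawl(seed, fetch_filter, index_filter):
    """Breadth-first crawl over SITE.

    fetch_filter decides which URLs are fetched (and therefore have
    their links extracted); index_filter decides which fetched
    documents are handed on for indexing.
    """
    seen, queue, indexed = {seed}, deque([seed]), []
    while queue:
        url = queue.popleft()
        if not fetch_filter.search(url):
            continue  # blocked before fetching: its links are never seen
        if index_filter.search(url):
            indexed.append(url)
        for link in SITE.get(url, []):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return indexed


# One URL filter applied before fetching, as described in the message above.
single_filter = crawl("http://folk.uio.no/", PDF_ONLY, PDF_ONLY)

# Broad fetch filter plus a separate post-extraction filter.
two_stage = crawl("http://folk.uio.no/", SAME_HOST, PDF_ONLY)

print(single_filter)
print(two_stage)
```

With a single filter the seed page is rejected and nothing is found. With two stages the HTML pages are fetched for their links but only the PDFs are passed on, while the host-restricted fetch filter still keeps the crawl from wandering across the entire web.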

On Mon, Jun 20, 2011 at 10:53 AM, Erlend Garåsen
<e....@usit.uio.no> wrote:
>
> I just realized that if I exclude html files for a job, links in these files
> will not be followed. Is this a desirable behaviour? Should links be
> followed regardless of the exclude filter?
>
> I discovered this issue when I tried to crawl only PDFs and realized
> that the job ended without finding any documents at all. I think I had
> something like this in my include list:
> http://foreninger.uio.no/.*\.pdf$
> http://folk.uio.no/.*\.pdf$
>
> Erlend