Posted to user@manifoldcf.apache.org by Arcadius Ahouansou <ar...@menelic.com> on 2015/04/28 13:01:57 UTC

Content filtering/exclusion with MCF

Hello.

I am using MCF 2.0.2 for crawling the web and ingesting data into Solr.

MCF has ingested into Solr documents that returned HTTP errors, let's say
401, 403, or 404, or that have certain content like "this page has expired
and has been removed".

The question is:
is there a way to tell MCF to ingest
- only documents not containing certain content like "Not Found", or
- only documents, excluding those returned with status codes 401, 403, 404, 500, ...

Thank you very much.

Arcadius.

Re: Content filtering/exclusion with MCF

Posted by Arcadius Ahouansou <ar...@menelic.com>.
Thanks, Karl.
I will comment on the ticket (CONNECTORS-1193) ASAP.



-- 
Arcadius Ahouansou
Menelic Ltd | Information is Power
M: 07908761999
W: www.menelic.com
---

Re: Content filtering/exclusion with MCF

Posted by Karl Wright <da...@gmail.com>.
I've created a ticket to continue the discussion about whether we want such
a feature and, if so, what it should look like: CONNECTORS-1193.

Karl



Re: Content filtering/exclusion with MCF

Posted by Karl Wright <da...@gmail.com>.
Hi Arcadius,

The key question is, how big do you expect the dictionary to become?

The current algorithm for finding content matches, used to determine whether
a page is part of a login sequence, applies regexps on a line-by-line basis.
This is not ideal because there is no guarantee that the text will have
line breaks, so it might have to accumulate the entire document in
memory, which is obviously very bad.

Content matching is currently done within the confines of HTML; the HTML is
parsed and only the content portions are matched.  Tags are not checked.
If the Aho-Corasick algorithm is used, it would need to be done the same
way: one line at a time only.
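
For illustration only, here is a schematic sketch of that line-at-a-time
approach. This is not MCF's actual code; the class and pattern handling are
invented, but it shows why memory use is bounded by the longest line, and why
a document with no line breaks defeats the scheme:

import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.util.List;
import java.util.regex.Pattern;

public class LineByLineMatcher
{
  // Apply every pattern to each line of extracted content as it arrives.
  // Memory use is bounded by the longest line; a document with no line
  // breaks is effectively one huge line, which is the weakness noted above.
  public static boolean matchesAny(Reader extractedContent, List<Pattern> patterns)
    throws IOException
  {
    BufferedReader reader = new BufferedReader(extractedContent);
    String line;
    while ((line = reader.readLine()) != null)
    {
      for (Pattern pattern : patterns)
      {
        if (pattern.matcher(line).find())
          return true;  // first hit is enough to decide
      }
    }
    return false;
  }
}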

Karl




Re: Content filtering/exclusion with MCF

Posted by Arcadius Ahouansou <ar...@menelic.com>.
Hello Karl.

I agree, this would be slower than the usual filtering by URL or HTTP
header.

On the other hand, this would be a very useful feature:
it could be used to remove documents containing swear words from the index,
filter out adult content, discard emails flagged as spam, etc.

Regarding the implementation:
So far in MCF, regexes have been used for pattern matching.
In the case of content filtering, the user would supply a kind of
"dictionary" that we would use to determine whether the document goes
through or not.
The dictionary can grow quite a bit.

An alternative to regex may be the Aho–Corasick string-matching algorithm.
A Java implementation can be found at
https://github.com/robert-bor/aho-corasick
Let's say my dictionary has two entries, "expired" and "not found".
The algorithm will return "expired", "not found", or both, depending
on what it found in the document.
This output could be used to decide whether to index the document or not.

In this specific case, where we only want to exclude content from the
index, we could exit on the first match, i.e. there is no need to match the
whole dictionary.
There is a pull request for dealing with that:
https://github.com/robert-bor/aho-corasick/pull/14
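
As a minimal sketch, here is roughly how the dictionary matching could look
with that library. The builder-style API below is taken from the project's
README and may differ between versions; the class name and sample text are
made up:

import java.util.Collection;

import org.ahocorasick.trie.Emit;
import org.ahocorasick.trie.Trie;

public class DictionaryFilterSketch
{
  public static void main(String[] args)
  {
    // Dictionary with two entries, matched case-insensitively.
    Trie trie = Trie.builder()
      .ignoreCase()
      .addKeyword("expired")
      .addKeyword("not found")
      .build();

    String document = "The document you are looking for has expired";
    Collection<Emit> emits = trie.parseText(document);

    // Any hit means the document is excluded from the index. With the
    // early-exit change from the pull request above, parsing could stop
    // at the first hit instead of scanning the whole text.
    System.out.println(emits.isEmpty() ? "index" : "exclude");
  }
}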

Thanks.

Arcadius.



-- 
Arcadius Ahouansou
Menelic Ltd | Information is Power
M: 07908761999
W: www.menelic.com
---

Re: Content filtering/exclusion with MCF

Posted by Karl Wright <da...@gmail.com>.
Hi Arcadius,

A feature like this is possible but could be very slow, since there's no
definite limit on the size of an HTML page.

Karl



Re: Content filtering/exclusion with MCF

Posted by Arcadius Ahouansou <ar...@menelic.com>.
Hello Karl.

I have checked the Simple History and I could see deletions.

I have recently migrated my config to MCF 2.0.2 without migrating all
crawled data. That may be why I have documents in Solr that lead
to 404s.

Clearing my Solr index and resetting the crawler may help solve my problem.
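
For a one-off cleanup, the stale documents could also be deleted from Solr
directly with a SolrJ delete-by-query. This is only a sketch: the core URL,
the field name, and the query text below are assumptions, and the
HttpSolrClient class assumes a SolrJ 5.x-style API:

import org.apache.solr.client.solrj.impl.HttpSolrClient;

public class PurgeStaleDocs
{
  public static void main(String[] args) throws Exception
  {
    // Core URL and field name are illustrative only.
    HttpSolrClient solr = new HttpSolrClient("http://localhost:8983/solr/collection1");

    // Remove documents whose indexed content carries the "expired" message.
    solr.deleteByQuery("content:\"has expired\"");
    solr.commit();
    solr.close();
  }
}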

On the other hand, some of the pages I am crawling display friendly messages
such as "The document you are looking for has expired" with a 200 HTTP
status instead of a 404.
How feasible would it be to exclude documents from the index based on
their content?

Thank you very much.

Arcadius.





-- 
Arcadius Ahouansou
Menelic Ltd | Information is Power
M: 07908761999
W: www.menelic.com
---

Re: Content filtering/exclusion with MCF

Posted by Karl Wright <da...@gmail.com>.
Hi Arcadius,

So, to be clear, the repository connection you are using is a web
connection type?

The web connector has the following code, which should prevent indexing of
any content that was received with a response code other than 200:

      int responseCode = cache.getResponseCode(documentIdentifier);
      if (responseCode != 200)
      {
        if (Logging.connectors.isDebugEnabled())
          Logging.connectors.debug("Web: For document '"+documentIdentifier+"', not indexing because response code not indexable: "+responseCode);
        errorCode = "RESPONSECODENOTINDEXABLE";
        errorDesc = "HTTP response code not indexable ("+responseCode+")";
        activities.noDocument(documentIdentifier,versionString);
        return;
      }


You should indeed see these cases logged in the simple history and no
document sent to Solr.  Is this not what you are seeing?

Karl

