Posted to user@manifoldcf.apache.org by ritika jain <ri...@gmail.com> on 2020/05/07 10:27:28 UTC

Crawling / Indexation Query

Hi All,

Can anybody explain the following? If a URL was indexed and a noindex tag
was added afterwards, will that URL be deleted from the index when the
crawler visits it again?

Say a URL previously had a meta tag that allowed indexing and was present
in the Elasticsearch index, but then
<meta name="robots" content="nofollow, noindex">
was added to the page.

Should it be deleted from the index when the ManifoldCF job crawls that URL
again, or will the URL still be present in the index?

Thanks

Re: Crawling / Indexation Query

Posted by ritika jain <ri...@gmail.com>.
Many Thanks

On Thu, May 7, 2020 at 4:11 PM Karl Wright <da...@gmail.com> wrote:

> Hi,
>
> ManifoldCF is not a crawler, it's a synchronizer.  If robots says not to
> crawl something, then it will not be indexed.  If robots is changed to
> prohibit crawling of certain documents, then yes, those documents will be
> removed from the index.
>
> But you can override the robots behavior in the document specification or
> configuration, I believe.
>
> Karl

Re: Crawling / Indexation Query

Posted by Karl Wright <da...@gmail.com>.
We can't.  You need to follow the instructions and send email to the
appropriate address, listed here:

http://manifoldcf.apache.org/en_US/mail.html

Karl


On Sat, May 30, 2020 at 4:40 PM Shashank Saurabh <sh...@gmail.com>
wrote:

> Please unsubscribe me from your mailing list.
>
> Thanks,
> Shashank

Re: Crawling / Indexation Query

Posted by Shashank Saurabh <sh...@gmail.com>.
Please unsubscribe me from your mailing list.

Thanks,
Shashank


Re: Crawling / Indexation Query

Posted by Karl Wright <da...@gmail.com>.
Hi,

ManifoldCF is not a crawler, it's a synchronizer.  If robots says not to
crawl something, then it will not be indexed.  If robots is changed to
prohibit crawling of certain documents, then yes, those documents will be
removed from the index.

But you can override the robots behavior in the document specification or
configuration, I believe.

Karl
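
The synchronizer behavior described above can be illustrated with a minimal Python sketch. This is not ManifoldCF code or its API: the names RobotsMetaParser, robots_directives, sync_document and the honor_robots flag are hypothetical, and a plain dict stands in for the search index. It only shows the idea that a recrawl brings the index in line with the page's current robots directives, and that an override setting would skip that check.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collects directives from <meta name="robots" content="..."> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        # Attribute values may be None (e.g. <meta name>), so normalize them.
        attr = {name: (value or "") for name, value in attrs}
        if attr.get("name", "").strip().lower() != "robots":
            return
        for part in attr.get("content", "").split(","):
            part = part.strip().lower()
            if part:
                self.directives.add(part)


def robots_directives(html):
    """Return the set of robots meta directives found in the page."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return parser.directives


def sync_document(index, url, html, honor_robots=True):
    """Bring the index entry for url in line with the page's current state.

    `index` is a dict standing in for the search index (an assumption for
    illustration). `honor_robots=False` models overriding robots handling.
    """
    if honor_robots and "noindex" in robots_directives(html):
        # The page was possibly indexed earlier; a synchronizer removes it.
        index.pop(url, None)
        return "removed"
    index[url] = html
    return "indexed"
```

With this sketch, a page that is first fetched without the tag ends up in the dict, and a later fetch of the same URL with the noindex tag removes it again, unless honor_robots is set to False.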

