Posted to dev@manifoldcf.apache.org by Priya Arora <pr...@smartshore.nl> on 2019/09/02 08:47:07 UTC

Manifold CF-Non existent of URL

Hi,

I have a question about ManifoldCF. Does it have any functionality to check
whether a URL it is crawling actually exists, or whether it returns
"page not found" (404)?

I have a requirement to crawl data for a university, and the job runs
continuously. After some time, I found that certain URLs had been removed
from the university site but are still being indexed.

Some pages now return status 404.
How can ManifoldCF be set up to check this automatically, so that a URL
that returns 404 (no longer exists) is not indexed?

Thanks
Priya.

Re: Manifold CF-Non existent of URL

Posted by Karl Wright <da...@gmail.com>.
Yes, if ManifoldCF (MCF) receives a 404 response, it will delete the
document from the index.

Continuous crawling, though, means the document may not be retried for a
long time, because exponential backoff is used.

Karl
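
To illustrate why a removed page can linger in the index under continuous
crawling, here is a minimal sketch of exponential backoff for recheck
intervals. This is not ManifoldCF's scheduling code: the function name
`next_recheck_interval`, the one-hour base, the doubling factor, and the
30-day cap are all assumptions chosen for illustration.

```python
from datetime import timedelta


def next_recheck_interval(unchanged_checks,
                          base=timedelta(hours=1),
                          cap=timedelta(days=30)):
    """Return the wait before the next recheck of a document.

    `unchanged_checks` is the number of consecutive checks that found the
    document unchanged. The wait doubles after each such check and never
    exceeds `cap`. All values here are hypothetical, not ManifoldCF defaults.
    """
    if unchanged_checks < 0:
        raise ValueError("unchanged_checks must be non-negative")
    interval = base
    for _ in range(unchanged_checks):
        interval *= 2
        if interval >= cap:
            # Stop early so the interval cannot grow without bound.
            return cap
    return min(interval, cap)


if __name__ == "__main__":
    for n in (0, 1, 3, 10):
        print(n, next_recheck_interval(n))
```

With these assumed values, a document that has looked unchanged on ten
consecutive checks is already waiting the full 30 days, so a 404 on that URL
could go unnoticed for about a month.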

On Tue, Sep 3, 2019, 1:36 AM Priya Arora <pr...@smartshore.nl> wrote:

> Yes, it's a continuous job.

Re: Manifold CF-Non existent of URL

Posted by Priya Arora <pr...@smartshore.nl>.
Yes, it's a continuous job.

On Tue, Sep 3, 2019 at 11:05 AM Priya Arora <pr...@smartshore.nl> wrote:

> Hi,
> I have a job, myuniversity_intranet, which crawls data from an intranet
> site, and the data has been indexed in an index.
> My question is: does ManifoldCF have any functionality to test, before
> indexing, whether a URL exists?
> For example, my index (say, index name: abc) contains an indexed URL:
> https:myuniversity/reaserch/info (an intranet URL). This URL existed
> earlier but no longer exists, and the resulting status is 404.
>
> Can ManifoldCF check before indexing that the status is not 404 (that is,
> that the URL exists), and index the URL only if it really exists, skipping
> it otherwise?
> Can this be set up while configuring the ManifoldCF job, or do I have to
> handle it manually in code?
>
>
> Kind regards
> Priya

Re: Manifold CF-Non existent of URL

Posted by Priya Arora <pr...@smartshore.nl>.
Hi,
I have a job, myuniversity_intranet, which crawls data from an intranet
site, and the data has been indexed in an index.
My question is: does ManifoldCF have any functionality to test, before
indexing, whether a URL exists?
For example, my index (say, index name: abc) contains an indexed URL:
https:myuniversity/reaserch/info (an intranet URL). This URL existed
earlier but no longer exists, and the resulting status is 404.

Can ManifoldCF check before indexing that the status is not 404 (that is,
that the URL exists), and index the URL only if it really exists, skipping
it otherwise?
Can this be set up while configuring the ManifoldCF job, or do I have to
handle it manually in code?


Kind regards
Priya
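
For reference, a manual check of the kind asked about here could look
roughly like the sketch below. It runs outside ManifoldCF and uses only the
Python standard library; it is not a ManifoldCF API. The function names
`url_status` and `action_for_status`, the use of HEAD requests, and the
choice to treat 410 like 404 are assumptions made for illustration.

```python
import urllib.error
import urllib.request


def url_status(url, timeout=10.0):
    """Return the HTTP status code for `url`, or None if no response arrived."""
    # HEAD avoids downloading the body; some servers reject it, in which
    # case a GET request would be needed instead.
    request = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # urllib raises HTTPError for 4xx/5xx responses; keep the code.
        return err.code
    except (urllib.error.URLError, OSError):
        # DNS failure, refused connection, timeout: existence is unknown.
        return None


def action_for_status(status):
    """Map a status code to a hypothetical index action."""
    if status in (404, 410):
        return "delete"  # page no longer exists
    if status == 200:
        return "index"   # page exists
    return "retry"       # anything else (including None) is inconclusive


if __name__ == "__main__":
    # Placeholder URL for illustration only.
    status = url_status("https://example.com/")
    print(status, action_for_status(status))
```

Keeping "no response" separate from 404 matters: a temporary outage of the
intranet site should not cause documents to be dropped from the index.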

On Mon, Sep 2, 2019 at 8:19 PM Karl Wright <da...@gmail.com> wrote:

> Hi,
> You aren't giving me enough information to know why your job isn't
> rechecking URLs.  Please tell me how your job is configured, specifically
> whether it's continuous or not.  Thanks.
>
> Karl

Re: Manifold CF-Non existent of URL

Posted by Karl Wright <da...@gmail.com>.
Hi,
You aren't giving me enough information to know why your job isn't
rechecking URLs.  Please tell me how your job is configured, specifically
whether it's continuous or not.  Thanks.

Karl


On Mon, Sep 2, 2019 at 4:47 AM Priya Arora <pr...@smartshore.nl> wrote:

> Hi,
>
> I have a question about ManifoldCF. Does it have any functionality to check
> whether a URL it is crawling actually exists, or whether it returns
> "page not found" (404)?
>
> I have a requirement to crawl data for a university, and the job runs
> continuously. After some time, I found that certain URLs had been removed
> from the university site but are still being indexed.
>
> Some pages now return status 404.
> How can ManifoldCF be set up to check this automatically, so that a URL
> that returns 404 (no longer exists) is not indexed?
>
> Thanks
> Priya.
>