Posted to user@manifoldcf.apache.org by ritika jain <ri...@gmail.com> on 2022/03/25 13:17:11 UTC

Traffic issue to the server due to duplicate records

Hi All,

With the Web crawler connector we are crawling a public site as well as an
intranet site. From the client side, we are getting complaints about traffic
caused by duplicate requests for particular URLs:
"Some urls are retrieved 100s or 1000s of times and then often several
times in a second. We would expect that for each identical url a crawler
would send only 1 request to our server and that we would not see
duplicates."

For example, GET /infonet/medewerkers/dashboard/ is requested by the crawler
many times, approximately once every second.

[image: image.png]
There are many URLs like this that are requested many times from ManifoldCF.
After analysing some of the records/URLs from this document and checking them
via Postman (GET requests), it seems that they are redirected to other URLs.
And when we tried to ingest such a URL from a local system, ManifoldCF
reported "RESPONSECODENOTINDEXABLE".
So what can be the reason for so much traffic for a particular record? And if
it is due to redirection, why does ManifoldCF keep requesting the same record
many times? (Ideally it should follow the redirect once and then ingest the
redirect target.) What protocol is being used behind this?
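
For illustration, here is a minimal standalone sketch (the host is a
placeholder, not our real server) for checking whether such a URL answers
with a redirect; setInstanceFollowRedirects(false) surfaces the 3xx response
and its Location header instead of silently following it:

    import java.net.HttpURLConnection;
    import java.net.URL;

    public class RedirectCheck {
      public static void main(String[] args) throws Exception {
        // Placeholder host; substitute the real server.
        URL url = new URL("https://example.com/infonet/medewerkers/dashboard/");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        // Do not follow redirects so that a 3xx answer is visible directly.
        conn.setInstanceFollowRedirects(false);
        int code = conn.getResponseCode();
        System.out.println("Response code: " + code);
        if (code >= 300 && code < 400) {
          System.out.println("Location: " + conn.getHeaderField("Location"));
        }
        conn.disconnect();
      }
    }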

Any help on this would be appreciated!

Thanks
Ritika

Re: Traffic issue to the server due to duplicate records

Posted by Karl Wright <da...@gmail.com>.
Documents that are shared across jobs have special logic to prevent them
from being removed from the queue.  That may be why you are seeing this.  I
don't think we have any ability to manage such documents any better.

Karl

Re: Traffic issue to the server due to duplicate records

Posted by Priya Arora <pr...@smartshore.nl>.
Not really; the job runs and completes in approx 2 days. It crawls approx
600,000 (6 lakh) records via one job, and approx 1,000,000 (10 lakh) records
in total across 3 jobs. These jobs are scheduled so that one completes fully
before the next starts.


Re: Traffic issue to the server due to duplicate records

Posted by Karl Wright <da...@gmail.com>.
By your description, if the job runs in a short period of time, then it
will run again right after it finishes, which also would likely take a
short period of time.  On each job run the document in question would be
requeued and tried again and discarded. Does this sound like what you are
seeing?
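
As an illustration only (this is not ManifoldCF code), the cycle described
above would look roughly like this: a short job that restarts as soon as it
finishes re-seeds and re-fetches the same non-indexable document on every
run:

    public class RequeueSketch {
      public static void main(String[] args) throws InterruptedException {
        // Hypothetical document that always answers with a non-200 code.
        String doc = "/infonet/medewerkers/dashboard/";
        for (int run = 1; run <= 5; run++) {
          System.out.println("Job run " + run + ": seed and queue " + doc);
          System.out.println("  fetch -> non-200, noDocument(), discarded");
          // A short job restarts almost immediately, so fetches accumulate.
          Thread.sleep(1000);
        }
      }
    }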

Karl

Re: Traffic issue to the server due to duplicate records

Posted by ritika jain <ri...@gmail.com>.
> So for any given job run, the document will be removed entirely at the end
> of that job run.  If you run the same job repeatedly (e.g. it is configured
> to run within a time window but NOT to start only at the beginning of that
> time window), then every time the job runs the document will be reseeded
> and rescanned.
>
> Perhaps you did not want your job to run in this way?

Yes, I checked; the job is not running in this way. It completes once and
then runs again after completion. Is it possible that ManifoldCF internally
hits the URL/server many times, approx 1,000-10,000 times? I have observed
one URL being hit approx 7k times in a very short time.

Could it be that, before preparing the documentIdentifier[] array in the
processDocuments function, ManifoldCF hits the server many times, e.g.
separately for extract, fetch, and process? What could be the possible
reason behind this?
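
For reference, my understanding of the connector entry point is roughly as
follows (a schematic sketch based on the ManifoldCF 2.x repository connector
API, not the actual Web connector source); each identifier in the batch is
fetched once per job run inside processDocuments():

    import org.apache.manifoldcf.core.interfaces.*;
    import org.apache.manifoldcf.agents.interfaces.*;
    import org.apache.manifoldcf.crawler.interfaces.*;
    import org.apache.manifoldcf.crawler.connectors.BaseRepositoryConnector;

    public class SketchConnector extends BaseRepositoryConnector
    {
      @Override
      public void processDocuments(String[] documentIdentifiers,
        IExistingVersions statuses, Specification spec,
        IProcessActivity activities, int jobMode, boolean usesDefaultAuthority)
        throws ManifoldCFException, ServiceInterruption
      {
        for (String documentIdentifier : documentIdentifiers)
        {
          // One fetch per identifier per job run would happen here; repeated
          // traffic would have to come from re-seeding across runs or from
          // redirect chains, not from multiple fetches within one call.
        }
      }
    }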

Thanks
Ritika

Re: Traffic issue to the server due to duplicate records

Posted by Karl Wright <da...@gmail.com>.
RESPONSECODENOTINDEXABLE is found in the following Web crawler code:

      int responseCode = cache.getResponseCode(documentIdentifier);
      if (responseCode != 200)
      {
        if (Logging.connectors.isDebugEnabled())
          Logging.connectors.debug("Web: For document '"+documentIdentifier+
            "', not indexing because response code not indexable: "+responseCode);
        errorCode = "RESPONSECODENOTINDEXABLE";
        errorDesc = "HTTP response code not indexable ("+responseCode+")";
        activities.noDocument(documentIdentifier,versionString);
        return;
      }

activities.noDocument() is implemented as follows:

    /** Remove the specified document from the search engine index, and update
    * the recorded version information for the document.
    *@param documentIdentifier is the document's local identifier.
    *@param componentIdentifier is the component document identifier, if any.
    *@param version is the version string to be recorded for the document.
    */
    @Override
    public void noDocument(String documentIdentifier,
      String componentIdentifier,
      String version)
      throws ManifoldCFException, ServiceInterruption
    {
      // Special interpretation for empty version string; treat as if the
      // document doesn't exist (by ignoring it and allowing it to be deleted later)
      String documentIdentifierHash = ManifoldCF.hash(documentIdentifier);
      String componentIdentifierHash = computeComponentIDHash(componentIdentifier);

      checkMultipleDispositions(documentIdentifier,componentIdentifier,componentIdentifierHash);

      ingester.documentNoData(
        computePipelineSpecificationWithVersions(documentIdentifierHash,componentIdentifierHash,documentIdentifier),
        connectionName,documentIdentifierHash,componentIdentifierHash,
        version,
        connection.getACLAuthority(),
        currentTime,
        ingestLogger);

      touchedSet.add(documentIdentifier);
      touchComponentSet(documentIdentifier,componentIdentifierHash);
    }

So for any given job run, the document will be removed entirely at the end
of that job run.  If you run the same job repeatedly (e.g. it is configured
to run within a time window but NOT to start only at the beginning of that
time window), then every time the job runs the document will be reseeded
and rescanned.

Perhaps you did not want your job to run in this way?

Karl