Posted to user@manifoldcf.apache.org by Cheng Zeng <ze...@hotmail.co.uk> on 2018/11/12 09:06:26 UTC

Job stuck - WorkerThread functions return null

Hi Karl,

I am developing my own repository connector, borrowing some code from the file repository connector, and I use it to crawl documents from an IBM Domino system. I managed to retrieve all the files in Domino. However, when I restart my job to recrawl the Domino database, previousDocuments.get(documentIdentifierHash) in WorkerThread.java (org.apache.manifoldcf.crawler.system) returns null for some of the document IDs. As a result, the job gets stuck on those document IDs.

Could you please tell me how I could fix the problem?

protected IPipelineSpecificationWithVersions computePipelineSpecificationWithVersions(String documentIdentifierHash,
  String componentIdentifierHash,
  String documentIdentifier)
{
  QueuedDocument qd = previousDocuments.get(documentIdentifierHash);  // Returns null; the problem is here.
  if (qd == null)
    throw new IllegalArgumentException("Unrecognized document identifier: '"+documentIdentifier+"'");
  return new PipelineSpecificationWithVersions(pipelineSpecification,qd,componentIdentifierHash);
}


Thanks a lot.

Cheng

Re: Job stuck - WorkerThread functions return null

Posted by Karl Wright <da...@gmail.com>.
Hi Cheng,

Unless you are using carrydown information (that is, information that is
recorded for a parent document that the child document needs access to),
this is the method you want to use:

activities.addDocumentReference(documentIdentifier);

If you DO need to pull data recorded for a parent from the child, the best
connector to look at for an example is the SharePoint connector.
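
To illustrate the choice above, here is a minimal sketch of queuing discovered documents with the one-argument form. The ReferenceSink interface, RecordingSink, ReferenceSketch, and the UNID1/UNID2 URLs are hypothetical stand-ins so the example runs without ManifoldCF; only the two addDocumentReference signatures are taken from the overloads listed in this thread.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class ReferenceSketch {
  // No carrydown data is needed, so each discovered document is queued
  // with the one-argument form and no parent identifier is supplied.
  static void queueChildren(ReferenceSink activities, List<String> childUrls) {
    for (String childUrl : childUrls) {
      activities.addDocumentReference(childUrl);
    }
  }

  public static void main(String[] args) {
    RecordingSink sink = new RecordingSink();
    // Hypothetical document URLs in the style used in this thread.
    queueChildren(sink, Arrays.asList(
        "http://domino_server:80/path/dep1/database_name.nsf/api/data/documents/unid/UNID1",
        "http://domino_server:80/path/dep1/database_name.nsf/api/data/documents/unid/UNID2"));
    for (String call : sink.calls) {
      System.out.println(call);
    }
  }
}

// Hypothetical stand-in for two of the overloads listed in this thread;
// this is NOT the real ManifoldCF interface.
interface ReferenceSink {
  void addDocumentReference(String documentIdentifier);
  void addDocumentReference(String documentIdentifier, String parentIdentifier, String relationshipType);
}

// Records each call so the sketch can run without a crawler.
class RecordingSink implements ReferenceSink {
  final List<String> calls = new ArrayList<>();

  @Override
  public void addDocumentReference(String documentIdentifier) {
    calls.add("ref:" + documentIdentifier);
  }

  @Override
  public void addDocumentReference(String documentIdentifier, String parentIdentifier, String relationshipType) {
    calls.add("ref:" + documentIdentifier + " parent:" + parentIdentifier + " type:" + relationshipType);
  }
}
```

Because no parent identifier is passed in this form, there is no parent value that could fail to match a queued document.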

As far as the stack trace is concerned -- these always get written to the
log.  The reason the framework "hangs" is that the exception is a fatal
one and basically causes the thread to restart itself, and thus nothing
progresses under those conditions.  Very probably the cause of this
exception is that you are including a 'parent identifier' which is not
actually a document identifier that was itself added.
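
The "hang" described above can be pictured with a small simulation. This is only a sketch under stated assumptions: RestartSketch, its process method, the "bad:" prefix, and the restart limit are all hypothetical, and it assumes the failed document is picked up again after each restart, which is one reading of "nothing progresses".

```java
import java.util.Arrays;
import java.util.List;

class RestartSketch {
  // Hypothetical stand-in for processing one queued document.
  // Identifiers starting with "bad:" model the unrecognized ones.
  static void process(String documentIdentifier) {
    if (documentIdentifier.startsWith("bad:")) {
      throw new IllegalArgumentException(
          "Unrecognized document identifier: '" + documentIdentifier + "'");
    }
  }

  // Simulates a worker that restarts after a fatal exception and then
  // picks up the same document again. Returns how many documents were
  // completed before the restart limit was exceeded.
  static int run(List<String> queue, int maxRestarts) {
    int completed = 0;
    int restarts = 0;
    int next = 0;
    while (next < queue.size() && restarts <= maxRestarts) {
      try {
        process(queue.get(next));
        completed++;
        next++;
      } catch (IllegalArgumentException e) {
        // The failing document is never marked done, so it is retried.
        restarts++;
      }
    }
    return completed;
  }

  public static void main(String[] args) {
    List<String> queue = Arrays.asList("ok:1", "bad:2", "ok:3");
    System.out.println("completed = " + run(queue, 5));
  }
}
```

In this model the documents queued behind the failing one are never reached, which matches the symptom of a job that appears stuck on one document ID.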

Karl


On Wed, Nov 14, 2018 at 2:16 AM Cheng Zeng <ze...@hotmail.co.uk> wrote:

>
> Hi Karl,
>
> Thanks a lot for your reply. I haven't changed any framework code; I have
> only written my own repository connector.
>
> I found that there are five methods available to inject document
> identifiers. Could you please tell me how I should choose the right one?
>  activities.addDocumentReference(documentIdentifier);
>  activities.addDocumentReference(documentIdentifier, parentIdentifier,
> relationshipType);
>  activities.addDocumentReference(documentIdentifier, parentIdentifier,
> relationshipType, dataNames, dataValues);
>  activities.addDocumentReference(documentIdentifier, parentIdentifier,
> relationshipType, dataNames, dataValues, originationTime);
>  activities.addDocumentReference(documentIdentifier, parentIdentifier,
> relationshipType, dataNames, dataValues, originationTime, prereqEventNames);
>
> This is how I inject document identifiers:
>
> activities.addDocumentReference(docUri,documentIdentifier,RELATIONSHIP_CHILD);
>
> docUri is the URL of the document to be fetched, e.g.
> http://domino_server:80/path/dep1/database_name.nsf/api/data/documents
> documentIdentifier is the parent URL, e.g.
> http://domino_server:80/path/dep1/database_name.nsf/api/data/documents/unid/B0F9484E94DEA3204825813E001034E1
>
> I am afraid that no full stack trace is thrown. All I get is
>
> new IllegalArgumentException("Unrecognized document identifier: '"+documentIdentifier+"'");
>
> from the following code in WorkerThread.java (org.apache.manifoldcf.crawler.system).
> I found the document identifier in the "jobqueue" table, and the dochash
> stored there matches the hash code generated by the hash method.
>
> For some document identifiers,
> previousDocuments.get(documentIdentifierHash) returns the queued
> document, but for several others it returns null.
>
> Could you please give me some pointers?
>
> protected IPipelineSpecificationWithVersions computePipelineSpecificationWithVersions(String documentIdentifierHash,
>   String componentIdentifierHash,
>   String documentIdentifier)
> {
>   QueuedDocument qd = previousDocuments.get(documentIdentifierHash);  // Returns null; the problem is here.
>   if (qd == null)
>     throw new IllegalArgumentException("Unrecognized document identifier: '"+documentIdentifier+"'");
>   return new PipelineSpecificationWithVersions(pipelineSpecification,qd,componentIdentifierHash);
> }
>
> Best wishes,
>
> Cheng

Re: Job stuck - WorkerThread functions return null

Posted by Cheng Zeng <ze...@hotmail.co.uk>.
Hi Karl,

Thanks a lot for your reply. I haven't changed any framework code; I have only written my own repository connector.

I found that there are five methods available to inject document identifiers. Could you please tell me how I should choose the right one?
 activities.addDocumentReference(documentIdentifier);
 activities.addDocumentReference(documentIdentifier, parentIdentifier, relationshipType);
 activities.addDocumentReference(documentIdentifier, parentIdentifier, relationshipType, dataNames, dataValues);
 activities.addDocumentReference(documentIdentifier, parentIdentifier, relationshipType, dataNames, dataValues, originationTime);
 activities.addDocumentReference(documentIdentifier, parentIdentifier, relationshipType, dataNames, dataValues, originationTime, prereqEventNames);

This is how I inject document identifiers:

activities.addDocumentReference(docUri,documentIdentifier,RELATIONSHIP_CHILD);

docUri is the URL of the document to be fetched, e.g. http://domino_server:80/path/dep1/database_name.nsf/api/data/documents
documentIdentifier is the parent URL, e.g. http://domino_server:80/path/dep1/database_name.nsf/api/data/documents/unid/B0F9484E94DEA3204825813E001034E1

I am afraid that no full stack trace is thrown. All I get is

new IllegalArgumentException("Unrecognized document identifier: '"+documentIdentifier+"'");

from the following code in WorkerThread.java (org.apache.manifoldcf.crawler.system). I found the document identifier in the "jobqueue" table, and the dochash stored there matches the hash code generated by the hash method.

For some document identifiers, previousDocuments.get(documentIdentifierHash) returns the queued document, but for several others it returns null.

Could you please give me some pointers?

protected IPipelineSpecificationWithVersions computePipelineSpecificationWithVersions(String documentIdentifierHash,
  String componentIdentifierHash,
  String documentIdentifier)
{
  QueuedDocument qd = previousDocuments.get(documentIdentifierHash);  // Returns null; the problem is here.
  if (qd == null)
    throw new IllegalArgumentException("Unrecognized document identifier: '"+documentIdentifier+"'");
  return new PipelineSpecificationWithVersions(pipelineSpecification,qd,componentIdentifierHash);
}
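
The lookup in the method above can be reproduced in isolation. The sketch below is not framework code: LookupSketch, buildBatch, the SHA-1 hex hash, and the doc-A/doc-B/doc-C identifiers are hypothetical, and it assumes the map only holds documents from the batch currently being processed. Under that assumption, an identifier can exist elsewhere (for example in "jobqueue") and still be missing from the map.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HashMap;
import java.util.Map;

class LookupSketch {
  // Hypothetical stand-in for the framework's hash method (SHA-1 as hex).
  static String hash(String documentIdentifier) {
    try {
      MessageDigest md = MessageDigest.getInstance("SHA-1");
      byte[] digest = md.digest(documentIdentifier.getBytes(StandardCharsets.UTF_8));
      StringBuilder sb = new StringBuilder();
      for (byte b : digest) {
        sb.append(String.format("%02x", b));
      }
      return sb.toString();
    } catch (NoSuchAlgorithmException e) {
      throw new IllegalStateException(e);
    }
  }

  // Builds a map keyed by hash for the identifiers in the current batch.
  static Map<String, String> buildBatch(String... documentIdentifiers) {
    Map<String, String> previousDocuments = new HashMap<>();
    for (String id : documentIdentifiers) {
      previousDocuments.put(hash(id), id);
    }
    return previousDocuments;
  }

  // Mirrors the null check quoted in this thread.
  static String lookup(Map<String, String> previousDocuments, String documentIdentifier) {
    String qd = previousDocuments.get(hash(documentIdentifier));
    if (qd == null) {
      throw new IllegalArgumentException(
          "Unrecognized document identifier: '" + documentIdentifier + "'");
    }
    return qd;
  }

  public static void main(String[] args) {
    Map<String, String> batch = buildBatch("doc-A", "doc-B");
    System.out.println(lookup(batch, "doc-A"));
    try {
      lookup(batch, "doc-C"); // not part of this batch
    } catch (IllegalArgumentException e) {
      System.out.println(e.getMessage());
    }
  }
}
```

If this model holds, a useful check in the connector is whether every identifier it hands back to the framework is one it was actually given for that batch.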

Best wishes,

Cheng




________________________________
From: Karl Wright <da...@gmail.com>
Sent: 12 November 2018 18:46
To: user@manifoldcf.apache.org
Subject: Re: Job stuck - WorkerThread functions return null

Hi,
Have you been modifying the framework code?  If so, I really cannot help you.

If you haven't -- it looks like you've got code that is injecting document identifiers that are incorrect.  But I will need to see a full stack trace to be sure of that.

Thanks,
Karl



Re: Job stuck - WorkerThread functions return null

Posted by Karl Wright <da...@gmail.com>.
Hi,
Have you been modifying the framework code?  If so, I really cannot help
you.

If you haven't -- it looks like you've got code that is injecting document
identifiers that are incorrect.  But I will need to see a full stack trace
to be sure of that.

Thanks,
Karl

