
Posted to user@manifoldcf.apache.org by ritika jain <ri...@gmail.com> on 2021/12/06 10:08:43 UTC

Manifold one doc access problem

Hi All,

I have been running a Windows Shares job to crawl and index a large number
of documents, approximately 70k. The job ran fine, but at the end it got
stuck accessing/extracting one particular document. Simple History shows
an extract event for that same document multiple times.

I also tried ingesting only that single file. Even then the job got stuck
and could not progress any further. The Status and Job Management tab shows
the job as running, with only 1 active document left.

Also, the content length reported was 15 MB, while the file is 14 MB.
[image: image.png]

What could be the reason for this? How can it be corrected?

Thanks
Ritika

Re: Manifold one doc access problem

Posted by Karl Wright <da...@gmail.com>.

It seems clear that document extraction is failing, possibly due to a
corrupt document. Tika is throwing an exception, but it is not clear whether
MCF can reliably detect corrupt-document exceptions like this. It would need
to do that in order to retry less, or not at all.
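
The idea of "retry less or not at all" can be illustrated with a minimal
sketch. Everything below is hypothetical: the class, the exception type and
the method names are invented for illustration and are not ManifoldCF or Tika
API. It only assumes that a corrupt-document failure can be told apart from
other failures, which is exactly the open question raised above.

```java
import java.io.IOException;
import java.util.concurrent.Callable;

/** Hypothetical sketch; none of these names come from ManifoldCF or Tika. */
final class ExtractionRetrySketch {

    /** Stand-in for a failure that repeats on every attempt (e.g. a corrupt file). */
    static final class CorruptDocumentException extends Exception {
        CorruptDocumentException(String message) {
            super(message);
        }
    }

    enum Outcome { EXTRACTED, SKIPPED_CORRUPT, GAVE_UP }

    /** Runs the extraction, retrying only failures that might be transient. */
    static Outcome extractWithRetries(Callable<String> extraction, int maxAttempts) {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                extraction.call();
                return Outcome.EXTRACTED;
            } catch (CorruptDocumentException e) {
                // Permanent: a retry would re-read the same bad bytes, so skip now.
                return Outcome.SKIPPED_CORRUPT;
            } catch (Exception e) {
                // Possibly transient (share unavailable, timeout): loop and retry,
                // but never more than maxAttempts times.
            }
        }
        return Outcome.GAVE_UP;
    }

    public static void main(String[] args) {
        // A corrupt document is skipped after a single attempt.
        System.out.println(extractWithRetries(() -> {
            throw new CorruptDocumentException("record length exceeds limit");
        }, 5));
        // A transient-looking failure is retried, then abandoned at the cap.
        System.out.println(extractWithRetries(() -> {
            throw new IOException("share unavailable");
        }, 3));
    }
}
```

With a bounded attempt count the job would finish even when the failure
cannot be classified, instead of leaving one active document in the queue
indefinitely.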


On Wed, Dec 8, 2021, 7:36 AM ritika jain <ri...@gmail.com> wrote:

> Using the Output connector as Elastic.
> Yes simple history shows Job start and then extract only, No indexing
> attempt is been showed.

Re: Manifold one doc access problem

Posted by ritika jain <ri...@gmail.com>.

I am using Elastic as the output connector.
Yes, Simple History shows only the job start and then extract events; no
indexing attempt is shown.

The logs look like this:
[Worker thread '13'] WARN org.apache.tika.parser.microsoft.AbstractPOIFSExtractor - Ignoring unexpected exception while parsing summary entry DocumentSummaryInformation
org.apache.poi.util.RecordFormatException: Tried to allocate an array of length 140147, but 100000 is the maximum for this record type.
If the file is not corrupt, please open an issue on bugzilla to request increasing the maximum allowable size for this record type.
As a temporary workaround, consider setting a higher override value with IOUtils.setByteArrayMaxOverride()
        at org.apache.poi.util.IOUtils.throwRFE(IOUtils.java:568)
        at org.apache.poi.util.IOUtils.checkLength(IOUtils.java:175)
        at org.apache.poi.util.IOUtils.safelyAllocate(IOUtils.java:547)
        at org.apache.poi.hpsf.VariantSupport.read(VariantSupport.java:257)
        at org.apache.poi.hpsf.Property.<init>(Property.java:179)
        at org.apache.poi.hpsf.Section.<init>(Section.java:241)
        at org.apache.poi.hpsf.PropertySet.init(PropertySet.java:497)
        at org.apache.poi.hpsf.PropertySet.<init>(PropertySet.java:195)
        at org.apache.tika.parser.microsoft.SummaryExtractor.parseSummaryEntryIfExists(SummaryExtractor.java:83)
        at org.apache.tika.parser.microsoft.SummaryExtractor.parseSummaries(SummaryExtractor.java:74)
        at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:155)
        at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:131)
        at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
        at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
        at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:143)
        at org.apache.manifoldcf.agents.transformation.tika.TikaParser.parse(TikaParser.java:74)
        at org.apache.manifoldcf.agents.transformation.tika.TikaExtractor.addOrReplaceDocumentWithException(TikaExtractor.java:235)
        at org.apache.manifoldcf.agents.incrementalingest.IncrementalIngester$PipelineAddEntryPoint.addOrReplaceDocumentWithException(IncrementalIngester.java:3214)
        at org.apache.manifoldcf.agents.incrementalingest.IncrementalIngester$PipelineAddFanout.sendDocument(IncrementalIngester.java:3065)
        at org.apache.manifoldcf.agents.incrementalingest.IncrementalIngester$PipelineObjectWithVersions.addOrReplaceDocumentWithException(IncrementalIngester.java:2696)
        at org.apache.manifoldcf.agents.incrementalingest.IncrementalIngester.documentIngest(IncrementalIngester.java:750)
        at org.apache.manifoldcf.crawler.system.WorkerThread$ProcessActivity.ingestDocumentWithException(WorkerThread.java:1582)
        at org.apache.manifoldcf.crawler.system.WorkerThread$ProcessActivity.ingestDocumentWithException(WorkerThread.java:1547)
        at org.apache.manifoldcf.crawler.connectors.sharedrive.SharedDriveConnector.processDocuments(SharedDriveConnector.java:959)
        at org.apache.manifoldcf.crawler.system.WorkerThread.run(WorkerThread.java:399)
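
To make the message in this log easier to read: the parser was asked to
allocate 140147 bytes for one record, the limit for that record type is
100000, and the message says a higher override value can lift the limit. The
sketch below is a simplified model of that kind of guard, written only to
show the logic. It is not Apache POI's code, and the class and method names
are hypothetical.

```java
/** Simplified, hypothetical model of a guarded allocation with an override. */
final class AllocationGuardSketch {

    // -1 means "no override set"; the per-record default limit applies.
    private static int maxOverride = -1;

    /** Hypothetical analogue of raising the limit globally. */
    static void setMaxOverride(int value) {
        maxOverride = value;
    }

    /** Allocates the array only if the requested length is within the limit. */
    static byte[] safelyAllocate(long length, int defaultMax) {
        int limit = (maxOverride > 0) ? maxOverride : defaultMax;
        if (length < 0 || length > limit) {
            throw new IllegalStateException("Tried to allocate an array of length "
                    + length + ", but " + limit + " is the maximum for this record type.");
        }
        return new byte[(int) length];
    }

    public static void main(String[] args) {
        try {
            // Same numbers as in the log: 140147 requested, 100000 allowed.
            safelyAllocate(140147, 100000);
        } catch (IllegalStateException e) {
            System.out.println("rejected: " + e.getMessage());
        }
        // With a higher override the same request is accepted.
        setMaxOverride(200000);
        System.out.println("allocated " + safelyAllocate(140147, 100000).length + " bytes");
    }
}
```

In a real deployment the corresponding call would be the
IOUtils.setByteArrayMaxOverride() named in the log. Where it would have to be
invoked inside ManifoldCF, and whether raising the limit is safe for a file
that may be corrupt, is not something this thread settles.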



On Tue, Dec 7, 2021 at 5:09 PM Karl Wright <da...@gmail.com> wrote:

> Can you look at the manifoldcf log?
>
> Retries are not done unless something failed in the pipeline.  Failures
> should be logged.
> I am also surprised that indexing attempts are NOT logged.  What output
> connector are you using?
>
> Karl

Re: Manifold one doc access problem

Posted by Karl Wright <da...@gmail.com>.

Can you look at the manifoldcf log?

Retries are not done unless something failed in the pipeline.  Failures
should be logged.
I am also surprised that indexing attempts are NOT logged.  What output
connector are you using?

Karl


On Tue, Dec 7, 2021 at 2:25 AM ritika jain <ri...@gmail.com> wrote:

> Simple History shows continuous retries of extract only.
> [image: image.png]

Re: Manifold one doc access problem

Posted by ritika jain <ri...@gmail.com>.

Simple History shows continuous retries of extract only.
[image: image.png]

On Mon, Dec 6, 2021 at 7:08 PM Karl Wright <da...@gmail.com> wrote:

> The reason for this is that it must be getting an error on indexing and
> causing a retry because of that.
> Can you look up this document in the history and see all the events that
> have taken place with it?
>
> Karl

Re: Manifold one doc access problem

Posted by Karl Wright <da...@gmail.com>.

The likely reason is that the document is hitting an error during indexing,
and that error is causing a retry.
Can you look up this document in the history and see all the events that
have taken place with it?

Karl

