Posted to dev@manifoldcf.apache.org by Frédéric Olier <FO...@wooxo.fr> on 2015/10/21 16:29:05 UTC

[Solr] Error on documents makes ManifoldCF

Hi,

We integrated Solr with ManifoldCF.
We configured Solr to use the OCR engine.

When we crawl documents, MCF reads them fine and submits them to Solr.

On large files (PDFs, images), the OCR sometimes takes too long, which causes the MCF request to fail.

The annoying thing is that MCF does not skip the file.
On the next crawl, the same file fails again.

How can I tell ManifoldCF to skip a file that fails?

Thanks for your reply.

Frédéric OLIER | Head of Strategic Planning

Re: [Solr] Error on documents makes ManifoldCF

Posted by Ahmet Arslan <io...@yahoo.com.INVALID>.
Hi Olier,

I think what you want is something like onError="skip", which is defined for entity processors:

https://issues.apache.org/jira/browse/SOLR-7076
Does the ExtractingRequestHandler have a similar config parameter?

Ahmet
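
For reference, the parameter that turns up later in this thread is
ignoreTikaException. The Python sketch below only shows how such a parameter
could be attached to a request to the extract handler. The host name, the
document id and the helper name build_extract_url are hypothetical, and
passing the flag as a request parameter (instead of setting it in the handler
defaults in solrconfig.xml) is an assumption to check against your Solr
version.

```python
from urllib.parse import urlencode


def build_extract_url(base_url, handler_path, doc_id, ignore_tika_exception=True):
    """Build a URL for Solr's extract handler (hypothetical helper).

    ignoreTikaException is the parameter named later in this thread.
    Sending it as a request parameter is an assumption, not something
    the thread confirms.
    """
    params = {
        "literal.id": doc_id,
        "ignoreTikaException": "true" if ignore_tika_exception else "false",
    }
    # urlencode percent-encodes the document id so it is safe in a query string.
    return base_url.rstrip("/") + handler_path + "?" + urlencode(params)


url = build_extract_url(
    "http://localhost:8983",                    # hypothetical host
    "/datafari-solr/FileShare/update/extract",  # handler path seen in the Solr log in this thread
    "file:///share/report.docx",                # hypothetical document id
)
print(url)
```

With the flag set, a Tika parsing failure should no longer surface as an
error for that document. An OutOfMemoryError on the Solr side is a different
kind of failure and is not covered by this flag.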



On Wednesday, October 21, 2015 5:46 PM, Karl Wright <da...@gmail.com> wrote:



Hi Frédéric,

There's a flag in the Solr configuration you can set that will cause
exceptions from Solr Cell (Tika) to cause the document to be skipped rather
than causing ManifoldCF to retry the document.  I don't remember what it is
but others have noted it and you can search the mail archive to find it.

Thanks,
Karl



Re: [Solr] Error on documents makes ManifoldCF

Posted by Karl Wright <da...@gmail.com>.
Hi Frédéric,

What is likely happening is that you have more than one very slow document,
and these are all piling up so that all worker threads are busy with them
until one of them times out, and then that document gets tried again.  You
can confirm that picture using the Document Status Report and/or Simple
History Report.

If you can determine why they are so slow, you may be able to take action
to prevent them from being processed in the first place, probably by
limiting the maximum size of the document indexed, or excluding certain
document types, etc.  Can I ask what repository connection you are using?
It is possible that the timeouts are in fact coming from the repository
side, not the Solr side.

Karl
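
A minimal sketch of the kind of pre-filter described above, written in Python
purely for illustration. In ManifoldCF itself this would be done through
configuration rather than code; the function name should_index, the 10 MB
limit and the extension list are all assumptions.

```python
def should_index(path, size_bytes,
                 max_bytes=10 * 1024 * 1024,
                 excluded_exts=(".tar.gz", ".tgz", ".zip")):
    """Return True if a document should be sent for indexing.

    Hypothetical rule: skip excluded document types first, then skip
    anything larger than max_bytes.
    """
    lower = path.lower()
    # Exclude by type: nested archives can be expensive even when small.
    if any(lower.endswith(ext) for ext in excluded_exts):
        return False
    # Exclude by size.
    return size_bytes <= max_bytes


candidates = [
    ("report.docx", 200 * 1024),
    ("backup.tar.gz", 7 * 1024 * 1024),
    ("scan.pdf", 50 * 1024 * 1024),
]
to_index = [name for name, size in candidates if should_index(name, size)]
print(to_index)
```

Size alone may not be enough: the archive reported elsewhere in this thread
was apparently only about 7 MB but contained 100+ nested tar.gz files, which
is why the sketch also filters by type.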



RE: [Solr] Error on documents makes ManifoldCF

Posted by Frédéric Olier <FO...@wooxo.fr>.
Hi Karl,

I managed to get around my out-of-memory issue with Solr by tweaking the Solr configuration.

Now I have documents that can take ages to be indexed by Solr.
I set what I think is a reasonable value for the socket timeout of the Solr connector (1200 seconds).

Even so, I still get timeouts.
When a timeout occurs, the MCF crawl stops.
If I restart it, the file that timed out is sent for indexing again, and so on.

What is your recommendation in this situation?

Many thanks,



Re: [Solr] Error on documents makes ManifoldCF

Posted by Karl Wright <da...@gmail.com>.
Hi Fred,

When a Java process runs out of memory in one thread, *all* threads are
likely impacted.  That's why if you are seeing memory issues you really
just have to fix them; you can't just ignore the exception and hope for the
best.

Karl



RE: [Solr] Error on documents makes ManifoldCF

Posted by Frédéric Olier <FO...@wooxo.fr>.
Hi Karl,

Indeed, I have this in my logs:

MCF: 

Exception tossed: Repeated service interruptions - failure processing document: Read timed out


Solr:

Error for /datafari-solr/FileShare/update/extract
java.lang.OutOfMemoryError: Java heap space


The file is not that big (7 MB).

Although ignoring the file might not be the 'nicest' solution, is that possible?

I'll investigate on the Solr/Tika side to see whether I can deactivate the recursive parsing of archive files.

Thanks anyway,
Fred.




Re: [Solr] Error on documents makes ManifoldCF

Posted by Karl Wright <da...@gmail.com>.
Hi Fred,

I suspect that you are getting an out-of-memory or out-of-disk error on the
Solr side.  That's really bad and you don't just want to make ManifoldCF
ignore it.

What you can do is limit the maximum size of files sent to Solr.  That's a far
better fix.

Karl


On Thu, Oct 22, 2015 at 12:07 PM, Frédéric Olier <FO...@wooxo.fr> wrote:

> Hi,
>
> I managed to progress on my issues.
>
> The document (docx) is now skipped as expected when it fails.
>
> However, I now have another issue.
> I have a tar.gz file that itself contains 100+ tar.gz files.
>
> ManifoldCF gets a 500 error from Solr, which causes the crawl to abort.
> I looked at the Solr configuration and, due to the hardware used, I won't be
> able to tune the JVM and so on any further.
>
> Therefore I'd like to know whether ManifoldCF can be configured to skip
> files for which it gets such an error instead of aborting?
>
> Fred.
>
>
> -----Original Message-----
> From: Frédéric Olier [mailto:FOlier@wooxo.fr]
> Sent: Wednesday, October 21, 2015 17:51
> To: dev@manifoldcf.apache.org
> Subject: RE: [Solr] Error on documents makes ManifoldCF
>
> Hi Karl,
>
> Many thanks.
>
> I found the configuration to use here:
>
> http://www.francelabs.com/blog/tutorial-for-combining-manifoldcf-and-solr-for-files-search/
>
> Search for "ignoreTikaException".
>
> I'll test it and see if it fixes my issue.
>
> Fred
>
>
> -----Original Message-----
> From: Karl Wright [mailto:daddywri@gmail.com]
> Sent: Wednesday, October 21, 2015 17:23
> To: dev
> Subject: Re: [Solr] Error on documents makes ManifoldCF
>
> Standard Google searching finds it.
>
> See:
>
>
> http://mail-archives.apache.org/mod_mbox/manifoldcf-user/201503.mbox/%3C55127866020000250008FD2A@slesmail.veritablelp.com%3E
>
> Karl
>
>
> On Wed, Oct 21, 2015 at 11:14 AM, Frédéric Olier <FO...@wooxo.fr> wrote:
>
> > Hi,
> >
> > Thanks for your reply.
> >
> > I looked here :
> > http://mail-archives.apache.org/mod_mbox/manifoldcf-dev/
> >
> > But there is no 'search' option...
> >
> > Any idea where I can search what I'm looking for more efficiently ?
> >
> > Thanks​
> >
> >
> > -----Message d'origine-----
> > De : Karl Wright [mailto:daddywri@gmail.com] Envoyé : mercredi 21
> > octobre 2015 16:47 À : dev Objet : Re: [Solr] Error on documents makes
> > ManifoldCF
> >
> > Hi Frédéric,
> >
> > There's a flag in the Solr configuration you can set that will cause
> > exceptions from Solr Cell (Tika) to cause the document to be skipped
> > rather than causing ManifoldCF to retry the document.  I don't
> > remember what it is but others have noted it and you can search the mail
> archive to find it.
> >
> > Thanks,
> > Karl
> >
> >
> > On Wed, Oct 21, 2015 at 10:29 AM, Frédéric Olier <FO...@wooxo.fr>
> wrote:
> >
> > > Hi,
> > >
> > >
> > >
> > > We integrated Solr to ManifoldCF.
> > >
> > > We configured Solr to use the OCR engine.
> > >
> > >
> > >
> > > When we crawl documents MCF reads the docs fine and submit them to
> Solr.
> > >
> > >
> > >
> > > It happens on large files (PDF, images) that the OCR takes too long
> > > which leads to MCF request to fail.
> > >
> > >
> > >
> > > The annoying thing is that MCF does not ignore the file.
> > >
> > > On the next crawling, the file keeps failing.
> > >
> > >
> > >
> > > How could I tell manifold to skip the file that fails ?
> > >
> > >
> > >
> > > Thanks for your reply.
> > >
> > >
> > >
> > > [image: TOP 250 des éditeurs]
> > > <http://miblink.letsignit.com/r/3808/0a67e322-f9f6-4d7b-89bb-46f2830
> > > 87
> > > b34/undefined>
> > >
> > > [image: Logo]
> > > <http://miblink.letsignit.com/r/1794/1a6d2119-9a4e-4a6d-ba13-8730eac
> > > 1b
> > > 836/undefined>
> > >
> > > *Suivez-nous !*
> > >
> > > [image: Linkedin]
> > > <http://miblink.letsignit.com/r/1795/28939672-253e-4233-8ba0-9b8738a
> > > fa
> > > 52f/undefined>
> > >
> > > [image: Viadeo]
> > > <http://miblink.letsignit.com/r/1796/41a2cad7-8cc0-4a99-91f0-dec6f46
> > > 3f
> > > e83/undefined>
> > >
> > > [image: Twitter]
> > > <http://miblink.letsignit.com/r/1797/7a7a83af-ce3e-4d9e-83fa-aeb9d3b
> > > 26
> > > d01/undefined>
> > >
> > > [image: Googleplus]
> > > <http://miblink.letsignit.com/r/2870/20ae85fe-1e5f-4e23-b3f8-365a199
> > > 76
> > > f79/undefined>
> > >
> > > *Frédéric OLIER** | Responsable de la planification stratégique*
> > >
> > > * 33 442 016 891 33 662 635 031*
> > >
> > > *WOOXO*
> > > Tél : 0811 140 160
> > > Fax0811 481 507
> > > Immeuble Le Forum - Bât A - 3ème étage
> > > 515 av. de la Tramontane
> > > ZAC Athélia IV
> > > 13600 LA CIOTAT
> > > FRANCE
> > >
> > >
> > >
> > >
> > >
> >
>

RE: [Solr] Error on documents makes ManifoldCF

Posted by Frédéric Olier <FO...@wooxo.fr>.
Hi,

I managed to make progress on my issues.

The document (docx) is now skipped as expected when it fails.

However, I now have another issue.
I have a tar.gz file that itself contains 100+ tar.gz files.

ManifoldCF gets a 500 error from Solr, which causes the crawl to abort.
I looked at the Solr configuration and, due to the hardware used, I won't be able to tune the JVM and so on any further.

Therefore I'd like to know whether ManifoldCF can be configured to skip files for which it gets such an error instead of aborting?

Fred.


-----Original Message-----
From: Frédéric Olier [mailto:FOlier@wooxo.fr]
Sent: Wednesday, October 21, 2015 17:51
To: dev@manifoldcf.apache.org
Subject: RE: [Solr] Error on documents makes ManifoldCF

Hi Karl,

Many thanks.

I found the configuration to use here:
http://www.francelabs.com/blog/tutorial-for-combining-manifoldcf-and-solr-for-files-search/

Search for "ignoreTikaException".

I'll test it and see if it fixes my issue.

Fred



RE: [Solr] Error on documents makes ManifoldCF

Posted by Frédéric Olier <FO...@wooxo.fr>.
Hi Karl,

Many thanks.

I found the configuration to use here:
http://www.francelabs.com/blog/tutorial-for-combining-manifoldcf-and-solr-for-files-search/

Search for "ignoreTikaException".

I'll test it and see if it fixes my issue.

Fred​
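
For readers who want to see where the "ignoreTikaException" flag fits, here is a minimal Python sketch that builds a Solr Cell request URL carrying it. The tutorial linked above covers where to set it in Solr's configuration; passing it as a request parameter, the core URL, the document id and the helper name build_extract_url are all assumptions made for illustration.

```python
from urllib.parse import urlencode


def build_extract_url(solr_core_url, doc_id, ignore_tika_exception=True):
    """Build a Solr Cell (/update/extract) URL for one document.

    `solr_core_url` and `doc_id` are placeholders supplied by the caller.
    """
    params = {
        "literal.id": doc_id,
        # Ask Solr Cell to skip Tika parse failures rather than fail the request.
        "ignoreTikaException": "true" if ignore_tika_exception else "false",
    }
    return solr_core_url.rstrip("/") + "/update/extract?" + urlencode(params)
```

The sketch only builds the URL and sends nothing. Note that this flag concerns exceptions raised by Tika while parsing a document.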


-----Original Message-----
From: Karl Wright [mailto:daddywri@gmail.com]
Sent: Wednesday, October 21, 2015 17:23
To: dev
Subject: Re: [Solr] Error on documents makes ManifoldCF

A standard Google search finds it.

See:

http://mail-archives.apache.org/mod_mbox/manifoldcf-user/201503.mbox/%3C55127866020000250008FD2A@slesmail.veritablelp.com%3E

Karl



Re: [Solr] Error on documents makes ManifoldCF

Posted by Karl Wright <da...@gmail.com>.
A standard Google search finds it.

See:

http://mail-archives.apache.org/mod_mbox/manifoldcf-user/201503.mbox/%3C55127866020000250008FD2A@slesmail.veritablelp.com%3E

Karl


On Wed, Oct 21, 2015 at 11:14 AM, Frédéric Olier <FO...@wooxo.fr> wrote:

> Hi,
>
> Thanks for your reply.
>
> I looked here: http://mail-archives.apache.org/mod_mbox/manifoldcf-dev/
>
> But there is no 'search' option...
>
> Any idea where I can search for what I'm looking for more efficiently?
>
> Thanks

RE: [Solr] Error on documents makes ManifoldCF

Posted by Frédéric Olier <FO...@wooxo.fr>.
Hi,

Thanks for your reply.

I looked here: http://mail-archives.apache.org/mod_mbox/manifoldcf-dev/

But there is no 'search' option...

Any idea where I can search for what I'm looking for more efficiently?

Thanks


-----Original Message-----
From: Karl Wright [mailto:daddywri@gmail.com]
Sent: Wednesday, October 21, 2015 16:47
To: dev
Subject: Re: [Solr] Error on documents makes ManifoldCF

Hi Frédéric,

There's a flag in the Solr configuration you can set so that exceptions from Solr Cell (Tika) cause the document to be skipped rather than retried by ManifoldCF.  I don't remember what it is, but others have noted it and you can search the mail archive to find it.

Thanks,
Karl



Re: [Solr] Error on documents makes ManifoldCF

Posted by Karl Wright <da...@gmail.com>.
Hi Frédéric,

There's a flag in the Solr configuration you can set so that exceptions from
Solr Cell (Tika) cause the document to be skipped rather than retried by
ManifoldCF.  I don't remember what it is, but others have noted it and you
can search the mail archive to find it.

Thanks,
Karl


On Wed, Oct 21, 2015 at 10:29 AM, Frédéric Olier <FO...@wooxo.fr> wrote:

> Hi,
>
> We integrated Solr with ManifoldCF.
> We configured Solr to use the OCR engine.
>
> When we crawl documents, MCF reads the docs fine and submits them to Solr.
>
> On large files (PDF, images), the OCR sometimes takes too long, which
> causes the MCF request to fail.
>
> The annoying thing is that MCF does not ignore the file.
> On the next crawl, the file keeps failing.
>
> How can I tell ManifoldCF to skip a file that fails?
>
> Thanks for your reply.