Posted to solr-user@lucene.apache.org by "Martin Frank Hansen (MHQ)" <MH...@kmd.dk> on 2018/10/18 11:30:12 UTC

Tesseract language

Hi,

I have been trying to use Tesseract through the data-import-handler in Solr and it actually works very well – with English. As the documents are in Danish, I need to change the language setting in Tesseract to Danish as well. Is that possible from Solr?

I was using the update/extract handler to import single files into Solr, and it worked for a single file. How would I import several files from a file system?

Here is the request-handler I used:

<requestHandler name="/update/extract"
                  startup="lazy"
                  class="solr.extraction.ExtractingRequestHandler" >
    <lst name="defaults">
      <str name="lowernames">false</str>
      <str name="uprefix">ignored_</str>
      <str name="captureAttr">true</str>
    </lst>
  </requestHandler>


Martin Frank Hansen, Senior Data Analytiker

Data, IM & Analytics

Lautrupparken 40-42, DK-2750 Ballerup
E-mail mhq@kmd.dk  Web www.kmd.dk
Mobil +4525571418


Protection of your personal data is important to us. Here you can read KMD’s Privacy Policy<http://www.kmd.net/Privacy-Policy> outlining how we process your personal data.

Please note that this message may contain confidential information. If you have received this message by mistake, please inform the sender of the mistake by sending a reply, then delete the message from your system without making, distributing or retaining any copies of it. Although we believe that the message and any attachments are free from viruses and other errors that might affect the computer or it-system where it is received and read, the recipient opens the message at his or her own risk. We assume no responsibility for any loss or damage arising from the receipt or use of this message.

SV: Tesseract language

Posted by "Martin Frank Hansen (MHQ)" <MH...@kmd.dk>.
Hi Alex,

Thanks again for your reply, much appreciated.


-----Original message-----
From: Alexandre Rafalovitch <ar...@gmail.com>
Sent: 21 October 2018 19:13
To: solr-user <so...@lucene.apache.org>
Subject: Re: Tesseract language

Usually, we just recommend building a custom solution that connects through the SolrJ client. This gives you maximum flexibility and lets you integrate Tika either inside your code or as a server. I believe the latest Tika also has some off-thread handling that makes it safer to embed.

For DIH alternatives, if you prefer configuration over custom code, you could look at something like Apache NiFi. It can push data into Solr.
Obviously it is a bigger solution, but it is correspondingly more robust too.
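
As a rough illustration of the first step of such a custom client, the sketch below only collects the TIFF files under a directory, using nothing but the Java standard library. The class and method names are hypothetical, and the Tika extraction and SolrJ indexing steps are deliberately left out (marked with a comment), since their details depend on the versions in use.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;
import java.util.Locale;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical helper: not part of Solr, SolrJ or Tika.
final class TiffFileCollector {

    /** Recursively lists regular files under root whose name ends in .tif or .tiff. */
    static List<Path> collectTiffFiles(Path root) {
        try (Stream<Path> paths = Files.walk(root)) {
            return paths
                    .filter(Files::isRegularFile)
                    .filter(TiffFileCollector::isTiff)
                    .sorted()
                    .collect(Collectors.toList());
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Case-insensitive check of the file extension. */
    static boolean isTiff(Path path) {
        String name = path.getFileName().toString().toLowerCase(Locale.ROOT);
        return name.endsWith(".tif") || name.endsWith(".tiff");
    }

    public static void main(String[] args) {
        Path root = Paths.get(args.length > 0 ? args[0] : ".");
        for (Path file : collectTiffFiles(root)) {
            // A real client would run Tika on the file here and send the
            // extracted text to Solr through SolrJ.
            System.out.println(file);
        }
    }
}
```

Running it with a directory as the only argument prints one path per matching file; each path is where the extraction and indexing calls would go.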

Regards,
   Alex.
On Sun, 21 Oct 2018 at 11:07, Martin Frank Hansen (MHQ) <MH...@kmd.dk> wrote:
>
> Hi Alexandre,
>
> Thanks for your reply.
>
> Yes, right now it is just for testing the possibilities of Solr and Tesseract.
>
> I will take a look at the Tika documentation to see if I can make it work.
>
> You said that DIH is not recommended for production use. What are the recommended methods for uploading data to a Solr instance?
>
> Best regards
>
> Martin Frank Hansen
>
> -----Original message-----
> From: Alexandre Rafalovitch <ar...@gmail.com>
> Sent: 21 October 2018 16:26
> To: solr-user <so...@lucene.apache.org>
> Subject: Re: Tesseract language
>
> There are a couple of things mixed in here:
> 1) The extract handler is not recommended for production use. It is great for a quick test, just like you did, but for production it is better to run Tika externally. Tika, especially with large files, can use up a lot of memory and trip up the Solr instance it is running within.
> 2) If you are still just testing, you can configure Tika within Solr by specifying a parseContext.config file, as shown at the link and described further down in the same document:
> https://lucene.apache.org/solr/guide/7_5/uploading-data-with-solr-cell-using-apache-tika.html#configuring-the-solr-extractingrequesthandler
> You still need to check the Tika documentation to see whether Tesseract can take its configuration from the parseContext file.
> 3) If you are still testing with multiple files, the Data Import Handler can iterate through the files and then, as a nested entity, feed each one to the Tika processor for further extraction. I think one of the examples shows that.
> However, I am not sure you can pass parseContext that way, and DIH is also not recommended for production.
>
> I hope this helps,
>     Alex.
>
> On Sun, 21 Oct 2018 at 09:24, Martin Frank Hansen (MHQ) <MH...@kmd.dk> wrote:
>
> > Hi again,
> >
> >
> >
> > Does anyone have experience of using Tesseract’s OCR
> > module within Solr? The files I am trying to read into Solr are
> > Danish TIFF documents.

SV: Tesseract language

Posted by "Martin Frank Hansen (MHQ)" <MH...@kmd.dk>.
Hi Gus,

Thank you so much! I will definitely take a look at it during the day.


Martin Frank Hansen,

-----Original message-----
From: Gus Heck <gu...@gmail.com>
Sent: 22 October 2018 00:06
To: solr-user@lucene.apache.org
Subject: Re: Tesseract language

Hi Martin,

I wrote a framework (https://github.com/nsoft/jesterj) that is meant to help with small to medium custom solutions. It's not (yet) ready for cases where you need multiple machines feeding data, but so long as a single box can do the work it should be useful. It has a basic Tika stage which is ripe for enhancement. The example in the project uses Tika to extract text from Shakespeare's plays, though I'll admit that its Tika processor class has not yet been given the full set of configuration options. Fleshing that out is on the list of things to do and would be easy and welcome as a contribution (https://github.com/nsoft/jesterj/issues/74).

-Gus




--
http://www.the111shift.com

Re: Tesseract language

Posted by Gus Heck <gu...@gmail.com>.
Hi Martin,

I wrote a framework (https://github.com/nsoft/jesterj) that is meant to
help with small to medium custom solutions. It's not (yet) ready for cases
where you need multiple machines feeding data, but so long as a single box
can do the work it should be useful. It has a basic Tika stage which is
ripe for enhancement. The example in the project uses Tika to extract text
from Shakespeare's plays, though I'll admit that its Tika processor class
has not yet been given the full set of configuration options. Fleshing
that out is on the list of things to do and would be easy and welcome as a
contribution (https://github.com/nsoft/jesterj/issues/74).

-Gus




-- 
http://www.the111shift.com

RE: Tesseract language

Posted by "Martin Frank Hansen (MHQ)" <MH...@kmd.dk>.
Hi Tim and Rohan,

Really appreciate your help, and I finally made it work (without tess4j).

It was the environment variable that had the wrong setting. Instead of setting TESSDATA_PREFIX to 'Tesseract-OCR/tessdata', I changed it to the parent folder 'Tesseract-OCR', and now it works for Danish.
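
A small check along these lines can catch that kind of misconfiguration before running OCR. It is only a sketch with hypothetical class and method names, and it assumes the layout described above, where TESSDATA_PREFIX is the parent of the "tessdata" folder. Other Tesseract versions may resolve the prefix differently, so the expected path should be verified against the installed version.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper: not part of Tesseract, Tika or tess4j.
final class TessdataCheck {

    /** Where the language file is expected when the prefix is the parent of "tessdata". */
    static Path trainedDataUnderPrefix(Path prefix, String language) {
        return prefix.resolve("tessdata").resolve(language + ".traineddata");
    }

    /** True if the language file exists in the "tessdata" folder below the prefix. */
    static boolean hasLanguage(Path prefix, String language) {
        return Files.isRegularFile(trainedDataUnderPrefix(prefix, language));
    }

    public static void main(String[] args) {
        String prefix = System.getenv("TESSDATA_PREFIX");
        if (prefix == null) {
            System.out.println("TESSDATA_PREFIX is not set");
            return;
        }
        Path expected = trainedDataUnderPrefix(Paths.get(prefix), "dan");
        System.out.println(expected + (Files.isRegularFile(expected) ? " found" : " missing"));
    }
}
```

Under this assumption, a prefix of 'Tesseract-OCR/tessdata' makes the check look for 'Tesseract-OCR/tessdata/tessdata/dan.traineddata', which makes the doubled folder name easy to spot.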

Thanks again for helping.

Best regards

Martin

-----Original Message-----
From: Tim Allison <ta...@apache.org>
Sent: 27 October 2018 14:37
To: solr-user@lucene.apache.org; user@tika.apache.org
Subject: Re: Tesseract language

Martin,
  Let’s move this over to user@tika.

Rohan,
  Is there something about Tika’s use of tesseract for image files that can be improved?

    Best,
       Tim

On Sat, Oct 27, 2018 at 3:40 AM Rohan Kasat <ro...@gmail.com> wrote:

> I used tess4j for image formats and Tika for scanned PDFs and images
> within PDFs.
>
> Regards,
> Rohan Kasat
>
> On Sat, Oct 27, 2018 at 12:39 AM Martin Frank Hansen (MHQ)
> <MH...@kmd.dk>
> wrote:
>
> > Hi Rohan,
> >
> > Thanks for your reply, are you using tess4j with Tika or on its own?
> > I will take a look at tess4j if I can't make it work with Tika alone.
> >
> > Best regards
> > Martin
> >
> >
> > -----Original Message-----
> > From: Rohan Kasat <ro...@gmail.com>
> > Sent: 26 October 2018 21:45
> > To: solr-user@lucene.apache.org
> > Subject: Re: Tesseract language
> >
> > Hi Martin,
> >
> > Are you using it for image formats? I think you can try tess4j and
> > set TESSDATA_PREFIX as the home for the Tesseract configs.
> >
> > I have tried it and it works pretty well on my local machine.
> >
> > I used Java 8 and Tesseract 3 for this.
> >
> > Regards,
> > Rohan Kasat
> >
> > On Fri, Oct 26, 2018 at 12:31 PM Martin Frank Hansen (MHQ)
> > <MH...@kmd.dk>
> > wrote:
> >
> > > Hi Tim,
> > >
> > > You were right.
> > >
> > > When I called `tesseract testing/eurotext.png testing/eurotext-dan
> > > -l dan`, I got an error message, so I downloaded "dan.traineddata"
> > > and added it to the Tesseract-OCR/tessdata folder. I also added
> > > the TESSDATA_PREFIX environment variable, pointing to
> > > "Tesseract-OCR/tessdata".
> > >
> > > Now Tesseract works with Danish from the command line, but I
> > > can't make the code work in Java, not even with default settings
> > > (which I could before). Am I missing something or just mixing some things up?
> > >
> > >
> > >
> > > -----Original Message-----
> > > From: Tim Allison <ta...@apache.org>
> > > Sent: 26 October 2018 19:58
> > > To: solr-user@lucene.apache.org
> > > Subject: Re: Tesseract language
> > >
> > > Tika relies on you to install tesseract and all the language
> > > libraries you'll need.
> > >
> > > If you can successfully call `tesseract testing/eurotext.png
> > > testing/eurotext-dan -l dan`, Tika _should_ be able to specify "dan"
> > > with your code above.
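
One way to confirm which language packs the installed binary can see is to ask Tesseract itself. The sketch below is illustrative only (hypothetical class and method names): it assumes the `tesseract` binary is on the PATH, that `tesseract --list-langs` prints a header line ending in a colon followed by one language code per line, and it merges stderr into stdout in case the list is printed there.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical helper: not part of Tika or Tesseract.
final class TesseractLanguages {

    /** Extracts language codes from the text printed by "tesseract --list-langs". */
    static List<String> parseLanguages(String output) {
        List<String> languages = new ArrayList<>();
        for (String line : output.split("\\R")) {
            String trimmed = line.trim();
            // Skip blank lines and the header, which is assumed to end with a colon.
            if (trimmed.isEmpty() || trimmed.endsWith(":")) {
                continue;
            }
            languages.add(trimmed);
        }
        return languages;
    }

    /** Runs the tesseract binary found on the PATH and returns its combined output. */
    static String runListLangs() throws IOException, InterruptedException {
        Process process = new ProcessBuilder("tesseract", "--list-langs")
                .redirectErrorStream(true) // merge stderr into stdout
                .start();
        String output;
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(process.getInputStream(), StandardCharsets.UTF_8))) {
            output = reader.lines().collect(Collectors.joining("\n"));
        }
        process.waitFor();
        return output;
    }

    public static void main(String[] args) throws InterruptedException {
        try {
            List<String> languages = parseLanguages(runListLangs());
            System.out.println(languages.contains("dan")
                    ? "Danish language data is installed"
                    : "Danish language data is missing; installed: " + languages);
        } catch (IOException e) {
            System.out.println("Could not run tesseract: " + e.getMessage());
        }
    }
}
```

If "dan" is missing from the list, the problem lies with the Tesseract installation or its environment rather than with the Tika code.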
> > > On Fri, Oct 26, 2018 at 10:49 AM Martin Frank Hansen (MHQ)
> > > <MH...@kmd.dk>
> > > wrote:
> > > >
> > > > > Hi again,
> > > > >
> > > > > I have now moved the OCR part to Tika, but I still can't make it work
> > > > > with Danish. It works with the default language settings, and it seems
> > > > > like Tika is missing the Danish dictionary.
> > > > >
> > > > > My Java code looks like this:
> > > >
> > > > {
> > > >             File file = new File(pathfilename);
> > > >
> > > >             Metadata meta = new Metadata();
> > > >
> > > >             InputStream stream = TikaInputStream.get(file);
> > > >
> > > >             Parser parser = new AutoDetectParser();
> > > >             BodyContentHandler handler = new
> > > > BodyContentHandler(Integer.MAX_VALUE);
> > > >
> > > >             TesseractOCRConfig config = new TesseractOCRConfig();
> > > >             config.setLanguage("dan"); // code works if this
> > > > phrase is
> > > commented out.
> > > >
> > > >             ParseContext parseContext = new ParseContext();
> > > >
> > > >              parseContext.set(TesseractOCRConfig.class, config);
> > > >
> > > >             parser.parse(stream, handler, meta, parseContext);
> > > >             System.out.println(handler.toString());
> > > > }
> > > >
> > > > Hope that someone can help here.
> > > >
> > > > > -----Original Message-----
> > > > > From: Martin Frank Hansen (MHQ) <MH...@kmd.dk>
> > > > > Sent: 22 October 2018 07:58
> > > > > To: solr-user@lucene.apache.org
> > > > > Subject: SV: Tesseract language
> > > > Hi Erick,
> > > >
> > > > Thanks for the help! I will take a look at it.
> > > >
> > > >
> > > > Martin Frank Hansen, Senior Data Analytiker
> > > >
> > > > Data, IM & Analytics
> > > >
> > > >
> > > >
> > > > Lautrupparken 40-42, DK-2750 Ballerup E-mail mhq@kmd.dk  Web
> > > > www.kmd.dk Mobil +4525571418
> > > >
> > > > > -----Original message-----
> > > > > From: Erick Erickson <er...@gmail.com>
> > > > > Sent: 21 October 2018 22:49
> > > > > To: solr-user <so...@lucene.apache.org>
> > > > > Subject: Re: Tesseract language
> > > >
> > > > > Here's a skeletal program that uses Tika in a stand-alone client.
> > > > > Rip the RDBMS parts out....
> > > >
> > > > https://lucidworks.com/2012/02/14/indexing-with-solrj/
> > > > On Sun, Oct 21, 2018 at 1:13 PM Alexandre Rafalovitch <
> > > arafalov@gmail.com> wrote:
> > > > >
> > > > > Usually, we just say to do a custom solution using SolrJ
> > > > > client to connect. This gives you maximum flexibility and
> > > > > allows to integrate Tika either inside your code or as a
> > > > > server. Latest Tika actually has some off-thread handling I
> > > > > believe, to make it safer
> to
> > embed.
> > > > >
> > > > > For DIH alternatives, if you want configuration over custom
> > > > > code, you could look at something like Apache NiFI. It can
> > > > > push data into
> > > Solr.
> > > > > Obviously it is a bigger solution, but it is correspondingly
> > > > > more robust too.
> > > > >
> > > > > Regards,
> > > > >    Alex.
> > > > > On Sun, 21 Oct 2018 at 11:07, Martin Frank Hansen (MHQ)
> > > > > <MH...@kmd.dk>
> > > wrote:
> > > > > >
> > > > > > Hi Alexandre,
> > > > > >
> > > > > > Thanks for your reply.
> > > > > >
> > > > > > Yes right now it is just for testing the possibilities of
> > > > > > Solr and
> > > Tesseract.
> > > > > >
> > > > > > I will take a look at the Tika documentation to see if I can
> > > > > > make it
> > > work.
> > > > > >
> > > > > > You said that DIH are not recommended for production usage,
> > > > > > what is
> > > the recommended method(s) to upload data to a Solr instance?
> > > > > >
> > > > > > Best regards
> > > > > >
> > > > > > Martin Frank Hansen
> > > > > >
> > > > > > -----Oprindelig meddelelse-----
> > > > > > Fra: Alexandre Rafalovitch <ar...@gmail.com>
> > > > > > Sendt: 21. oktober 2018 16:26
> > > > > > Til: solr-user <so...@lucene.apache.org>
> > > > > > Emne: Re: Tesseract language
> > > > > >
> > > > > > There are a couple of things mixed in here:
> > > > > > 1) Extract handler is not recommended for production usage.
> > > > > > It is great for a quick test, just like you did it, but going
> > > > > > to production, running it externally is better. Tika,
> > > > > > especially with large files, can use up a lot of memory and
> > > > > > trip up the Solr instance it is running within.
> > > > > > 2) If you are still just testing, you can configure Tika
> > > > > > within Solr by specifying a parseContext.config file as shown
> > > > > > at the link and described further down in the same document:
> > > > > > https://lucene.apache.org/solr/guide/7_5/uploading-data-with-solr-cell-using-apache-tika.html#configuring-the-solr-extractingrequesthandler
> > > > > > You still need to check the Tika documentation on whether
> > > > > > Tesseract can take its configuration from the parseContext
> > > > > > file.
> > > > > > 3) If you are still testing with multiple files, Data Import
> > > > > > Handler can iterate through files and then, as a nested
> > > > > > entity, feed them to the Tika processor for further
> > > > > > extraction. I think one of the examples shows that.
> > > > > > However, I am not sure you can pass parseContext that way,
> > > > > > and DIH is also not recommended for production.
> > > > > >
> > > > > > I hope this helps,
> > > > > >     Alex.
> > > > > >
> > > > > > On Sun, 21 Oct 2018 at 09:24, Martin Frank Hansen (MHQ)
> > > > > > <MH...@kmd.dk>
> > > wrote:
> > > > > >
> > > > > > > Hi again,
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > Is there anyone who has some experience of using
> > > > > > > Tesseract’s OCR module within Solr? The files I am trying
> > > > > > > to read into Solr are Danish TIFF documents.
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > *Martin Frank Hansen*, Senior Data Analytiker
> > > > > > >
> > > > > > > Data, IM & Analytics
> > > > > > >
> > > > > > > [image: cid:image001.png@01D383C9.6C129A60]
> > > > > > >
> > > > > > >
> > > > > > > Lautrupparken 40-42, DK-2750 Ballerup E-mail mhq@kmd.dk
> > > > > > > Web www.kmd.dk Mobil +4525571418
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > *From:* Martin Frank Hansen (MHQ) <MH...@kmd.dk>
> > > > > > > *Sent:* 18 October 2018 13:30
> > > > > > > *To:* solr-user@lucene.apache.org
> > > > > > > *Subject:* Tesseract language
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > Hi,
> > > > > > >
> > > > > > > I have been trying to use Tesseract through the
> > > > > > > data-import-handler in Solr and it actually works very
> > > > > > > well – with English. As the documents are in Danish, I
> > > > > > > need to change the language setting in Tesseract to Danish
> > > > > > > as well, is that possible from Solr?
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > I was using the update/extract-handler to import single
> > > > > > > files into Solr, and it worked for a single file, how
> > > > > > > would I implement several files from a file-system?
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > Here is the request-handler I used:
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > <requestHandler name="/update/extract"
> > > > > > >                   startup="lazy"
> > > > > > >                   class="solr.extraction.ExtractingRequestHandler" >
> > > > > > >     <lst name="defaults">
> > > > > > >       <str name="lowernames">false</str>
> > > > > > >       <str name="uprefix">ignored_</str>
> > > > > > >       <str name="captureAttr">true</str>
> > > > > > >     </lst>
> > > > > > >   </requestHandler>
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > *Martin Frank Hansen*, Senior Data Analytiker
> > > > > > >
> > > > > > > Data, IM & Analytics
> > > > > > >
> > > > > > > [image: cid:image001.png@01D383C9.6C129A60]
> > > > > > >
> > > > > > >
> > > > > > > Lautrupparken 40-42, DK-2750 Ballerup E-mail mhq@kmd.dk
> > > > > > > Web www.kmd.dk Mobil +4525571418
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > >
> > >
> > --
> >
> > *Regards,Rohan Kasat*
> >
> --
>
> *Regards,Rohan Kasat*
>
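
For the question quoted above about importing several files from a file
system, here is a minimal client-side sketch, assuming the documents are
TIFF files under one root folder. The class name TiffFileLister and its
methods are hypothetical and are not part of Solr, SolrJ or Tika. The
sketch only walks the directory tree with the JDK; the Tika extraction
and the Solr update are left as a comment.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;
import java.util.Locale;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical helper: collects the TIFF files an external indexer would process.
class TiffFileLister {

    /** True for regular files whose name ends in .tif or .tiff (case-insensitive). */
    static boolean isTiff(Path p) {
        if (!Files.isRegularFile(p)) {
            return false;
        }
        String name = p.getFileName().toString().toLowerCase(Locale.ROOT);
        return name.endsWith(".tif") || name.endsWith(".tiff");
    }

    /** Recursively collects TIFF files under root, sorted for a stable processing order. */
    static List<Path> listTiffs(Path root) throws IOException {
        try (Stream<Path> paths = Files.walk(root)) {
            return paths.filter(TiffFileLister::isTiff)
                        .sorted()
                        .collect(Collectors.toList());
        }
    }

    public static void main(String[] args) throws IOException {
        Path root = Paths.get(args.length > 0 ? args[0] : ".");
        for (Path tiff : listTiffs(root)) {
            // Assumption: each file would be passed to Tika for OCR here,
            // and the extracted text sent to Solr by the client.
            System.out.println(tiff);
        }
    }
}
```

Running the walk in a separate client process keeps the memory used by
extraction out of the Solr instance, which is the concern raised in the
quoted reply.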

Re: Tesseract language

Posted by Tim Allison <ta...@apache.org>.
Martin,
  Let’s move this over to user@tika.

Rohan,
  Is there something about Tika’s use of tesseract for image files that can
be improved?

    Best,
       Tim

On Sat, Oct 27, 2018 at 3:40 AM Rohan Kasat <ro...@gmail.com> wrote:

> I used tess4j for image formats and Tika for scanned PDFs and images within
> PDFs.
>
> Regards,
> Rohan Kasat
>
> On Sat, Oct 27, 2018 at 12:39 AM Martin Frank Hansen (MHQ) <MH...@kmd.dk>
> wrote:
>
> > Hi Rohan,
> >
> > Thanks for your reply, are you using tess4j with Tika or on its own?  I
> > will take a look at tess4j if I can't make it work with Tika alone.
> >
> > Best regards
> > Martin
> >
> >
> > -----Original Message-----
> > From: Rohan Kasat <ro...@gmail.com>
> > Sent: 26 October 2018 21:45
> > To: solr-user@lucene.apache.org
> > Subject: Re: Tesseract language
> >
> > Hi Martin,
> >
> > Are you using it for image formats? I think you can try tess4j and
> > give TESSDATA_PREFIX as the home for the Tesseract configs.
> >
> > I have tried it and it works pretty well on my local machine.
> >
> > I have used Java 8 and Tesseract 3 for the same.
> >
> > Regards,
> > Rohan Kasat
> >
> > On Fri, Oct 26, 2018 at 12:31 PM Martin Frank Hansen (MHQ) <MH...@kmd.dk>
> > wrote:
> >
> > > Hi Tim,
> > >
> > > You were right.
> > >
> > > When I called `tesseract testing/eurotext.png testing/eurotext-dan -l
> > > dan`, I got an error message so I downloaded "dan.traineddata" and
> > > added it to the Tesseract-OCR/tessdata folder. Furthermore I added the
> > > 'TESSDATA_PREFIX' variable to the path-variables pointing to
> > > "Tesseract-OCR/tessdata".
> > >
> > > Now Tesseract works with Danish language from the CMD, but now I can't
> > > make the code work in Java, not even with default settings (which I
> > > could before). Am I missing something or just mixing some things up?
> > >
> > >
> > >
> > > -----Original Message-----
> > > From: Tim Allison <ta...@apache.org>
> > > Sent: 26 October 2018 19:58
> > > To: solr-user@lucene.apache.org
> > > Subject: Re: Tesseract language
> > >
> > > Tika relies on you to install tesseract and all the language libraries
> > > you'll need.
> > >
> > > If you can successfully call `tesseract testing/eurotext.png
> > > testing/eurotext-dan -l dan`, Tika _should_ be able to specify "dan"
> > > with your code above.
> > > On Fri, Oct 26, 2018 at 10:49 AM Martin Frank Hansen (MHQ)
> > > <MH...@kmd.dk>
> > > wrote:
> > > >
> > > > Hi again,
> > > >
> > > > Now I moved the OCR part to Tika, but I still can't make it work
> > > > with Danish. It works when using default language settings and it
> > > > seems like Tika is missing the Danish dictionary.
> > > >
> > > > My java code looks like this:
> > > >
> > > > {
> > > >             File file = new File(pathfilename);
> > > >
> > > >             Metadata meta = new Metadata();
> > > >
> > > >             InputStream stream = TikaInputStream.get(file);
> > > >
> > > >             Parser parser = new AutoDetectParser();
> > > >             BodyContentHandler handler = new BodyContentHandler(Integer.MAX_VALUE);
> > > >
> > > >             TesseractOCRConfig config = new TesseractOCRConfig();
> > > >             config.setLanguage("dan"); // code works if this phrase is commented out.
> > > >
> > > >             ParseContext parseContext = new ParseContext();
> > > >
> > > >             parseContext.set(TesseractOCRConfig.class, config);
> > > >
> > > >             parser.parse(stream, handler, meta, parseContext);
> > > >             System.out.println(handler.toString());
> > > > }
> > > >
> > > > Hope that someone can help here.
> > > >
> > > > -----Original Message-----
> > > > From: Martin Frank Hansen (MHQ) <MH...@kmd.dk>
> > > > Sent: 22 October 2018 07:58
> > > > To: solr-user@lucene.apache.org
> > > > Subject: SV: Tesseract language
> > > >
> > > > Hi Erick,
> > > >
> > > > Thanks for the help! I will take a look at it.
> > > >
> > > >
> > > > Martin Frank Hansen, Senior Data Analytiker
> > > >
> > > > Data, IM & Analytics
> > > >
> > > >
> > > >
> > > > Lautrupparken 40-42, DK-2750 Ballerup E-mail mhq@kmd.dk  Web
> > > > www.kmd.dk Mobil +4525571418
> > > >
> > > > -----Original Message-----
> > > > From: Erick Erickson <er...@gmail.com>
> > > > Sent: 21 October 2018 22:49
> > > > To: solr-user <so...@lucene.apache.org>
> > > > Subject: Re: Tesseract language
> > > >
> > > > Here's a skeletal program that uses Tika in a stand-alone client.
> > > > Rip the RDBMS parts out....
> > > >
> > > > https://lucidworks.com/2012/02/14/indexing-with-solrj/
> > > > On Sun, Oct 21, 2018 at 1:13 PM Alexandre Rafalovitch
> > > > <arafalov@gmail.com> wrote:
> > > > >
> > > > > Usually, we just say to do a custom solution using a SolrJ client
> > > > > to connect. This gives you maximum flexibility and allows you to
> > > > > integrate Tika either inside your code or as a server. Latest Tika
> > > > > actually has some off-thread handling, I believe, to make it safer
> > > > > to embed.
> > > > >
> > > > > For DIH alternatives, if you want configuration over custom code,
> > > > > you could look at something like Apache NiFi. It can push data
> > > > > into Solr.
> > > > > Obviously it is a bigger solution, but it is correspondingly more
> > > > > robust too.
> > > > >
> > > > > Regards,
> > > > >    Alex.
> > > > > On Sun, 21 Oct 2018 at 11:07, Martin Frank Hansen (MHQ)
> > > > > <MH...@kmd.dk>
> > > wrote:
> > > > > >
> > > > > > Hi Alexandre,
> > > > > >
> > > > > > Thanks for your reply.
> > > > > >
> > > > > > Yes, right now it is just for testing the possibilities of
> > > > > > Solr and Tesseract.
> > > > > >
> > > > > > I will take a look at the Tika documentation to see if I can
> > > > > > make it work.
> > > > > >
> > > > > > You said that DIH is not recommended for production usage.
> > > > > > What is the recommended method(s) to upload data to a Solr
> > > > > > instance?
> > > > > >
> > > > > > Best regards
> > > > > >
> > > > > > Martin Frank Hansen
> > > > > >
> > > > > > -----Original Message-----
> > > > > > From: Alexandre Rafalovitch <ar...@gmail.com>
> > > > > > Sent: 21 October 2018 16:26
> > > > > > To: solr-user <so...@lucene.apache.org>
> > > > > > Subject: Re: Tesseract language
> > > > > >
> > > > > > There are a couple of things mixed in here:
> > > > > > 1) Extract handler is not recommended for production usage.
> > > > > > It is great for a quick test, just like you did it, but going
> > > > > > to production, running it externally is better. Tika,
> > > > > > especially with large files, can use up a lot of memory and
> > > > > > trip up the Solr instance it is running within.
> > > > > > 2) If you are still just testing, you can configure Tika
> > > > > > within Solr by specifying a parseContext.config file as shown
> > > > > > at the link and described further down in the same document:
> > > > > > https://lucene.apache.org/solr/guide/7_5/uploading-data-with-solr-cell-using-apache-tika.html#configuring-the-solr-extractingrequesthandler
> > > > > > You still need to check the Tika documentation on whether
> > > > > > Tesseract can take its configuration from the parseContext
> > > > > > file.
> > > > > > 3) If you are still testing with multiple files, Data Import
> > > > > > Handler can iterate through files and then, as a nested
> > > > > > entity, feed them to the Tika processor for further
> > > > > > extraction. I think one of the examples shows that.
> > > > > > However, I am not sure you can pass parseContext that way,
> > > > > > and DIH is also not recommended for production.
> > > > > >
> > > > > > I hope this helps,
> > > > > >     Alex.
> > > > > >
> > > > > > On Sun, 21 Oct 2018 at 09:24, Martin Frank Hansen (MHQ)
> > > > > > <MH...@kmd.dk>
> > > wrote:
> > > > > >
> > > > > > > Hi again,
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > Is there anyone who has some experience of using Tesseract’s
> > > > > > > OCR module within Solr? The files I am trying to read into
> > > > > > > Solr are Danish TIFF documents.
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > *Martin Frank Hansen*, Senior Data Analytiker
> > > > > > >
> > > > > > > Data, IM & Analytics
> > > > > > >
> > > > > > > [image: cid:image001.png@01D383C9.6C129A60]
> > > > > > >
> > > > > > >
> > > > > > > Lautrupparken 40-42, DK-2750 Ballerup E-mail mhq@kmd.dk  Web
> > > > > > > www.kmd.dk Mobil +4525571418
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > *From:* Martin Frank Hansen (MHQ) <MH...@kmd.dk>
> > > > > > > *Sent:* 18 October 2018 13:30
> > > > > > > *To:* solr-user@lucene.apache.org
> > > > > > > *Subject:* Tesseract language
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > Hi,
> > > > > > >
> > > > > > > I have been trying to use Tesseract through the
> > > > > > > data-import-handler in Solr and it actually works very well –
> > > > > > > with English. As the documents are in Danish, I need to change
> > > > > > > the language setting in Tesseract to Danish as well, is that
> > > > > > > possible from Solr?
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > I was using the update/extract-handler to import single files
> > > > > > > into Solr, and it worked for a single file, how would I
> > > > > > > implement several files from a file-system?
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > Here is the request-handler I used:
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > <requestHandler name="/update/extract"
> > > > > > >                   startup="lazy"
> > > > > > >                   class="solr.extraction.ExtractingRequestHandler" >
> > > > > > >     <lst name="defaults">
> > > > > > >       <str name="lowernames">false</str>
> > > > > > >       <str name="uprefix">ignored_</str>
> > > > > > >       <str name="captureAttr">true</str>
> > > > > > >     </lst>
> > > > > > >   </requestHandler>
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > *Martin Frank Hansen*, Senior Data Analytiker
> > > > > > >
> > > > > > > Data, IM & Analytics
> > > > > > >
> > > > > > > [image: cid:image001.png@01D383C9.6C129A60]
> > > > > > >
> > > > > > >
> > > > > > > Lautrupparken 40-42, DK-2750 Ballerup E-mail mhq@kmd.dk  Web
> > > > > > > www.kmd.dk Mobil +4525571418
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > >
> > >
> > --
> >
> > *Regards,Rohan Kasat*
> >
> --
>
> *Regards,Rohan Kasat*
>
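
The exchange quoted above turns on whether dan.traineddata can be found
through TESSDATA_PREFIX once Java is involved. What follows is a minimal
diagnostic sketch, not a fix. The class name TessdataCheck and its
methods are hypothetical; the code uses only the JDK and does not call
Tika or Tesseract. It looks in the folder named by TESSDATA_PREFIX and in
a tessdata subfolder of it, on the assumption (to be verified against
your own installation) that the expected location differs between
Tesseract versions.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.List;
import java.util.Optional;

// Hypothetical helper: reports what the JVM can see of the Tesseract language data.
class TessdataCheck {

    /** Looks for <lang>.traineddata directly under prefix, then under prefix/tessdata. */
    static Optional<Path> findTrainedData(Path prefix, String lang) {
        String fileName = lang + ".traineddata";
        Path direct = prefix.resolve(fileName);
        if (Files.isRegularFile(direct)) {
            return Optional.of(direct);
        }
        Path nested = prefix.resolve("tessdata").resolve(fileName);
        if (Files.isRegularFile(nested)) {
            return Optional.of(nested);
        }
        return Optional.empty();
    }

    /** Builds the command line used in the thread: tesseract <image> <outputBase> -l <lang>. */
    static List<String> tesseractCommand(String image, String outputBase, String lang) {
        return Arrays.asList("tesseract", image, outputBase, "-l", lang);
    }

    public static void main(String[] args) {
        String lang = args.length > 0 ? args[0] : "dan";
        String prefix = System.getenv("TESSDATA_PREFIX");
        if (prefix == null) {
            System.out.println("TESSDATA_PREFIX is not visible to this JVM");
            return;
        }
        Optional<Path> found = findTrainedData(Paths.get(prefix), lang);
        System.out.println(found.map(p -> "Found " + p)
                .orElse("No " + lang + ".traineddata under " + prefix));
        System.out.println("CLI check: " + String.join(" ",
                tesseractCommand("testing/eurotext.png", "testing/eurotext-dan", lang)));
    }
}
```

Run it from the same IDE or console that launches the Tika code. A
variable added after that parent process was started is usually not seen
by the JVM until the parent is restarted.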

Re: Tesseract language

Posted by Rohan Kasat <ro...@gmail.com>.
I used tess4j for image formats and Tika for scanned PDFs and images within
PDFs.

Regards,
Rohan Kasat

On Sat, Oct 27, 2018 at 12:39 AM Martin Frank Hansen (MHQ) <MH...@kmd.dk>
wrote:

> Hi Rohan,
>
> Thanks for your reply, are you using tess4j with Tika or on its own?  I
> will take a look at tess4j if I can't make it work with Tika alone.
>
> Best regards
> Martin
>
>
> -----Original Message-----
> From: Rohan Kasat <ro...@gmail.com>
> Sent: 26 October 2018 21:45
> To: solr-user@lucene.apache.org
> Subject: Re: Tesseract language
>
> Hi Martin,
>
> Are you using it for image formats? I think you can try tess4j and
> give TESSDATA_PREFIX as the home for the Tesseract configs.
>
> I have tried it and it works pretty well on my local machine.
>
> I have used Java 8 and Tesseract 3 for the same.
>
> Regards,
> Rohan Kasat
>
> On Fri, Oct 26, 2018 at 12:31 PM Martin Frank Hansen (MHQ) <MH...@kmd.dk>
> wrote:
>
> > Hi Tim,
> >
> > You were right.
> >
> > When I called `tesseract testing/eurotext.png testing/eurotext-dan -l
> > dan`, I got an error message so I downloaded "dan.traineddata" and
> > added it to the Tesseract-OCR/tessdata folder. Furthermore I added the
> > 'TESSDATA_PREFIX' variable to the path-variables pointing to
> > "Tesseract-OCR/tessdata".
> >
> > Now Tesseract works with Danish language from the CMD, but now I can't
> > make the code work in Java, not even with default settings (which I
> > could before). Am I missing something or just mixing some things up?
> >
> >
> >
> > -----Original Message-----
> > From: Tim Allison <ta...@apache.org>
> > Sent: 26. oktober 2018 19:58
> > To: solr-user@lucene.apache.org
> > Subject: Re: Tesseract language
> >
> > Tika relies on you to install tesseract and all the language libraries
> > you'll need.
> >
> > If you can successfully call `tesseract testing/eurotext.png
> > testing/eurotext-dan -l dan`, Tika _should_ be able to specify "dan"
> > with your code above.
> > On Fri, Oct 26, 2018 at 10:49 AM Martin Frank Hansen (MHQ)
> > <MH...@kmd.dk>
> > wrote:
> > >
> > > Hi again,
> > >
> > > Now I moved the OCR part to Tika, but I still can't make it work
> > > with Danish. It works when using default language settings and it seems
> > > like Tika is missing Danish dictionary.
> > >
> > > My java code looks like this:
> > >
> > > {
> > >             File file = new File(pathfilename);
> > >
> > >             Metadata meta = new Metadata();
> > >
> > >             InputStream stream = TikaInputStream.get(file);
> > >
> > >             Parser parser = new AutoDetectParser();
> > >             BodyContentHandler handler = new BodyContentHandler(Integer.MAX_VALUE);
> > >
> > >             TesseractOCRConfig config = new TesseractOCRConfig();
> > >             config.setLanguage("dan"); // code works if this phrase is commented out.
> > >
> > >             ParseContext parseContext = new ParseContext();
> > >
> > >              parseContext.set(TesseractOCRConfig.class, config);
> > >
> > >             parser.parse(stream, handler, meta, parseContext);
> > >             System.out.println(handler.toString());
> > > }
> > >
> > > Hope that someone can help here.
> > >
> > > -----Original Message-----
> > > From: Martin Frank Hansen (MHQ) <MH...@kmd.dk>
> > > Sent: 22. oktober 2018 07:58
> > > To: solr-user@lucene.apache.org
> > > Subject: SV: Tesseract language
> > >
> > > Hi Erick,
> > >
> > > Thanks for the help! I will take a look at it.
> > >
> > >
> > > Martin Frank Hansen, Senior Data Analytiker
> > >
> > > Data, IM & Analytics
> > >
> > >
> > >
> > > Lautrupparken 40-42, DK-2750 Ballerup E-mail mhq@kmd.dk  Web
> > > www.kmd.dk Mobil +4525571418
> > >
> > > -----Oprindelig meddelelse-----
> > > Fra: Erick Erickson <er...@gmail.com>
> > > Sendt: 21. oktober 2018 22:49
> > > Til: solr-user <so...@lucene.apache.org>
> > > Emne: Re: Tesseract language
> > >
> > > Here's a skeletal program that uses Tika in a stand-alone client.
> > > Rip the RDBMS parts out....
> > >
> > > https://lucidworks.com/2012/02/14/indexing-with-solrj/
> > > On Sun, Oct 21, 2018 at 1:13 PM Alexandre Rafalovitch <
> > arafalov@gmail.com> wrote:
> > > >
> > > > Usually, we just say to do a custom solution using the SolrJ client to
> > > > connect. This gives you maximum flexibility and allows you to
> > > > integrate Tika either inside your code or as a server. The latest Tika
> > > > actually has some off-thread handling, I believe, to make it safer to
> > > > embed.
> > > >
> > > > For DIH alternatives, if you want configuration over custom code,
> > > > you could look at something like Apache NiFi. It can push data
> > > > into Solr.
> > > > Obviously it is a bigger solution, but it is correspondingly more
> > > > robust too.
> > > >
> > > > Regards,
> > > >    Alex.
> > > > On Sun, 21 Oct 2018 at 11:07, Martin Frank Hansen (MHQ)
> > > > <MH...@kmd.dk>
> > wrote:
> > > > >
> > > > > Hi Alexandre,
> > > > >
> > > > > Thanks for your reply.
> > > > >
> > > > > Yes, right now it is just for testing the possibilities of Solr
> > > > > and Tesseract.
> > > > >
> > > > > I will take a look at the Tika documentation to see if I can
> > > > > make it work.
> > > > >
> > > > > You said that DIH is not recommended for production usage. What
> > > > > are the recommended methods to upload data to a Solr instance?
> > > > >
> > > > > Best regards
> > > > >
> > > > > Martin Frank Hansen
> > > > >
> > > > > -----Oprindelig meddelelse-----
> > > > > Fra: Alexandre Rafalovitch <ar...@gmail.com>
> > > > > Sendt: 21. oktober 2018 16:26
> > > > > Til: solr-user <so...@lucene.apache.org>
> > > > > Emne: Re: Tesseract language
> > > > >
> > > > > There are a couple of things mixed in here:
> > > > > 1) The extract handler is not recommended for production usage. It
> > > > > is great for a quick test, just like you did, but when going to
> > > > > production, running Tika externally is better. Tika, especially with
> > > > > large files, can use up a lot of memory and trip up the Solr
> > > > > instance it is running within.
> > > > > 2) If you are still just testing, you can configure Tika within
> > > > > Solr by specifying a parseContext.config file, as shown at the link
> > > > > and described further down in the same document:
> > > > > https://lucene.apache.org/solr/guide/7_5/uploading-data-with-solr-cell-using-apache-tika.html#configuring-the-solr-extractingrequesthandler
> > > > > You still need to check the Tika documentation to see whether
> > > > > Tesseract can take its configuration from the parseContext file.
> > > > > 3) If you are still testing with multiple files, the Data Import
> > > > > Handler can iterate through files and then, as a nested entity, feed
> > > > > them to the Tika processor for further extraction. I think one of
> > > > > the examples shows that.
> > > > > However, I am not sure you can pass parseContext that way, and
> > > > > DIH is also not recommended for production.
> > > > >
> > > > > I hope this helps,
> > > > >     Alex.
> > > > >
> > > > > On Sun, 21 Oct 2018 at 09:24, Martin Frank Hansen (MHQ)
> > > > > <MH...@kmd.dk>
> > wrote:
> > > > >
> > > > > > Hi again,
> > > > > >
> > > > > >
> > > > > >
> > > > > > Is there anyone who has some experience of using Tesseract’s
> > > > > > OCR module within Solr? The files I am trying to read into
> > > > > > Solr are Danish TIFF documents.
> > > > > >
> > > > > >
> > > > > >
> > > > > >
> > > > > >
> > > > > > *Martin Frank Hansen*, Senior Data Analytiker
> > > > > >
> > > > > > Data, IM & Analytics
> > > > > >
> > > > > > [image: cid:image001.png@01D383C9.6C129A60]
> > > > > >
> > > > > >
> > > > > > Lautrupparken 40-42, DK-2750 Ballerup E-mail mhq@kmd.dk  Web
> > > > > > www.kmd.dk Mobil +4525571418
> > > > > >
> > > > > >
> > > > > >
> > > > > > *Fra:* Martin Frank Hansen (MHQ) <MH...@kmd.dk>
> > > > > > *Sendt:* 18. oktober 2018 13:30
> > > > > > *Til:* solr-user@lucene.apache.org
> > > > > > *Emne:* Tesseract language
> > > > > >
> > > > > >
> > > > > >
> > > > > > Hi,
> > > > > >
> > > > > > I have been trying to use Tesseract through the
> > > > > > data-import-handler in Solr and it actually works very well –
> > > > > > with English. As the documents are in Danish, I need to change
> > > > > > the language setting in Tesseract to Danish as well, is that
> > > > > > possible from Solr?
> > > > > >
> > > > > >
> > > > > >
> > > > > > I was using the update/extract-handler to import single files
> > > > > > into Solr, and it worked for a single file, how would I
> > > > > > implement several files from a file-system?
> > > > > >
> > > > > >
> > > > > >
> > > > > > Here is the request-handler I used:
> > > > > >
> > > > > >
> > > > > >
> > > > > > <requestHandler name="/update/extract"
> > > > > >                 startup="lazy"
> > > > > >                 class="solr.extraction.ExtractingRequestHandler">
> > > > > >   <lst name="defaults">
> > > > > >     <str name="lowernames">false</str>
> > > > > >     <str name="uprefix">ignored_</str>
> > > > > >     <str name="captureAttr">true</str>
> > > > > >   </lst>
> > > > > > </requestHandler>
> > > > > >
> >
> --
>
> *Regards,Rohan Kasat*
>
-- 

*Regards,Rohan Kasat*

RE: Tesseract language

Posted by "Martin Frank Hansen (MHQ)" <MH...@kmd.dk>.
Hi Rohan,

Thanks for your reply. Are you using tess4j with Tika or on its own? I will take a look at tess4j if I can't make it work with Tika alone.

Best regards
Martin


-----Original Message-----
From: Rohan Kasat <ro...@gmail.com>
Sent: 26. oktober 2018 21:45
To: solr-user@lucene.apache.org
Subject: Re: Tesseract language

Hi Martin,

Are you using it for image formats? I think you can try tess4j and give TESSDATA_PREFIX as the home for the Tesseract configs.

I have tried it and it works pretty well on my local machine.

I used Java 8 and Tesseract 3 for this.

Regards,
Rohan Kasat

On Fri, Oct 26, 2018 at 12:31 PM Martin Frank Hansen (MHQ) <MH...@kmd.dk>
wrote:

> Hi Tim,
>
> You were right.
>
> When I called `tesseract testing/eurotext.png testing/eurotext-dan -l
> dan`, I got an error message so I downloaded "dan.traineddata" and
> added it to the Tesseract-OCR/tessdata folder. Furthermore I added the
> 'TESSDATA_PREFIX' variable to the path-variables pointing to
> "Tesseract-OCR/tessdata".
>
> Now Tesseract works with Danish language from the CMD, but now I can't
> make the code work in Java, not even with default settings (which I
> could before). Am I missing something or just mixing some things up?
>
>
>
> -----Original Message-----
> From: Tim Allison <ta...@apache.org>
> Sent: 26. oktober 2018 19:58
> To: solr-user@lucene.apache.org
> Subject: Re: Tesseract language
>
> Tika relies on you to install tesseract and all the language libraries
> you'll need.
>
> If you can successfully call `tesseract testing/eurotext.png
> testing/eurotext-dan -l dan`, Tika _should_ be able to specify "dan"
> with your code above.
> On Fri, Oct 26, 2018 at 10:49 AM Martin Frank Hansen (MHQ)
> <MH...@kmd.dk>
> wrote:
> >
> > Hi again,
> >
> > Now I moved the OCR part to Tika, but I still can't make it work
> > with Danish. It works when using default language settings and it seems
> > like Tika is missing Danish dictionary.
> >
> > My java code looks like this:
> >
> > {
> >             File file = new File(pathfilename);
> >
> >             Metadata meta = new Metadata();
> >
> >             InputStream stream = TikaInputStream.get(file);
> >
> >             Parser parser = new AutoDetectParser();
> >             BodyContentHandler handler = new BodyContentHandler(Integer.MAX_VALUE);
> >
> >             TesseractOCRConfig config = new TesseractOCRConfig();
> >             config.setLanguage("dan"); // code works if this phrase is commented out.
> >
> >             ParseContext parseContext = new ParseContext();
> >
> >              parseContext.set(TesseractOCRConfig.class, config);
> >
> >             parser.parse(stream, handler, meta, parseContext);
> >             System.out.println(handler.toString());
> > }
> >
> > Hope that someone can help here.
> >
> > -----Original Message-----
> > From: Martin Frank Hansen (MHQ) <MH...@kmd.dk>
> > Sent: 22. oktober 2018 07:58
> > To: solr-user@lucene.apache.org
> > Subject: SV: Tesseract language
> >
> > Hi Erick,
> >
> > Thanks for the help! I will take a look at it.
> >
> >
> > Martin Frank Hansen, Senior Data Analytiker
> >
> > Data, IM & Analytics
> >
> >
> >
> > Lautrupparken 40-42, DK-2750 Ballerup E-mail mhq@kmd.dk  Web
> > www.kmd.dk Mobil +4525571418
> >
> > -----Oprindelig meddelelse-----
> > Fra: Erick Erickson <er...@gmail.com>
> > Sendt: 21. oktober 2018 22:49
> > Til: solr-user <so...@lucene.apache.org>
> > Emne: Re: Tesseract language
> >
> > Here's a skeletal program that uses Tika in a stand-alone client.
> > Rip the RDBMS parts out....
> >
> > https://lucidworks.com/2012/02/14/indexing-with-solrj/
> > On Sun, Oct 21, 2018 at 1:13 PM Alexandre Rafalovitch <
> arafalov@gmail.com> wrote:
> > >
> > > Usually, we just say to do a custom solution using the SolrJ client to
> > > connect. This gives you maximum flexibility and allows you to
> > > integrate Tika either inside your code or as a server. The latest Tika
> > > actually has some off-thread handling, I believe, to make it safer to
> > > embed.
> > >
> > > For DIH alternatives, if you want configuration over custom code,
> > > you could look at something like Apache NiFi. It can push data
> > > into Solr.
> > > Obviously it is a bigger solution, but it is correspondingly more
> > > robust too.
> > >
> > > Regards,
> > >    Alex.
> > > On Sun, 21 Oct 2018 at 11:07, Martin Frank Hansen (MHQ)
> > > <MH...@kmd.dk>
> wrote:
> > > >
> > > > Hi Alexandre,
> > > >
> > > > Thanks for your reply.
> > > >
> > > > Yes, right now it is just for testing the possibilities of Solr
> > > > and Tesseract.
> > > >
> > > > I will take a look at the Tika documentation to see if I can
> > > > make it work.
> > > >
> > > > You said that DIH is not recommended for production usage. What
> > > > are the recommended methods to upload data to a Solr instance?
> > > >
> > > > Best regards
> > > >
> > > > Martin Frank Hansen
> > > >
> > > > -----Oprindelig meddelelse-----
> > > > Fra: Alexandre Rafalovitch <ar...@gmail.com>
> > > > Sendt: 21. oktober 2018 16:26
> > > > Til: solr-user <so...@lucene.apache.org>
> > > > Emne: Re: Tesseract language
> > > >
> > > > There are a couple of things mixed in here:
> > > > 1) The extract handler is not recommended for production usage. It
> > > > is great for a quick test, just like you did, but when going to
> > > > production, running Tika externally is better. Tika, especially with
> > > > large files, can use up a lot of memory and trip up the Solr
> > > > instance it is running within.
> > > > 2) If you are still just testing, you can configure Tika within
> > > > Solr by specifying a parseContext.config file, as shown at the link
> > > > and described further down in the same document:
> > > > https://lucene.apache.org/solr/guide/7_5/uploading-data-with-solr-cell-using-apache-tika.html#configuring-the-solr-extractingrequesthandler
> > > > You still need to check the Tika documentation to see whether
> > > > Tesseract can take its configuration from the parseContext file.
> > > > 3) If you are still testing with multiple files, the Data Import
> > > > Handler can iterate through files and then, as a nested entity, feed
> > > > them to the Tika processor for further extraction. I think one of
> > > > the examples shows that.
> > > > However, I am not sure you can pass parseContext that way, and
> > > > DIH is also not recommended for production.
> > > >
> > > > I hope this helps,
> > > >     Alex.
> > > >
> > > > On Sun, 21 Oct 2018 at 09:24, Martin Frank Hansen (MHQ)
> > > > <MH...@kmd.dk>
> wrote:
> > > >
> > > > > Hi again,
> > > > >
> > > > >
> > > > >
> > > > > Is there anyone who has some experience of using Tesseract’s
> > > > > OCR module within Solr? The files I am trying to read into
> > > > > Solr are Danish TIFF documents.
> > > > >
> > > > >
> > > > >
> > > > >
> > > > >
> > > > > *Martin Frank Hansen*, Senior Data Analytiker
> > > > >
> > > > > Data, IM & Analytics
> > > > >
> > > > > [image: cid:image001.png@01D383C9.6C129A60]
> > > > >
> > > > >
> > > > > Lautrupparken 40-42, DK-2750 Ballerup E-mail mhq@kmd.dk  Web
> > > > > www.kmd.dk Mobil +4525571418
> > > > >
> > > > >
> > > > >
> > > > > *Fra:* Martin Frank Hansen (MHQ) <MH...@kmd.dk>
> > > > > *Sendt:* 18. oktober 2018 13:30
> > > > > *Til:* solr-user@lucene.apache.org
> > > > > *Emne:* Tesseract language
> > > > >
> > > > >
> > > > >
> > > > > Hi,
> > > > >
> > > > > I have been trying to use Tesseract through the
> > > > > data-import-handler in Solr and it actually works very well –
> > > > > with English. As the documents are in Danish, I need to change
> > > > > the language setting in Tesseract to Danish as well, is that
> > > > > possible from Solr?
> > > > >
> > > > >
> > > > >
> > > > > I was using the update/extract-handler to import single files
> > > > > into Solr, and it worked for a single file, how would I
> > > > > implement several files from a file-system?
> > > > >
> > > > >
> > > > >
> > > > > Here is the request-handler I used:
> > > > >
> > > > >
> > > > >
> > > > > <requestHandler name="/update/extract"
> > > > >                 startup="lazy"
> > > > >                 class="solr.extraction.ExtractingRequestHandler">
> > > > >   <lst name="defaults">
> > > > >     <str name="lowernames">false</str>
> > > > >     <str name="uprefix">ignored_</str>
> > > > >     <str name="captureAttr">true</str>
> > > > >   </lst>
> > > > > </requestHandler>
> > > > >
>
--

*Regards,Rohan Kasat*

Re: Tesseract language

Posted by Rohan Kasat <ro...@gmail.com>.
Hi Martin,

Are you using it for image formats? I think you can try tess4j and give
TESSDATA_PREFIX as the home for the Tesseract configs.

I have tried it and it works pretty well on my local machine.

I used Java 8 and Tesseract 3 for this.
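
A quick way to see whether the JVM finds the language file under TESSDATA_PREFIX is a check like the one below. This is only a sketch under assumptions: the class and method names are made up, and it looks in two places because, as far as I know, Tesseract 3.x expects TESSDATA_PREFIX to be the parent of the tessdata folder while 4.x expects the tessdata folder itself. Please verify that against the documentation for your version.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Optional;

public class TessdataCheck {
    /**
     * Looks for <lang>.traineddata directly under prefix, then under
     * prefix/tessdata. Both locations are checked because the expected
     * layout is assumed to differ between Tesseract versions.
     */
    public static Optional<Path> findTrainedData(Path prefix, String lang) {
        String fileName = lang + ".traineddata";
        Path direct = prefix.resolve(fileName);
        if (Files.isRegularFile(direct)) {
            return Optional.of(direct);
        }
        Path nested = prefix.resolve("tessdata").resolve(fileName);
        if (Files.isRegularFile(nested)) {
            return Optional.of(nested);
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        // The variable must be visible to this JVM, not only to a CMD window.
        String prefix = System.getenv("TESSDATA_PREFIX");
        if (prefix == null) {
            System.out.println("TESSDATA_PREFIX is not set for this JVM");
            return;
        }
        Optional<Path> found = findTrainedData(Paths.get(prefix), "dan");
        System.out.println(found.map(p -> "Found " + p)
                .orElse("dan.traineddata not found under " + prefix));
    }
}
```

If the file only turns up in one of the two locations, that may hint at a mismatch between the TESSDATA_PREFIX value and what the installed Tesseract version expects.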

Regards,
Rohan Kasat

On Fri, Oct 26, 2018 at 12:31 PM Martin Frank Hansen (MHQ) <MH...@kmd.dk>
wrote:

> Hi Tim,
>
> You were right.
>
> When I called `tesseract testing/eurotext.png testing/eurotext-dan -l
> dan`, I got an error message so I downloaded "dan.traineddata" and added it
> to the Tesseract-OCR/tessdata folder. Furthermore I added the
> 'TESSDATA_PREFIX' variable to the path-variables pointing to
> "Tesseract-OCR/tessdata".
>
> Now Tesseract works with Danish language from the CMD, but now I can't
> make the code work in Java, not even with default settings (which I could
> before). Am I missing something or just mixing some things up?
>
>
>
> -----Original Message-----
> From: Tim Allison <ta...@apache.org>
> Sent: 26. oktober 2018 19:58
> To: solr-user@lucene.apache.org
> Subject: Re: Tesseract language
>
> Tika relies on you to install tesseract and all the language libraries
> you'll need.
>
> If you can successfully call `tesseract testing/eurotext.png
> testing/eurotext-dan -l dan`, Tika _should_ be able to specify "dan"
> with your code above.
> On Fri, Oct 26, 2018 at 10:49 AM Martin Frank Hansen (MHQ) <MH...@kmd.dk>
> wrote:
> >
> > Hi again,
> >
> > Now I moved the OCR part to Tika, but I still can't make it work with
> > Danish. It works when using default language settings and it seems like
> > Tika is missing Danish dictionary.
> >
> > My java code looks like this:
> >
> > {
> >             File file = new File(pathfilename);
> >
> >             Metadata meta = new Metadata();
> >
> >             InputStream stream = TikaInputStream.get(file);
> >
> >             Parser parser = new AutoDetectParser();
> >             BodyContentHandler handler = new BodyContentHandler(Integer.MAX_VALUE);
> >
> >             TesseractOCRConfig config = new TesseractOCRConfig();
> >             config.setLanguage("dan"); // code works if this phrase is commented out.
> >
> >             ParseContext parseContext = new ParseContext();
> >
> >              parseContext.set(TesseractOCRConfig.class, config);
> >
> >             parser.parse(stream, handler, meta, parseContext);
> >             System.out.println(handler.toString());
> > }
> >
> > Hope that someone can help here.
> >
> > -----Original Message-----
> > From: Martin Frank Hansen (MHQ) <MH...@kmd.dk>
> > Sent: 22. oktober 2018 07:58
> > To: solr-user@lucene.apache.org
> > Subject: SV: Tesseract language
> >
> > Hi Erick,
> >
> > Thanks for the help! I will take a look at it.
> >
> >
> > Martin Frank Hansen, Senior Data Analytiker
> >
> > Data, IM & Analytics
> >
> >
> >
> > Lautrupparken 40-42, DK-2750 Ballerup
> > E-mail mhq@kmd.dk  Web www.kmd.dk
> > Mobil +4525571418
> >
> > -----Oprindelig meddelelse-----
> > Fra: Erick Erickson <er...@gmail.com>
> > Sendt: 21. oktober 2018 22:49
> > Til: solr-user <so...@lucene.apache.org>
> > Emne: Re: Tesseract language
> >
> > Here's a skeletal program that uses Tika in a stand-alone client.
> > Rip the RDBMS parts out....
> >
> > https://lucidworks.com/2012/02/14/indexing-with-solrj/
> > On Sun, Oct 21, 2018 at 1:13 PM Alexandre Rafalovitch <
> arafalov@gmail.com> wrote:
> > >
> > > Usually, we just say to do a custom solution using the SolrJ client to
> > > connect. This gives you maximum flexibility and allows you to integrate
> > > Tika either inside your code or as a server. The latest Tika actually
> > > has some off-thread handling, I believe, to make it safer to embed.
> > >
> > > For DIH alternatives, if you want configuration over custom code,
> > > you could look at something like Apache NiFi. It can push data into
> > > Solr.
> > > Obviously it is a bigger solution, but it is correspondingly more
> > > robust too.
> > >
> > > Regards,
> > >    Alex.
> > > On Sun, 21 Oct 2018 at 11:07, Martin Frank Hansen (MHQ) <MH...@kmd.dk>
> wrote:
> > > >
> > > > Hi Alexandre,
> > > >
> > > > Thanks for your reply.
> > > >
> > > > Yes, right now it is just for testing the possibilities of Solr and
> > > > Tesseract.
> > > >
> > > > I will take a look at the Tika documentation to see if I can make it
> > > > work.
> > > >
> > > > You said that DIH is not recommended for production usage. What are
> > > > the recommended methods to upload data to a Solr instance?
> > > >
> > > > Best regards
> > > >
> > > > Martin Frank Hansen
> > > >
> > > > -----Oprindelig meddelelse-----
> > > > Fra: Alexandre Rafalovitch <ar...@gmail.com>
> > > > Sendt: 21. oktober 2018 16:26
> > > > Til: solr-user <so...@lucene.apache.org>
> > > > Emne: Re: Tesseract language
> > > >
> > > > There are a couple of things mixed in here:
> > > > 1) The extract handler is not recommended for production usage. It
> > > > is great for a quick test, just like you did, but when going to
> > > > production, running Tika externally is better. Tika, especially with
> > > > large files, can use up a lot of memory and trip up the Solr
> > > > instance it is running within.
> > > > 2) If you are still just testing, you can configure Tika within
> > > > Solr by specifying a parseContext.config file, as shown at the link
> > > > and described further down in the same document:
> > > > https://lucene.apache.org/solr/guide/7_5/uploading-data-with-solr-cell-using-apache-tika.html#configuring-the-solr-extractingrequesthandler
> > > > You still need to check the Tika documentation to see whether
> > > > Tesseract can take its configuration from the parseContext file.
> > > > 3) If you are still testing with multiple files, the Data Import
> > > > Handler can iterate through files and then, as a nested entity, feed
> > > > them to the Tika processor for further extraction. I think one of
> > > > the examples shows that.
> > > > However, I am not sure you can pass parseContext that way, and
> > > > DIH is also not recommended for production.
> > > >
> > > > I hope this helps,
> > > >     Alex.
> > > >
> > > > On Sun, 21 Oct 2018 at 09:24, Martin Frank Hansen (MHQ) <MH...@kmd.dk>
> wrote:
> > > >
> > > > > Hi again,
> > > > >
> > > > >
> > > > >
> > > > > Is there anyone who has some experience of using Tesseract’s OCR
> > > > > module within Solr? The files I am trying to read into Solr are
> > > > > Danish TIFF documents.
> > > > >
> > > > >
> > > > >
> > > > >
> > > > >
> > > > > *Martin Frank Hansen*, Senior Data Analytiker
> > > > >
> > > > > Data, IM & Analytics
> > > > >
> > > > > [image: cid:image001.png@01D383C9.6C129A60]
> > > > >
> > > > >
> > > > > Lautrupparken 40-42, DK-2750 Ballerup E-mail mhq@kmd.dk  Web
> > > > > www.kmd.dk Mobil +4525571418
> > > > >
> > > > >
> > > > >
> > > > > *Fra:* Martin Frank Hansen (MHQ) <MH...@kmd.dk>
> > > > > *Sendt:* 18. oktober 2018 13:30
> > > > > *Til:* solr-user@lucene.apache.org
> > > > > *Emne:* Tesseract language
> > > > >
> > > > >
> > > > >
> > > > > Hi,
> > > > >
> > > > > I have been trying to use Tesseract through the
> > > > > data-import-handler in Solr and it actually works very well –
> > > > > with English. As the documents are in Danish, I need to change
> > > > > the language setting in Tesseract to Danish as well, is that
> > > > > possible from Solr?
> > > > >
> > > > >
> > > > >
> > > > > I was using the update/extract-handler to import single files
> > > > > into Solr, and it worked for a single file, how would I
> > > > > implement several files from a file-system?
> > > > >
> > > > >
> > > > >
> > > > > Here is the request-handler I used:
> > > > >
> > > > >
> > > > >
> > > > > <requestHandler name="/update/extract"
> > > > >                 startup="lazy"
> > > > >                 class="solr.extraction.ExtractingRequestHandler">
> > > > >   <lst name="defaults">
> > > > >     <str name="lowernames">false</str>
> > > > >     <str name="uprefix">ignored_</str>
> > > > >     <str name="captureAttr">true</str>
> > > > >   </lst>
> > > > > </requestHandler>
> > > > >
>
-- 

*Regards, Rohan Kasat*

RE: Tesseract language

Posted by "Martin Frank Hansen (MHQ)" <MH...@kmd.dk>.
Hi Tim,

You were right.

When I ran `tesseract testing/eurotext.png testing/eurotext-dan -l dan`, I got an error message, so I downloaded "dan.traineddata" and added it to the Tesseract-OCR/tessdata folder. I also added a 'TESSDATA_PREFIX' environment variable pointing to "Tesseract-OCR/tessdata".

Now Tesseract works with Danish from the command line, but my Java code no longer works, not even with the default settings (which worked before). Am I missing something, or am I mixing things up?
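
One way to narrow this down is to check what the Java process itself sees, since a JVM (or IDE) that was started before the variable was set will not pick it up. The sketch below is only an illustration and not part of Tika: the class name `TessdataCheck` and the helper `findTrainedData` are made up for this example. It looks for `dan.traineddata` both directly under `TESSDATA_PREFIX` and under a `tessdata` subfolder, on the assumption that Tesseract versions differ in which of the two they expect.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Optional;

public class TessdataCheck {

    /**
     * Hypothetical helper: returns the first existing <lang>.traineddata file
     * found under the given prefix, or an empty Optional if none is found.
     */
    static Optional<Path> findTrainedData(String prefix, String lang) {
        if (prefix == null || prefix.isEmpty()) {
            return Optional.empty();
        }
        String fileName = lang + ".traineddata";
        Path base = Paths.get(prefix);
        // Assumption: the prefix may point at the tessdata folder itself
        // or at its parent, so both locations are checked.
        Path[] candidates = {
            base.resolve(fileName),
            base.resolve("tessdata").resolve(fileName)
        };
        for (Path candidate : candidates) {
            if (Files.isRegularFile(candidate)) {
                return Optional.of(candidate);
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        // Reads the variable as seen by this JVM, which may differ from the shell.
        String prefix = System.getenv("TESSDATA_PREFIX");
        System.out.println("TESSDATA_PREFIX seen by the JVM: " + prefix);
        System.out.println("dan.traineddata: "
            + findTrainedData(prefix, "dan").map(Path::toString).orElse("not found"));
    }
}
```

If the variable prints as `null` here while it is set in the command prompt, restarting the IDE or terminal that launches Java is worth trying before changing any Tika code.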



-----Original Message-----
From: Tim Allison <ta...@apache.org>
Sent: 26 October 2018 19:58
To: solr-user@lucene.apache.org
Subject: Re: Tesseract language

Tika relies on you to install tesseract and all the language libraries you'll need.

If you can successfully call `tesseract testing/eurotext.png testing/eurotext-dan -l dan`, Tika _should_ be able to specify "dan"
with your code above.

Re: Tesseract language

Posted by Tim Allison <ta...@apache.org>.
Tika relies on you to install tesseract and all the language libraries
you'll need.

If you can successfully call `tesseract testing/eurotext.png
testing/eurotext-dan -l dan`, Tika _should_ be able to specify "dan"
with your code above.
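
Tika's Tesseract parser runs the `tesseract` executable as an external process, so the same call can be tried from Java to confirm that the JVM finds both the executable and the Danish data. A minimal sketch, assuming `tesseract` is on the PATH of the Java process; the class and method names are hypothetical, and the file paths are the ones from the command above:

```java
import java.io.IOException;
import java.util.Arrays;
import java.util.List;

public class TesseractCliCheck {

    /** Builds the same command line as the manual test: tesseract <image> <outputBase> -l <lang>. */
    static List<String> buildCommand(String image, String outputBase, String lang) {
        return Arrays.asList("tesseract", image, outputBase, "-l", lang);
    }

    public static void main(String[] args) throws IOException, InterruptedException {
        List<String> command =
            buildCommand("testing/eurotext.png", "testing/eurotext-dan", "dan");

        // inheritIO() shows Tesseract's own messages in this console.
        // start() throws an IOException if the executable cannot be found.
        Process process = new ProcessBuilder(command).inheritIO().start();
        int exitCode = process.waitFor();
        System.out.println("tesseract exited with code " + exitCode);
    }
}
```

If this fails from Java while the same command works in the command prompt, the difference is most likely in the environment of the Java process rather than in the Tika configuration.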
On Fri, Oct 26, 2018 at 10:49 AM Martin Frank Hansen (MHQ) <MH...@kmd.dk> wrote:
>
> Hi again,
>
> Now I moved the OCR part to Tika, but I still can't make it work with Danish. It works when using default language settings and it seems like Tika is missing Danish dictionary.
>
> My java code looks like this:
>
> {
>             File file = new File(pathfilename);
>
>             Metadata meta = new Metadata();
>
>             InputStream stream = TikaInputStream.get(file);
>
>             Parser parser = new AutoDetectParser();
>             BodyContentHandler handler = new BodyContentHandler(Integer.MAX_VALUE);
>
>             TesseractOCRConfig config = new TesseractOCRConfig();
>             config.setLanguage("dan"); // code works if this phrase is commented out.
>
>             ParseContext parseContext = new ParseContext();
>
>              parseContext.set(TesseractOCRConfig.class, config);
>
>             parser.parse(stream, handler, meta, parseContext);
>             System.out.println(handler.toString());
> }
>
> Hope that someone can help here.

RE: Tesseract language

Posted by "Martin Frank Hansen (MHQ)" <MH...@kmd.dk>.
Hi again,

I have now moved the OCR part to Tika, but I still can't make it work with Danish. It works with the default language settings, so it seems that Tika is missing the Danish dictionary.

My Java code looks like this:

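import java.io.File;
import java.io.InputStream;

import org.apache.tika.io.TikaInputStream;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.apache.tika.parser.ocr.TesseractOCRConfig;
import org.apache.tika.sax.BodyContentHandler;

{
            File file = new File(pathfilename);
            Metadata meta = new Metadata();

            Parser parser = new AutoDetectParser();
            BodyContentHandler handler = new BodyContentHandler(Integer.MAX_VALUE);

            // Ask Tesseract to use the Danish language data.
            TesseractOCRConfig config = new TesseractOCRConfig();
            config.setLanguage("dan"); // code works if this line is commented out.

            // Hand the OCR settings to the parser through the ParseContext.
            ParseContext parseContext = new ParseContext();
            parseContext.set(TesseractOCRConfig.class, config);

            // try-with-resources closes the stream once parsing is done.
            try (InputStream stream = TikaInputStream.get(file)) {
                parser.parse(stream, handler, meta, parseContext);
            }
            System.out.println(handler.toString());
}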

Hope that someone can help here.

-----Original Message-----
From: Martin Frank Hansen (MHQ) <MH...@kmd.dk>
Sent: 22 October 2018 07:58
To: solr-user@lucene.apache.org
Subject: SV: Tesseract language

Hi Erick,

Thanks for the help! I will take a look at it.



SV: Tesseract language

Posted by "Martin Frank Hansen (MHQ)" <MH...@kmd.dk>.
Hi Erick,

Thanks for the help! I will take a look at it.


Martin Frank Hansen, Senior Data Analytiker

Data, IM & Analytics



Lautrupparken 40-42, DK-2750 Ballerup
E-mail mhq@kmd.dk  Web www.kmd.dk
Mobil +4525571418

-----Original Message-----
From: Erick Erickson <er...@gmail.com>
Sent: 21 October 2018 22:49
To: solr-user <so...@lucene.apache.org>
Subject: Re: Tesseract language

Here's a skeletal program that uses Tika in a stand-alone client. Rip the RDBMS parts out....

https://lucidworks.com/2012/02/14/indexing-with-solrj/
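
For the "several files from a file-system" part of the question, the file iteration itself needs nothing beyond the JDK. The sketch below only collects TIFF files under a root folder. The class name `TiffFileWalker` is hypothetical, and the Tika parsing and SolrJ indexing that would go inside the loop are left as a comment rather than guessed at.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;
import java.util.Locale;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class TiffFileWalker {

    /** Recursively collects regular files ending in .tif or .tiff (case-insensitive). */
    static List<Path> findTiffFiles(Path root) throws IOException {
        // Files.walk keeps directory handles open, so the stream is closed explicitly.
        try (Stream<Path> paths = Files.walk(root)) {
            return paths
                .filter(Files::isRegularFile)
                .filter(p -> {
                    String name = p.getFileName().toString().toLowerCase(Locale.ROOT);
                    return name.endsWith(".tif") || name.endsWith(".tiff");
                })
                .sorted()
                .collect(Collectors.toList());
        }
    }

    public static void main(String[] args) throws IOException {
        // The root folder is taken from the first argument, defaulting to the current folder.
        Path root = Paths.get(args.length > 0 ? args[0] : ".");
        for (Path file : findTiffFiles(root)) {
            // Placeholder: parse the file with Tika and send the result to Solr via SolrJ here.
            System.out.println(file);
        }
    }
}
```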
On Sun, Oct 21, 2018 at 1:13 PM Alexandre Rafalovitch <ar...@gmail.com> wrote:
>
> Usually, we recommend building a custom solution that uses the SolrJ
> client to connect. This gives you maximum flexibility and lets you
> integrate Tika either inside your code or as a server. I believe the
> latest Tika also has some off-thread handling that makes it safer to embed.
>
> For DIH alternatives, if you prefer configuration over custom code, you
> could look at something like Apache NiFi. It can push data into Solr.
> It is obviously a bigger solution, but it is correspondingly more
> robust too.
>
> Regards,
>    Alex.
> On Sun, 21 Oct 2018 at 11:07, Martin Frank Hansen (MHQ) <MH...@kmd.dk> wrote:
> >
> > Hi Alexandre,
> >
> > Thanks for your reply.
> >
> > Yes, right now it is just for testing the possibilities of Solr and Tesseract.
> >
> > I will take a look at the Tika documentation to see if I can make it work.
> >
> > You said that DIH is not recommended for production use. What are the recommended methods for uploading data to a Solr instance?
> >
> > Best regards
> >
> > Martin Frank Hansen
> >
> > -----Original Message-----
> > From: Alexandre Rafalovitch <ar...@gmail.com>
> > Sent: 21 October 2018 16:26
> > To: solr-user <so...@lucene.apache.org>
> > Subject: Re: Tesseract language
> >
> > There are a couple of things mixed in here:
> > 1) The extract handler is not recommended for production use. It is great for a quick test, just as you did, but for production it is better to run Tika externally. Tika, especially with large files, can use a lot of memory and trip up the Solr instance it is running within.
> > 2) If you are still just testing, you can configure Tika within Solr by specifying a parseContext.config file, as shown at the link below and described further down in the same document:
> > https://lucene.apache.org/solr/guide/7_5/uploading-data-with-solr-cell-using-apache-tika.html#configuring-the-solr-extractingrequesthandler
> > You still need to check the Tika documentation to see whether Tesseract can take its configuration from the parseContext file.
> > 3) If you are still testing with multiple files, the Data Import Handler can iterate through files and then, as a nested entity, feed each one to the Tika processor for further extraction. I think one of the examples shows that.
> > However, I am not sure you can pass a parseContext that way, and DIH is also not recommended for production.
> >
> > I hope this helps,
> >     Alex.
> >
> > On Sun, 21 Oct 2018 at 09:24, Martin Frank Hansen (MHQ) <MH...@kmd.dk> wrote:
> >
> > > Hi again,
> > >
> > > Does anyone have experience using Tesseract’s OCR module within Solr?
> > > The files I am trying to read into Solr are Danish TIFF documents.
> > >
> > > Please note that this message may contain confidential
> > > information. If you have received this message by mistake, please
> > > inform the sender of the mistake by sending a reply, then delete
> > > the message from your system without making, distributing or retaining any copies of it.
> > > Although we believe that the message and any attachments are free
> > > from viruses and other errors that might affect the computer or
> > > it-system where it is received and read, the recipient opens the message at his or her own risk.
> > > We assume no responsibility for any loss or damage arising from
> > > the receipt or use of this message.
> > >

Re: Tesseract language

Posted by Erick Erickson <er...@gmail.com>.
Here's a skeletal program that uses Tika in a stand-alone client. Rip
the RDBMS parts out....

https://lucidworks.com/2012/02/14/indexing-with-solrj/

On Sun, Oct 21, 2018 at 1:13 PM Alexandre Rafalovitch
<ar...@gmail.com> wrote:
>
> Usually, we just say to do a custom solution using SolrJ client to
> connect. This gives you maximum flexibility and allows to integrate
> Tika either inside your code or as a server. Latest Tika actually has
> some off-thread handling I believe, to make it safer to embed.
>
> For DIH alternatives, if you want configuration over custom code, you
> could look at something like Apache NiFI. It can push data into Solr.
> Obviously it is a bigger solution, but it is correspondingly more
> robust too.
>
> Regards,
>    Alex.

Re: Tesseract language

Posted by Alexandre Rafalovitch <ar...@gmail.com>.
Usually, we just recommend writing a custom solution that uses the SolrJ
client to connect to Solr. This gives you maximum flexibility and lets you
integrate Tika either inside your own code or as a separate server. I
believe the latest Tika also has some off-thread handling that makes it
safer to embed.

For DIH alternatives, if you prefer configuration over custom code, you
could look at something like Apache NiFi. It can push data into Solr.
It is obviously a bigger solution, but it is correspondingly more
robust too.

Regards,
   Alex.
On Sun, 21 Oct 2018 at 11:07, Martin Frank Hansen (MHQ) <MH...@kmd.dk> wrote:
>
> Hi Alexandre,
>
> Thanks for your reply.
>
> Yes right now it is just for testing the possibilities of Solr and Tesseract.
>
> I will take a look at the Tika documentation to see if I can make it work.
>
> You said that DIH are not recommended for production usage, what is the recommended method(s) to upload data to a Solr instance?
>
> Best regards
>
> Martin Frank Hansen

Re: Tesseract language

Posted by "Martin Frank Hansen (MHQ)" <MH...@kmd.dk>.
Hi Alexandre,

Thanks for your reply.

Yes, right now it is just for testing the possibilities of Solr and Tesseract.

I will take a look at the Tika documentation to see if I can make it work.

You said that DIH is not recommended for production use. What is the recommended way to upload data to a Solr instance?

Best regards

Martin Frank Hansen

-----Original message-----
From: Alexandre Rafalovitch <ar...@gmail.com>
Sent: 21 October 2018 16:26
To: solr-user <so...@lucene.apache.org>
Subject: Re: Tesseract language

There are a couple of things mixed in here:
1) The extract handler is not recommended for production use. It is great for a quick test, just as you did, but for production it is better to run the extraction externally. Tika, especially with large files, can use up a lot of memory and trip up the Solr instance it is running within.
2) If you are still just testing, you can configure Tika within Solr by specifying a parseContext.config file, as shown at the link and described further down in the same document:
https://lucene.apache.org/solr/guide/7_5/uploading-data-with-solr-cell-using-apache-tika.html#configuring-the-solr-extractingrequesthandler
You still need to check the Tika documentation to see whether Tesseract can take its configuration from the parseContext file.
3) If you are still testing with multiple files, the Data Import Handler can iterate through the files and then, as a nested entity, feed each one to the Tika processor for further extraction. I think one of the examples shows that.
However, I am not sure you can pass the parseContext that way, and DIH is also not recommended for production.

I hope this helps,
    Alex.


Re: Tesseract language

Posted by Alexandre Rafalovitch <ar...@gmail.com>.
There are a couple of things mixed in here:
1) The extract handler is not recommended for production use. It is great
for a quick test, just as you did, but for production it is better to run
the extraction externally. Tika, especially with large files, can use up a
lot of memory and trip up the Solr instance it is running within.
2) If you are still just testing, you can configure Tika within Solr by
specifying a parseContext.config file, as shown at the link and described
further down in the same document:
https://lucene.apache.org/solr/guide/7_5/uploading-data-with-solr-cell-using-apache-tika.html#configuring-the-solr-extractingrequesthandler
You still need to check the Tika documentation to see whether Tesseract can
take its configuration from the parseContext file.
3) If you are still testing with multiple files, the Data Import Handler can
iterate through the files and then, as a nested entity, feed each one to the
Tika processor for further extraction. I think one of the examples shows that.
However, I am not sure you can pass the parseContext that way, and DIH is
also not recommended for production.
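
For illustration only, here is a rough sketch of what options 2 and 3
might look like as configuration. Several things in it are assumptions
rather than anything confirmed in this thread: the file name
parseContext.xml, the directory /data/tiffs, and the field names id and
text are placeholders, and the Tesseract entry assumes that Solr maps the
"language" property onto the setter of Tika's TesseractOCRConfig and that
the Danish ("dan") trained data is installed for Tesseract. Whether the
OCR parser actually picks this up still needs to be verified.

```xml
<!-- solrconfig.xml: the handler from the original question, with one
     extra default pointing at a parser configuration file
     (the file name is a placeholder) -->
<requestHandler name="/update/extract"
                startup="lazy"
                class="solr.extraction.ExtractingRequestHandler">
  <lst name="defaults">
    <str name="lowernames">false</str>
    <str name="uprefix">ignored_</str>
    <str name="captureAttr">true</str>
    <str name="parseContext.config">parseContext.xml</str>
  </lst>
</requestHandler>

<!-- parseContext.xml, placed in the core's conf directory.
     Assumption: the "language" property reaches TesseractOCRConfig. -->
<entries>
  <entry class="org.apache.tika.parser.ocr.TesseractOCRConfig"
         impl="org.apache.tika.parser.ocr.TesseractOCRConfig">
    <property name="language" value="dan"/>
  </entry>
</entries>
```

For several files, a DIH configuration along the lines of the Tika
example shipped with Solr could look roughly like this. It is for
testing only, and it does not show how (or whether) the parseContext
settings can be passed through:

```xml
<!-- DIH data config: the outer entity lists the files, the nested
     entity hands each file to Tika. Path and field names are
     placeholders. -->
<dataConfig>
  <dataSource type="BinFileDataSource"/>
  <document>
    <entity name="file" processor="FileListEntityProcessor" dataSource="null"
            baseDir="/data/tiffs" fileName=".*\.tiff?" recursive="true"
            rootEntity="false">
      <field column="fileAbsolutePath" name="id"/>
      <entity name="ocr" processor="TikaEntityProcessor"
              url="${file.fileAbsolutePath}" format="text">
        <field column="text" name="text"/>
      </entity>
    </entity>
  </document>
</dataConfig>
```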

I hope this helps,
    Alex.

On Sun, 21 Oct 2018 at 09:24, Martin Frank Hansen (MHQ) <MH...@kmd.dk> wrote:

> Hi again,
>
>
>
> Is there anyone who has some experience of using Tesseract’s OCR module
> within Solr? The files I am trying to read into Solr is Danish Tiff
> documents.

Re: Tesseract language

Posted by "Martin Frank Hansen (MHQ)" <MH...@kmd.dk>.
Hi again,

Does anyone have experience using Tesseract’s OCR module within Solr? The files I am trying to read into Solr are Danish TIFF documents.

