Posted to solr-user@lucene.apache.org by "Martin Frank Hansen (MHQ)" <MH...@kmd.dk> on 2018/10/25 09:54:55 UTC

Reading data using Tika to Solr

Hi,

I am trying to read the content of msg files using Tika and index it in Solr, but I am having problems with the OfficeParser. I keep getting a java.lang.NoClassDefFoundError for the OfficeParser, even though both tika-core and tika-parsers are included in the build path.


I am using Java with the following code:


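    import java.io.File;
    import java.io.IOException;
    import java.io.InputStream;

    import org.apache.tika.exception.TikaException;
    import org.apache.tika.io.TikaInputStream;
    import org.apache.tika.metadata.Metadata;
    import org.apache.tika.parser.ParseContext;
    import org.apache.tika.parser.Parser;
    import org.apache.tika.parser.microsoft.OfficeParser;
    import org.apache.tika.sax.BodyContentHandler;
    import org.xml.sax.SAXException;

    public static void main(final String[] args) throws IOException, SAXException, TikaException {
        // pathtofile is a placeholder for the path of the msg file
        processDocument(pathtofile);
    }

    private static void processDocument(String pathfilename) throws IOException, SAXException, TikaException {
        File file = new File(pathfilename);
        Metadata meta = new Metadata();
        BodyContentHandler handler = new BodyContentHandler();
        Parser parser = new OfficeParser();
        ParseContext context = new ParseContext();

        // try-with-resources closes the stream even if parsing fails
        try (InputStream input = TikaInputStream.get(file)) {
            parser.parse(input, handler, meta, context);
        }

        // extracted body text, followed by the metadata Tika collected
        String doccontent = handler.toString();
        System.out.println(doccontent);
        System.out.println(meta);
    }
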
In the buildpath I have the following dependencies:

[Attached image of the build path; not preserved in the archive]

Any help is appreciated.

Thanks in advance.

Best regards,

Martin Hansen


Protection of your personal data is important to us. Here you can read KMD’s Privacy Policy<http://www.kmd.net/Privacy-Policy> outlining how we process your personal data.

Please note that this message may contain confidential information. If you have received this message by mistake, please inform the sender of the mistake by sending a reply, then delete the message from your system without making, distributing or retaining any copies of it. Although we believe that the message and any attachments are free from viruses and other errors that might affect the computer or it-system where it is received and read, the recipient opens the message at his or her own risk. We assume no responsibility for any loss or damage arising from the receipt or use of this message.

Re: Reading data using Tika to Solr

Posted by Tim Allison <ta...@apache.org>.
IIRC, somewhere between 1.14 and now (1.19.1), we changed the default behavior
for the AutoDetectParser from skipping attachments to including attachments.

So, two options: 1) upgrade to 1.19.1 and use the AutoDetectParser or 2)
pass an AutoDetectParser via the ParseContext to be used for attachments.

If you’re wondering why you might upgrade to 1.19.1, look no further than:
https://tika.apache.org/security.html




RE: Reading data using Tika to Solr

Posted by "Martin Frank Hansen (MHQ)" <MH...@kmd.dk>.
Hi Tim,

Thanks again, I will update Tika and try it again.


Re: Reading data using Tika to Solr

Posted by Tim Allison <ta...@apache.org>.
Ha...emails passed in the ether.

As you saw, we added the RecursiveParserWrapper a while back into Tika so
no need to re-invent that wheel.  That’s my preferred method/format because
it maintains metadata from attachments and lets you know about exceptions
in embedded files. The legacy method concatenates contents, throws out
attachment metadata and silently swallows attachment exceptions.
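
To make the difference between the two approaches concrete, here is a toy model in plain Java. It is only a sketch: none of these types are Tika's, and `Doc`, `legacyConcatenate` and `recursive` are hypothetical names invented for illustration. It mimics the behavior described above, where the legacy style returns one concatenated string and drops failures, while the recursive style returns one metadata record per document, including a record for an attachment that failed to parse.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class AttachmentHandlingSketch {

    /** Hypothetical stand-in for a container document; not a Tika type. */
    static final class Doc {
        final String name;
        final String text;
        final boolean corrupt;
        final List<Doc> attachments = new ArrayList<>();

        Doc(String name, String text, boolean corrupt) {
            this.name = name;
            this.text = text;
            this.corrupt = corrupt;
        }
    }

    /** Legacy style: one big string, no per-attachment metadata, failures dropped. */
    static String legacyConcatenate(Doc doc) {
        StringBuilder sb = new StringBuilder();
        appendText(doc, sb);
        return sb.toString();
    }

    private static void appendText(Doc doc, StringBuilder sb) {
        if (doc.corrupt) {
            return; // the failure is silently swallowed
        }
        sb.append(doc.text).append('\n');
        for (Doc attachment : doc.attachments) {
            appendText(attachment, sb);
        }
    }

    /** Recursive style: one metadata record per document, failures recorded. */
    static List<Map<String, String>> recursive(Doc doc) {
        List<Map<String, String>> records = new ArrayList<>();
        collect(doc, "", records);
        return records;
    }

    private static void collect(Doc doc, String parentPath, List<Map<String, String>> records) {
        String path = parentPath + "/" + doc.name;
        Map<String, String> meta = new LinkedHashMap<>();
        meta.put("path", path);
        if (doc.corrupt) {
            meta.put("exception", "could not parse " + doc.name);
        } else {
            meta.put("content", doc.text);
        }
        records.add(meta);
        for (Doc attachment : doc.attachments) {
            collect(attachment, path, records);
        }
    }

    /** A mail with one readable and one unreadable attachment. */
    static Doc sample() {
        Doc mail = new Doc("mail.msg", "Hello", false);
        mail.attachments.add(new Doc("report.docx", "Report text", false));
        mail.attachments.add(new Doc("broken.xls", null, true));
        return mail;
    }

    public static void main(String[] args) {
        System.out.println(legacyConcatenate(sample()));
        for (Map<String, String> record : recursive(sample())) {
            System.out.println(record);
        }
    }
}
```

In the legacy output there is no trace of broken.xls at all, whereas the recursive output carries a record for it with the failure attached, which is the property Tim describes.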


RE: Reading data using Tika to Solr

Posted by "Martin Frank Hansen (MHQ)" <MH...@kmd.dk>.
Hi again,

Never mind, I managed to get the content of the msg-files as well, using the following link as inspiration: https://wiki.apache.org/tika/RecursiveMetadata

But thanks again for all your help!


RE: Reading data using Tika to Solr

Posted by "Martin Frank Hansen (MHQ)" <MH...@kmd.dk>.
Hi Tim,

It is msg files and I added tika-app-1.14.jar to the build path - and now it works 😊 But how do I get it to read the attachments as well?

-----Original Message-----
From: Tim Allison <ta...@apache.org>
Sent: 25. oktober 2018 21:57
To: solr-user@lucene.apache.org
Subject: Re: Reading data using Tika to Solr

If you’re processing actual msg (not eml), you’ll also need poi and poi-scratchpad and their dependencies, but then those msgs could have attachments, at which point you may as well just add tika-app. :D
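
A quick way to see which of these pieces is actually missing at run time is to ask the class loader directly. The sketch below uses only the JDK; `ClasspathCheck`, `isLoadable` and `missing` are hypothetical helper names, and the mapping of each class to a jar in the comments is my understanding of the Tika 1.x and POI layout, so treat it as an assumption to verify against your own jars.

```java
import java.util.ArrayList;
import java.util.List;

final class ClasspathCheck {

    /** Returns true if the class can be found and linked, without initializing it. */
    static boolean isLoadable(String className) {
        try {
            Class.forName(className, false, ClasspathCheck.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            // NoClassDefFoundError is a LinkageError: the class, or something it needs, is absent
            return false;
        }
    }

    /** Returns the subset of the given class names that cannot be loaded. */
    static List<String> missing(String... classNames) {
        List<String> result = new ArrayList<>();
        for (String name : classNames) {
            if (!isLoadable(name)) {
                result.add(name);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> missing = missing(
                "org.apache.tika.parser.microsoft.OfficeParser",   // assumed: tika-parsers
                "org.apache.poi.poifs.filesystem.POIFSFileSystem", // assumed: poi
                "org.apache.poi.hsmf.MAPIMessage");                // assumed: poi-scratchpad
        if (missing.isEmpty()) {
            System.out.println("All probed classes are loadable.");
        } else {
            System.out.println("Not loadable at run time: " + missing);
        }
    }
}
```

Running this with the same classpath as the failing program shows whether the problem is in tika-parsers itself or in one of the POI jars it depends on.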

On Thu, Oct 25, 2018 at 2:46 PM Martin Frank Hansen (MHQ) <MH...@kmd.dk>
wrote:

> Hi Erick and Tim,
>
> Thanks for your answers, I can see that my mail got messed up on the
> way through the server. It looked much more readable at my end 😉 The
> attachment simply included my build-path.
>
> @Erick I am compiling the program using Netbeans at the moment.
>
> I updated to tika-1.7 but that did not help, and I haven't tried maven
> yet but will probably have to give that a chance. I just find it a bit
> odd that I can see the dependencies are included in the jar files I
> added to the project, but I must be missing something?
>
> My buildpath looks as follows:
>
> Tika-parsers-1.4.jar
> Tika-core-1.4.jar
> Commons-io-2.5.jar
> Httpclient-4.5.3
> Httpcore-4.4.6.jar
> Httpmime-4.5.3.jar
> Slf4j-api1-7-24.jar
> Jcl-over--slf4j-1.7.24.jar
> Solr-cell-7.5.0.jar
> Solr-core-7.5.0.jar
> Solr-solrj-7.5.0.jar
> Noggit-0.8.jar
>
>
>
> -----Original Message-----
> From: Tim Allison <ta...@apache.org>
> Sent: 25. oktober 2018 20:21
> To: solr-user@lucene.apache.org
> Subject: Re: Reading data using Tika to Solr
>
> To follow up w Erick’s point, there are a bunch of transitive
> dependencies from tika-parsers. If you aren’t using maven or similar
> build system to grab the dependencies, it can be tricky to get it
> right. If you aren’t using maven, and you can afford the risks of jar
> hell, consider using tika-app or, better perhaps, tika-server.
>
> Stay tuned for SOLR-11721...
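
When jars are added by hand, it can help to check which classpath entry, if any, really contains a given class. The following is a minimal sketch using only the JDK; `JarLocator` and its methods are hypothetical names, and the two class names probed in `main` are just examples relevant to this thread.

```java
import java.io.File;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.jar.JarFile;
import java.util.regex.Pattern;

final class JarLocator {

    /** Converts "a.b.C" to the resource path "a/b/C.class". */
    static String toResourcePath(String className) {
        return className.replace('.', '/') + ".class";
    }

    /** Returns the classpath entries (jars or directories) that contain the class file. */
    static List<String> findOnClasspath(String className, String classpath) {
        String resource = toResourcePath(className);
        List<String> hits = new ArrayList<>();
        for (String entry : classpath.split(Pattern.quote(File.pathSeparator))) {
            if (entry.isEmpty()) {
                continue;
            }
            File file = new File(entry);
            if (file.isDirectory()) {
                if (new File(file, resource).isFile()) {
                    hits.add(entry);
                }
            } else if (file.isFile()) {
                try (JarFile jar = new JarFile(file)) {
                    if (jar.getEntry(resource) != null) {
                        hits.add(entry);
                    }
                } catch (IOException e) {
                    // not a readable jar or zip file; skip it
                }
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        String classpath = System.getProperty("java.class.path");
        String[] probes = {
            "org.apache.tika.parser.microsoft.OfficeParser",
            "org.apache.poi.hsmf.MAPIMessage"
        };
        for (String name : probes) {
            System.out.println(name + " -> " + findOnClasspath(name, classpath));
        }
    }
}
```

An empty list for a class means no entry on the run-time classpath provides it, which is the typical cause of a NoClassDefFoundError when transitive dependencies were not pulled in.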
>
> On Thu, Oct 25, 2018 at 1:08 PM Erick Erickson
> <er...@gmail.com>
> wrote:
>
> > Martin:
> >
> > The mail server is pretty aggressive about stripping attachments,
> > your png didn't come through. You might also get a more informed
> > answer on the Tika mailing list.
> >
> > That said (and remember I can't see your png so this may be a silly
> > question), how are you executing the program vs. compiling it? You
> > mentioned the "build path". I'm usually lazy and just execute it in
> > IntelliJ for development and have forgotten to set my classpath on
> > _numerous_ occasions when running it from a command line ;)
> >
> > Best,
> > Erick
> >
> > læses, åbnes den på modtagerens eget ansvar. Vi påtager os ikke
> > noget ansvar for tab og skade, som er opstået i forbindelse med at modtage og bruge e-mailen.
> > >
> > > Please note that this message may contain confidential information.
> > > If
> > you have received this message by mistake, please inform the sender
> > of the mistake by sending a reply, then delete the message from your
> > system without making, distributing or retaining any copies of it.
> > Although we believe that the message and any attachments are free
> > from viruses and other errors that might affect the computer or
> > it-system where it is received and read, the recipient opens the
> > message at his or
> her own risk.
> > We assume no responsibility for any loss or damage arising from the
> > receipt or use of this message.
> >
>

Re: Reading data using Tika to Solr

Posted by Tim Allison <ta...@apache.org>.
If you’re processing actual msg files (not eml), you’ll also need poi and
poi-scratchpad and their dependencies. But those msgs could have
attachments, at which point you may as well just add tika-app. :D
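
One quick way to see which of these jars is actually missing at run time is to probe for one class from each. The following is only a minimal sketch and not something posted in the thread: ClasspathProbe and findMissing are hypothetical names, and the three class names are assumptions, picked as representatives of tika-parsers, poi and poi-scratchpad in Tika 1.x-era releases.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class ClasspathProbe {

    /** Returns the class names that cannot be loaded from the runtime classpath. */
    static List<String> findMissing(List<String> classNames) {
        List<String> missing = new ArrayList<>();
        ClassLoader loader = ClasspathProbe.class.getClassLoader();
        for (String name : classNames) {
            try {
                // initialize=false: only locate and load the class, do not run static initializers
                Class.forName(name, false, loader);
            } catch (ClassNotFoundException | LinkageError e) {
                // LinkageError covers NoClassDefFoundError raised for a missing superclass
                missing.add(name);
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        // Assumed representative classes, one per jar discussed above
        List<String> required = Arrays.asList(
                "org.apache.tika.parser.microsoft.OfficeParser",   // tika-parsers
                "org.apache.poi.poifs.filesystem.POIFSFileSystem", // poi
                "org.apache.poi.hsmf.MAPIMessage");                // poi-scratchpad
        List<String> missing = findMissing(required);
        if (missing.isEmpty()) {
            System.out.println("All probed classes are loadable.");
        } else {
            missing.forEach(name -> System.out.println("Not loadable: " + name));
        }
    }
}
```

Run it with the same classpath as the real program; any name it prints points at a jar that probably still needs to be added.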


RE: Reading data using Tika to Solr

Posted by "Martin Frank Hansen (MHQ)" <MH...@kmd.dk>.
Hi Erick and Tim,

Thanks for your answers. I can see that my mail got garbled on its way through the server; it looked much more readable at my end 😉 The attachment simply showed my build path.

@Erick I am compiling the program using NetBeans at the moment.

I updated to tika-1.7 but that did not help. I haven't tried Maven yet, but will probably have to give it a chance. I just find it a bit odd: I can see that the dependencies are included in the jar files I added to the project, so I must be missing something?

My buildpath looks as follows:

tika-parsers-1.4.jar
tika-core-1.4.jar
commons-io-2.5.jar
httpclient-4.5.3.jar
httpcore-4.4.6.jar
httpmime-4.5.3.jar
slf4j-api-1.7.24.jar
jcl-over-slf4j-1.7.24.jar
solr-cell-7.5.0.jar
solr-core-7.5.0.jar
solr-solrj-7.5.0.jar
noggit-0.8.jar




Re: Reading data using Tika to Solr

Posted by Tim Allison <ta...@apache.org>.
To follow up on Erick’s point, tika-parsers pulls in a bunch of transitive
dependencies. If you aren’t using Maven or a similar build system to
grab them, it can be tricky to get the classpath right. If you aren’t
using Maven, and you can afford the risks of jar hell, consider using
tika-app or, perhaps better, tika-server.

Stay tuned for SOLR-11721...
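
For the tika-server route, the client needs no Tika or POI jars at all, since parsing happens in the server process. Here is a rough sketch, assuming a tika-server is already running locally on its default port 9998 and that Java 11+ is available for java.net.http. The TikaServerClient class and its method names are made up for illustration.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

final class TikaServerClient {

    /** Builds a PUT request that sends the document bytes to the server's /tika endpoint. */
    static HttpRequest buildRequest(String serverBase, byte[] document) {
        return HttpRequest.newBuilder(URI.create(serverBase + "/tika"))
                .header("Accept", "text/plain") // ask for plain extracted text
                .PUT(HttpRequest.BodyPublishers.ofByteArray(document))
                .build();
    }

    /** Sends the file to tika-server and returns the extracted text. */
    static String extractText(String serverBase, Path file)
            throws IOException, InterruptedException {
        HttpRequest request = buildRequest(serverBase, Files.readAllBytes(file));
        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        if (response.statusCode() != 200) {
            throw new IOException("tika-server returned HTTP " + response.statusCode());
        }
        return response.body();
    }

    public static void main(String[] args) throws Exception {
        // args[0]: path to the .msg file; the server URL is an assumed local default
        System.out.println(extractText("http://localhost:9998", Paths.get(args[0])));
    }
}
```

The returned text could then be indexed with SolrJ as usual, which keeps the parser dependencies out of the indexing program entirely.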


Re: Reading data using Tika to Solr

Posted by Erick Erickson <er...@gmail.com>.
Martin:

The mail server is pretty aggressive about stripping attachments; your
png didn't come through. You might also get a more informed answer on
the Tika mailing list.

That said (and remember I can't see your png so this may be a silly
question), how are you executing the program vs. compiling it? You
mentioned the "build path". I'm usually lazy and just execute it in
IntelliJ for development, and have forgotten to set my classpath on
_numerous_ occasions when running it from a command line ;)

Best,
Erick
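
To check the run-time side of this, the program can print the classpath the JVM was actually started with, which may differ from the IDE's build path. This is a minimal sketch: RuntimeClasspath and its helpers are hypothetical names, and the artifact prefixes in main are just the ones that come up in this thread.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.regex.Pattern;

final class RuntimeClasspath {

    /** Splits a classpath string into its entries using the platform path separator. */
    static List<String> entries(String classpath) {
        List<String> result = new ArrayList<>();
        for (String entry : classpath.split(Pattern.quote(File.pathSeparator))) {
            if (!entry.isEmpty()) {
                result.add(entry);
            }
        }
        return result;
    }

    /** True if some entry's file name starts with the given prefix, e.g. "tika-parsers". */
    static boolean containsArtifact(List<String> entries, String artifactPrefix) {
        String prefix = artifactPrefix.toLowerCase(Locale.ROOT);
        for (String entry : entries) {
            String fileName = new File(entry).getName().toLowerCase(Locale.ROOT);
            if (fileName.startsWith(prefix)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // java.class.path holds the classpath this JVM was launched with
        List<String> cp = entries(System.getProperty("java.class.path"));
        cp.forEach(System.out::println);
        for (String artifact : new String[] {"tika-core", "tika-parsers", "poi"}) {
            System.out.println(artifact + " on runtime classpath: " + containsArtifact(cp, artifact));
        }
    }
}
```

Dropping the body of main into the real program's startup shows exactly what it sees when launched from the IDE versus from a command line.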

On Thu, Oct 25, 2018 at 2:55 AM Martin Frank Hansen (MHQ) <MH...@kmd.dk> wrote:
>
> Hi,
>
> I am trying to read content of msg-files using Tika and index these in Solr, however I am having some problems with the OfficeParser(). I keep getting the error java.lang.NoClassDefFoundError for the OfficeParser, even though both tika-core and tika-parsers are included in the build path.
>
> I am using Java with the following code:
>
> public static void main(final String[] args) throws IOException, SAXException, TikaException {
>     processDocument(pathtofile);
> }
>
> private static void processDocument(String pathfilename) {
>     try {
>         File file = new File(pathfilename);
>         Metadata meta = new Metadata();
>         InputStream input = TikaInputStream.get(file);
>         BodyContentHandler handler = new BodyContentHandler();
>         Parser parser = new OfficeParser();
>         ParseContext context = new ParseContext();
>         parser.parse(input, handler, meta, context);
>         String doccontent = handler.toString();
>
>         System.out.println(doccontent);
>         System.out.println(meta);
>     } catch (IOException | SAXException | TikaException e) {
>         e.printStackTrace();
>     }
> }
>
> In the buildpath I have the following dependencies:
>
> Any help is appreciated.
>
> Thanks in advance.
>
> Best regards,
>
> Martin Hansen