Posted to solr-user@lucene.apache.org by "Martin Frank Hansen (MHQ)" <MH...@kmd.dk> on 2018/10/10 08:15:03 UTC

DIH for TikaEntityProcessor

Hi,

I am trying to read documents from a file system into Solr using the DataImportHandler (DIH), but I keep getting the following errors:

Exception while processing: files document : null:org.apache.solr.handler.dataimport.DataImportHandlerException: java.lang.ClassCastException: java.io.InputStreamReader cannot be cast to java.io.InputStream
         at org.apache.solr.handler.dataimport.DataImportHandlerException.wrapAndThrow(DataImportHandlerException.java:61)
         at org.apache.solr.handler.dataimport.EntityProcessorWrapper.nextRow(EntityProcessorWrapper.java:270)
         at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:476)
         at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:517)
         at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:415)
         at org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:330)
         at org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:233)
         at org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:424)
         at org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:483)
         at org.apache.solr.handler.dataimport.DataImporter.lambda$runAsync$0(DataImporter.java:466)
         at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.ClassCastException: java.io.InputStreamReader cannot be cast to java.io.InputStream
         at org.apache.solr.handler.dataimport.TikaEntityProcessor.nextRow(TikaEntityProcessor.java:132)
         at org.apache.solr.handler.dataimport.EntityProcessorWrapper.nextRow(EntityProcessorWrapper.java:267)
         ... 9 more



Full Import failed:java.lang.RuntimeException: java.lang.RuntimeException: org.apache.solr.handler.dataimport.DataImportHandlerException: java.lang.ClassCastException: java.io.InputStreamReader cannot be cast to java.io.InputStream
         at org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:271)
         at org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:424)
         at org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:483)
         at org.apache.solr.handler.dataimport.DataImporter.lambda$runAsync$0(DataImporter.java:466)
         at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.RuntimeException: org.apache.solr.handler.dataimport.DataImportHandlerException: java.lang.ClassCastException: java.io.InputStreamReader cannot be cast to java.io.InputStream
         at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:417)
         at org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:330)
         at org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:233)
         ... 4 more
Caused by: org.apache.solr.handler.dataimport.DataImportHandlerException: java.lang.ClassCastException: java.io.InputStreamReader cannot be cast to java.io.InputStream
         at org.apache.solr.handler.dataimport.DataImportHandlerException.wrapAndThrow(DataImportHandlerException.java:61)
         at org.apache.solr.handler.dataimport.EntityProcessorWrapper.nextRow(EntityProcessorWrapper.java:270)
         at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:476)
         at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:517)
         at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:415)
         ... 6 more
Caused by: java.lang.ClassCastException: java.io.InputStreamReader cannot be cast to java.io.InputStream
         at org.apache.solr.handler.dataimport.TikaEntityProcessor.nextRow(TikaEntityProcessor.java:132)
         at org.apache.solr.handler.dataimport.EntityProcessorWrapper.nextRow(EntityProcessorWrapper.java:267)
         ... 9 more


My data-config file looks as follows:

<dataConfig>
  <dataSource name="bin" type="BinFileDataSource" />
  <document>
      <entity name="files" processor="FileListEntityProcessor" baseDir="D:/CAPTIA/docs/19107" fileName=".*DOC" recursive="true" rootEntity="false" dataSource="bin" onError="skip">
        <field column="fileAbsolutePath" name="id" />

        <entity
         name="read_file"
         processor="TikaEntityProcessor"
         url="${files.fileAbsolutePath}"
         >
          <field column="text" name="content" />
        </entity>
      </entity>
  </document>
</dataConfig>

And in the Schema I basically have two fields:

<field name="Id" type="string" indexed="true" stored="true" required="true" multiValued="false"/>
<field name="text" type="text_general" indexed="true" stored="false" multiValued="true"/>

Any help is appreciated.


Martin Frank Hansen


Protection of your personal data is important to us. Here you can read KMD’s Privacy Policy<http://www.kmd.net/Privacy-Policy> outlining how we process your personal data.

Please note that this message may contain confidential information. If you have received this message by mistake, please inform the sender of the mistake by sending a reply, then delete the message from your system without making, distributing or retaining any copies of it. Although we believe that the message and any attachments are free from viruses and other errors that might affect the computer or it-system where it is received and read, the recipient opens the message at his or her own risk. We assume no responsibility for any loss or damage arising from the receipt or use of this message.

Fwd: DIH for TikaEntityProcessor

Posted by Oleg Tikhonov <ol...@gmail.com>.
---------- Forwarded message ---------
From: Martin Frank Hansen (MHQ) <MH...@kmd.dk>
Date: Wed, Oct 10, 2018, 11:15
Subject: DIH for TikaEntityProcessor
To: solr-user@lucene.apache.org <so...@lucene.apache.org>



SV: DIH for TikaEntityProcessor

Posted by "Martin Frank Hansen (MHQ)" <MH...@kmd.dk>.
Hi Kamuela,

Thanks for your answer.

I still get the same error, so I think I will try the techproducts example to see if it works there, as Alexendre suggested in the mail above.

Martin Frank Hansen,

-----Original Message-----
From: Kamuela Lau <ka...@gmail.com>
Sent: 12 October 2018 11:38
To: solr-user@lucene.apache.org
Subject: Re: DIH for TikaEntityProcessor

Hi,

I was unable to reproduce the error with the information provided.
Below are the data-config.xml and managed-schema fields I used. The data-config is mostly the same as yours (I think that FileListEntityProcessor doesn't actually require a dataSource, so it should be safe to put dataSource="null" on the files entity):

<dataConfig>
  <dataSource name="bin" type="BinFileDataSource"/>
  <document>
      <entity name="files" processor="FileListEntityProcessor"
              baseDir="/path/to/sampleData" fileName=".*doc" recursive="true"
              rootEntity="false" dataSource="bin" onError="skip">
        <field column="fileAbsolutePath" name="id"/>
        <entity name="read_file" processor="TikaEntityProcessor"
                url="${files.fileAbsolutePath}">
          <field column="text" name="text"/>
        </entity>
      </entity>
  </document>
</dataConfig>

And from the managed schema:
    <field name="id" type="string" indexed="true" stored="true" required="true" multiValued="false" />
    <!-- docValues are enabled by default for long type so we don't need to index the version field  -->
    <field name="_version_" type="plong" indexed="false" stored="false"/>
    <field name="_root_" type="string" indexed="true" stored="false" docValues="false" />
    <field name="text" type="text_general" indexed="true" stored="true" multiValued="true"/>

When I had field column="text" name="content", the documents were still indexed, but the text/content was not (as I had no content field in the schema).
I used the default config, and Solr version 7.5.0; I was able to import the data just fine (I also tested with .*DOC). Is there any other information you can provide that can help me reproduce this error?
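
To make the field-mapping point concrete: in a DIH <field> element, column is the value produced by the entity processor (here Tika's text column) and name is the target Solr field, which needs to exist in the schema. A minimal sketch of the two ways to line them up follows. The content field definition in option B is hypothetical and simply mirrors the text field above; it does not come from either posted schema.

```xml
<!-- Option A (data-config.xml): map the Tika "text" column to the existing "text" field -->
<field column="text" name="text"/>

<!-- Option B (managed-schema): keep name="content" in data-config.xml and declare
     a matching field. This definition is an assumption, copied from the "text" field. -->
<field name="content" type="text_general" indexed="true" stored="true" multiValued="true"/>
```

Either option should be enough on its own; which one fits depends on which field name the queries are expected to use.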




On Fri, Oct 12, 2018 at 4:11 PM Martin Frank Hansen (MHQ) <MH...@kmd.dk> wrote:

> Hi again,
>
> Can anybody help me? Any suggestions as to why I am getting the error below?
>

Re: DIH for TikaEntityProcessor

Posted by Kamuela Lau <ka...@gmail.com>.
Glad to help :)

On Fri, Oct 12, 2018 at 21:10, Martin Frank Hansen (MHQ) <MH...@kmd.dk> wrote:

> You sir just made my day!!!
>
> It worked!!! Thanks a million!
>
>
> Martin Frank Hansen,
>
> -----Original Message-----
> From: Kamuela Lau <ka...@gmail.com>
> Sent: 12 October 2018 11:41
> To: solr-user@lucene.apache.org
> Subject: Re: DIH for TikaEntityProcessor
>
> Also, just wondering, have you tried specifying dataSource="bin" for the
> read_file entity?
>
> On Fri, Oct 12, 2018 at 6:38 PM Kamuela Lau <ka...@gmail.com> wrote:
>
> > Hi,
> >
> > I was unable to reproduce the error that you got with the information
> > provided.
> > Below are the data-config.xml and managed-schema fields I used; the
> > data-config is mostly the same (I think that BinFileDataSource doesn't
> > actually require a dataSource, so I think it's safe to put
> > dataSource="null"):
> >
> > <dataConfig>
> >   <dataSource name="bin" type="BinFileDataSource"/>
> >   <document>
> >       <entity name="files" processor="FileListEntityProcessor"
> > baseDir="/path/to/sampleData" fileName=".*doc" recursive="true"
> > rootEntity="false" dataSource="bin" onError="skip">
> >         <field column="fileAbsolutePath" name="id"/>
> >         <entity name="read_file" processor="TikaEntityProcessor"
> > url="${files.fileAbsolutePath}">
> >           <field column="text" name="text"/>
> >         </entity>
> >       </entity>
> >   </document>
> > </dataConfig>
> >
> > And from the managed schema:
> >     <field name="id" type="string" indexed="true" stored="true"
> > required="true" multiValued="false" />
> >     <!-- docValues are enabled by default for long type so we don't
> > need to index the version field  -->
> >     <field name="_version_" type="plong" indexed="false" stored="false"/>
> >     <field name="_root_" type="string" indexed="true" stored="false"
> > docValues="false" />
> >     <field name="text" type="text_general" indexed="true" stored="true"
> > multiValued="true"/>
> >
> > When I had field column="text" name="content", the documents were
> > still indexed, but the text/content was not (as I had no content field
> > in the schema).
> > I used the default config, and Solr version 7.5.0; I was able to
> > import the data just fine (I also tested with .*DOC). Is there any
> > other information you can provide that can help me reproduce this error?
> >
> >
> >
> >
> > On Fri, Oct 12, 2018 at 4:11 PM Martin Frank Hansen (MHQ) <MH...@kmd.dk>
> > wrote:
> >
> >> Hi again,
> >>
> >>
> >>
> >> Can anybody help me? Any suggestions to why I am getting the error
> below?
> >>
> >>
> >>
> >>
> >>
> >> *Martin Frank Hansen*, Senior Data Analytiker
> >>
> >> Data, IM & Analytics
> >>
> >> [image: cid:image001.png@01D383C9.6C129A60]
> >>
> >>
> >> Lautrupparken 40-42, DK-2750 Ballerup E-mail mhq@kmd.dk  Web
> >> www.kmd.dk Mobil +4525571418
> >>
> >>
> >>
> >> *Fra:* Martin Frank Hansen (MHQ)
> >> *Sendt:* 10. oktober 2018 10:15
> >> *Til:* solr-user <so...@lucene.apache.org>
> >> *Emne:* DIH for TikaEntityProcessor
> >>
> >>
> >>
> >> Hi,
> >>
> >>
> >>
> >> I am trying to read documents from a file system into Solr, using
> >> dataimporthandler but keep getting the following errors:
> >>
> >>
> >>
> >> Exception while processing: files document :
> >> null:org.apache.solr.handler.dataimport.DataImportHandlerException:
> >> java.lang.ClassCastException: java.io.InputStreamReader cannot be
> >> cast to java.io.InputStream
> >>
> >>          at
> >> org.apache.solr.handler.dataimport.DataImportHandlerException.wrapAnd
> >> Throw(DataImportHandlerException.java:61)
> >>
> >>          at
> >> org.apache.solr.handler.dataimport.EntityProcessorWrapper.nextRow(Ent
> >> ityProcessorWrapper.java:270)
> >>
> >>          at
> >> org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilde
> >> r.java:476)
> >>
> >>          at
> >> org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilde
> >> r.java:517)
> >>
> >>          at
> >> org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilde
> >> r.java:415)
> >>
> >>          at
> >> org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.j
> >> ava:330)
> >>
> >>          at
> >> org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java
> >> :233)
> >>
> >>          at
> >> org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImpo
> >> rter.java:424)
> >>
> >>          at
> >> org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.j
> >> ava:483)
> >>
> >>          at
> >> org.apache.solr.handler.dataimport.DataImporter.lambda$runAsync$0(Dat
> >> aImporter.java:466)
> >>
> >>          at java.lang.Thread.run(Thread.java:748)
> >>
> >> Caused by: java.lang.ClassCastException: java.io.InputStreamReader
> >> cannot be cast to java.io.InputStream
> >>
> >>          at
> >> org.apache.solr.handler.dataimport.TikaEntityProcessor.nextRow(TikaEn
> >> tityProcessor.java:132)
> >>
> >>          at
> >> org.apache.solr.handler.dataimport.EntityProcessorWrapper.nextRow(Ent
> >> ityProcessorWrapper.java:267)
> >>
> >>          ... 9 more
> >>
> >>
> >>
> >>
> >>
> >>
> >>
> >>
> >>
> >> Full Import failed:java.lang.RuntimeException:
> >> java.lang.RuntimeException:
> >> org.apache.solr.handler.dataimport.DataImportHandlerException:
> >> java.lang.ClassCastException: java.io.InputStreamReader cannot be
> >> cast to java.io.InputStream
> >>
> >>          at
> >> org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java
> >> :271)
> >>
> >>          at
> >> org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImpo
> >> rter.java:424)
> >>
> >>          at
> >> org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.j
> >> ava:483)
> >>
> >>          at
> >> org.apache.solr.handler.dataimport.DataImporter.lambda$runAsync$0(Dat
> >> aImporter.java:466)
> >>
> >>          at java.lang.Thread.run(Thread.java:748)
> >>
> >> Caused by: java.lang.RuntimeException:
> >> org.apache.solr.handler.dataimport.DataImportHandlerException:
> >> java.lang.ClassCastException: java.io.InputStreamReader cannot be
> >> cast to java.io.InputStream
> >>
> >>          at
> >> org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilde
> >> r.java:417)
> >>
> >>          at
> >> org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.j
> >> ava:330)
> >>
> >>          at
> >> org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java
> >> :233)
> >>
> >>          ... 4 more
> >>
> >> Caused by:
> org.apache.solr.handler.dataimport.DataImportHandlerException:
> >> java.lang.ClassCastException: java.io.InputStreamReader cannot be
> >> cast to java.io.InputStream
> >>
> >>          at
> >> org.apache.solr.handler.dataimport.DataImportHandlerException.wrapAnd
> >> Throw(DataImportHandlerException.java:61)
> >>
> >>          at
> >> org.apache.solr.handler.dataimport.EntityProcessorWrapper.nextRow(Ent
> >> ityProcessorWrapper.java:270)
> >>
> >>          at
> >> org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilde
> >> r.java:476)
> >>
> >>          at
> >> org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilde
> >> r.java:517)
> >>
> >>          at
> >> org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilde
> >> r.java:415)
> >>
> >>          ... 6 more
> >>
> >> Caused by: java.lang.ClassCastException: java.io.InputStreamReader
> >> cannot be cast to java.io.InputStream
> >>
> >>          at
> >> org.apache.solr.handler.dataimport.TikaEntityProcessor.nextRow(TikaEn
> >> tityProcessor.java:132)
> >>
> >>          at
> >> org.apache.solr.handler.dataimport.EntityProcessorWrapper.nextRow(Ent
> >> ityProcessorWrapper.java:267)
> >>
> >>          ... 9 more
> >>
> >>
> >>
> >>
> >>
> >> My data-config file looks as follows:
> >>
> >> <dataConfig>
> >>   <dataSource name="bin" type="BinFileDataSource" />
> >>   <document>
> >>       <entity name="files" processor="FileListEntityProcessor" baseDir="D:/CAPTIA/docs/19107" fileName=".*DOC" recursive="true" rootEntity="false " dataSource="bin" onError="skip">
> >>         <field column="fileAbsolutePath" name="id" />
> >>         <entity name="read_file" processor="TikaEntityProcessor" url="${files.fileAbsolutePath}">
> >>           <field column="text" name="content" />
> >>         </entity>
> >>       </entity>
> >>   </document>
> >> </dataConfig>
> >>
> >> And in the Schema I basically have two fields:
> >>
> >> <field name="Id" type="string" indexed="true" stored="true" required="true" multiValued="false"/>
> >> <field name="text" type="text_general" indexed="true" stored="false" multiValued="true"/>
> >>
> >> Any help is appreciated.
> >>
> >> *Martin Frank Hansen*
> >>
> >>
> >
>

SV: DIH for TikaEntityProcessor

Posted by "Martin Frank Hansen (MHQ)" <MH...@kmd.dk>.
You sir just made my day!!!

It worked!!! Thanks a million!


Martin Frank Hansen,

-----Original message-----
From: Kamuela Lau <ka...@gmail.com>
Sent: 12 October 2018 11:41
To: solr-user@lucene.apache.org
Subject: Re: DIH for TikaEntityProcessor

Also, just wondering, have you tried to specify dataSource="bin" for read_file?


Re: DIH for TikaEntityProcessor

Posted by Kamuela Lau <ka...@gmail.com>.
Also, just wondering, have you tried to specify dataSource="bin" for
read_file?
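
For reference, a minimal sketch of what this suggestion would look like when applied to the data-config from the original post. The only change is the dataSource="bin" attribute on the inner read_file entity; everything else is copied from the thread. The reasoning is an assumption drawn from the stack trace: TikaEntityProcessor appears to expect a binary stream, and the ClassCastException suggests it was handed a Reader-based source instead. The original poster reported in a later reply that this change worked.

```xml
<dataConfig>
  <dataSource name="bin" type="BinFileDataSource"/>
  <document>
    <!-- Outer entity: only lists the files under baseDir -->
    <entity name="files" processor="FileListEntityProcessor"
            baseDir="D:/CAPTIA/docs/19107" fileName=".*DOC" recursive="true"
            rootEntity="false" dataSource="bin" onError="skip">
      <field column="fileAbsolutePath" name="id"/>
      <!-- Inner entity: explicitly name the binary data source for Tika -->
      <entity name="read_file" processor="TikaEntityProcessor"
              url="${files.fileAbsolutePath}" dataSource="bin">
        <field column="text" name="content"/>
      </entity>
    </entity>
  </document>
</dataConfig>
```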


Re: DIH for TikaEntityProcessor

Posted by Kamuela Lau <ka...@gmail.com>.
Hi,

I was unable to reproduce the error that you got with the information
provided.
Below are the data-config.xml and managed-schema fields I used; the
data-config is mostly the same as yours
(I think that FileListEntityProcessor doesn't actually require a dataSource,
so it should be safe to put dataSource="null" on the outer entity):

<dataConfig>
  <dataSource name="bin" type="BinFileDataSource"/>
  <document>
    <entity name="files" processor="FileListEntityProcessor"
            baseDir="/path/to/sampleData" fileName=".*doc" recursive="true"
            rootEntity="false" dataSource="bin" onError="skip">
      <field column="fileAbsolutePath" name="id"/>
      <entity name="read_file" processor="TikaEntityProcessor"
              url="${files.fileAbsolutePath}">
        <field column="text" name="text"/>
      </entity>
    </entity>
  </document>
</dataConfig>

And from the managed schema:
    <field name="id" type="string" indexed="true" stored="true"
           required="true" multiValued="false" />
    <!-- docValues are enabled by default for long type so we don't need to
         index the version field  -->
    <field name="_version_" type="plong" indexed="false" stored="false"/>
    <field name="_root_" type="string" indexed="true" stored="false"
           docValues="false" />
    <field name="text" type="text_general" indexed="true" stored="true"
           multiValued="true"/>

When I used field column="text" name="content", the documents were still
indexed, but the extracted text was not, because my schema had no content
field.
I used the default config with Solr version 7.5.0 and was able to import the
data just fine (I also tested with .*DOC). Is there any other information
you can provide that would help me reproduce this error?
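
To illustrate the point about the unmapped column: the name attribute of a DIH field needs to match a field that the schema actually defines, otherwise the extracted text is not indexed (as observed above). Below is a sketch of two ways to line them up. The content field definition is an assumption modelled on the text field shown above, not something taken from the original poster's schema. Solr field names are also case-sensitive, so a schema field declared as Id would not match name="id".

```xml
<!-- Option A (data-config.xml): map Tika's "text" column onto the existing "text" field -->
<field column="text" name="text"/>

<!-- Option B (data-config.xml): keep the mapping to "content" ... -->
<field column="text" name="content"/>
<!-- ... and declare a matching field in managed-schema (assumed definition) -->
<field name="content" type="text_general" indexed="true" stored="true" multiValued="true"/>
```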





Re: DIH for TikaEntityProcessor

Posted by Alexandre Rafalovitch <ar...@gmail.com>.
Solr ships with a DIH Tika example that seems 90% identical to yours. Can you
get that to run? If it works, then you can focus on the 10% difference.

Perhaps it is the explicit dataSource=null in the outer entity? Or maybe
format=text on the inner one.

Regards,
     Alex
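
A sketch of the two differences mentioned above, applied to the config from the original post. This is not a copy of the shipped example; the attribute placement simply follows the suggestion. The unnamed dataSource element is an assumption, chosen so that the inner entity can pick it up as the default data source without naming it. Whether either change affects the reported error is untested here.

```xml
<dataConfig>
  <!-- Unnamed, so it acts as the default data source (assumption) -->
  <dataSource type="BinFileDataSource"/>
  <document>
    <!-- FileListEntityProcessor only enumerates files, so no data source is attached -->
    <entity name="files" processor="FileListEntityProcessor" dataSource="null"
            baseDir="D:/CAPTIA/docs/19107" fileName=".*DOC" recursive="true"
            rootEntity="false" onError="skip">
      <field column="fileAbsolutePath" name="id"/>
      <!-- format="text" asks Tika for plain-text output -->
      <entity name="read_file" processor="TikaEntityProcessor"
              url="${files.fileAbsolutePath}" format="text">
        <field column="text" name="content"/>
      </entity>
    </entity>
  </document>
</dataConfig>
```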


On Fri, Oct 12, 2018, 3:11 AM Martin Frank Hansen (MHQ), <MH...@kmd.dk> wrote:

> Hi again,
>
> Can anybody help me? Any suggestions as to why I am getting the error below?
>
> *Martin Frank Hansen*, Senior Data Analyst
>
> Data, IM & Analytics
>
> Lautrupparken 40-42, DK-2750 Ballerup
> E-mail mhq@kmd.dk  Web www.kmd.dk
> Mobile +4525571418
>
> *From:* Martin Frank Hansen (MHQ)
> *Sent:* 10 October 2018 10:15
> *To:* solr-user <so...@lucene.apache.org>
> *Subject:* DIH for TikaEntityProcessor
>
>
>
> Hi,
>
>
>
> I am trying to read documents from a file system into Solr, using
> dataimporthandler but keep getting the following errors:
>
>
>
> Exception while processing: files document : null:org.apache.solr.handler.dataimport.DataImportHandlerException: java.lang.ClassCastException: java.io.InputStreamReader cannot be cast to java.io.InputStream
>          at org.apache.solr.handler.dataimport.DataImportHandlerException.wrapAndThrow(DataImportHandlerException.java:61)
>          at org.apache.solr.handler.dataimport.EntityProcessorWrapper.nextRow(EntityProcessorWrapper.java:270)
>          at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:476)
>          at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:517)
>          at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:415)
>          at org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:330)
>          at org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:233)
>          at org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:424)
>          at org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:483)
>          at org.apache.solr.handler.dataimport.DataImporter.lambda$runAsync$0(DataImporter.java:466)
>          at java.lang.Thread.run(Thread.java:748)
> Caused by: java.lang.ClassCastException: java.io.InputStreamReader cannot be cast to java.io.InputStream
>          at org.apache.solr.handler.dataimport.TikaEntityProcessor.nextRow(TikaEntityProcessor.java:132)
>          at org.apache.solr.handler.dataimport.EntityProcessorWrapper.nextRow(EntityProcessorWrapper.java:267)
>          ... 9 more
>
> Full Import failed:java.lang.RuntimeException: java.lang.RuntimeException: org.apache.solr.handler.dataimport.DataImportHandlerException: java.lang.ClassCastException: java.io.InputStreamReader cannot be cast to java.io.InputStream
>          at org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:271)
>          at org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:424)
>          at org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:483)
>          at org.apache.solr.handler.dataimport.DataImporter.lambda$runAsync$0(DataImporter.java:466)
>          at java.lang.Thread.run(Thread.java:748)
> Caused by: java.lang.RuntimeException: org.apache.solr.handler.dataimport.DataImportHandlerException: java.lang.ClassCastException: java.io.InputStreamReader cannot be cast to java.io.InputStream
>          at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:417)
>          at org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:330)
>          at org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:233)
>          ... 4 more
> Caused by: org.apache.solr.handler.dataimport.DataImportHandlerException: java.lang.ClassCastException: java.io.InputStreamReader cannot be cast to java.io.InputStream
>          at org.apache.solr.handler.dataimport.DataImportHandlerException.wrapAndThrow(DataImportHandlerException.java:61)
>          at org.apache.solr.handler.dataimport.EntityProcessorWrapper.nextRow(EntityProcessorWrapper.java:270)
>          at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:476)
>          at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:517)
>          at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:415)
>          ... 6 more
> Caused by: java.lang.ClassCastException: java.io.InputStreamReader cannot be cast to java.io.InputStream
>          at org.apache.solr.handler.dataimport.TikaEntityProcessor.nextRow(TikaEntityProcessor.java:132)
>          at org.apache.solr.handler.dataimport.EntityProcessorWrapper.nextRow(EntityProcessorWrapper.java:267)
>          ... 9 more
>
> My data-config file looks as follows:
>
> <dataConfig>
>   <dataSource name="bin" type="BinFileDataSource" />
>   <document>
>       <entity name="files" processor="FileListEntityProcessor" baseDir="D:/CAPTIA/docs/19107" fileName=".*DOC" recursive="true" rootEntity="false" dataSource="bin" onError="skip">
>         <field column="fileAbsolutePath" name="id" />
>
>         <entity
>          name="read_file"
>          processor="TikaEntityProcessor"
>          url="${files.fileAbsolutePath}"
>          >
>           <field column="text" name="content" />
>         </entity>
>       </entity>
>   </document>
> </dataConfig>
>
> And in the Schema I basically have two fields:
>
> <field name="Id" type="string" indexed="true" stored="true" required="true" multiValued="false"/>
> <field name="text" type="text_general" indexed="true" stored="false" multiValued="true"/>
>
> Any help is appreciated.
>
> *Martin Frank Hansen*

SV: DIH for TikaEntityProcessor

Posted by "Martin Frank Hansen (MHQ)" <MH...@kmd.dk>.
Hi again,

Can anybody help me? Any suggestions as to why I am getting the error below?


Martin Frank Hansen, Senior Data Analyst

Data, IM & Analytics

Lautrupparken 40-42, DK-2750 Ballerup
E-mail mhq@kmd.dk  Web www.kmd.dk
Mobile +4525571418

From: Martin Frank Hansen (MHQ)
Sent: 10 October 2018 10:15
To: solr-user <so...@lucene.apache.org>
Subject: DIH for TikaEntityProcessor

Hi,

I am trying to read documents from a file system into Solr using the DataImportHandler, but keep getting the following errors:

Exception while processing: files document : null:org.apache.solr.handler.dataimport.DataImportHandlerException: java.lang.ClassCastException: java.io.InputStreamReader cannot be cast to java.io.InputStream
         at org.apache.solr.handler.dataimport.DataImportHandlerException.wrapAndThrow(DataImportHandlerException.java:61)
         at org.apache.solr.handler.dataimport.EntityProcessorWrapper.nextRow(EntityProcessorWrapper.java:270)
         at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:476)
         at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:517)
         at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:415)
         at org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:330)
         at org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:233)
         at org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:424)
         at org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:483)
         at org.apache.solr.handler.dataimport.DataImporter.lambda$runAsync$0(DataImporter.java:466)
         at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.ClassCastException: java.io.InputStreamReader cannot be cast to java.io.InputStream
         at org.apache.solr.handler.dataimport.TikaEntityProcessor.nextRow(TikaEntityProcessor.java:132)
         at org.apache.solr.handler.dataimport.EntityProcessorWrapper.nextRow(EntityProcessorWrapper.java:267)
         ... 9 more

Full Import failed:java.lang.RuntimeException: java.lang.RuntimeException: org.apache.solr.handler.dataimport.DataImportHandlerException: java.lang.ClassCastException: java.io.InputStreamReader cannot be cast to java.io.InputStream
         at org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:271)
         at org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:424)
         at org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:483)
         at org.apache.solr.handler.dataimport.DataImporter.lambda$runAsync$0(DataImporter.java:466)
         at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.RuntimeException: org.apache.solr.handler.dataimport.DataImportHandlerException: java.lang.ClassCastException: java.io.InputStreamReader cannot be cast to java.io.InputStream
         at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:417)
         at org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:330)
         at org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:233)
         ... 4 more
Caused by: org.apache.solr.handler.dataimport.DataImportHandlerException: java.lang.ClassCastException: java.io.InputStreamReader cannot be cast to java.io.InputStream
         at org.apache.solr.handler.dataimport.DataImportHandlerException.wrapAndThrow(DataImportHandlerException.java:61)
         at org.apache.solr.handler.dataimport.EntityProcessorWrapper.nextRow(EntityProcessorWrapper.java:270)
         at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:476)
         at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:517)
         at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:415)
         ... 6 more
Caused by: java.lang.ClassCastException: java.io.InputStreamReader cannot be cast to java.io.InputStream
         at org.apache.solr.handler.dataimport.TikaEntityProcessor.nextRow(TikaEntityProcessor.java:132)
         at org.apache.solr.handler.dataimport.EntityProcessorWrapper.nextRow(EntityProcessorWrapper.java:267)
         ... 9 more


My data-config file looks as follows:

<dataConfig>
  <dataSource name="bin" type="BinFileDataSource" />
  <document>
      <entity name="files" processor="FileListEntityProcessor" baseDir="D:/CAPTIA/docs/19107" fileName=".*DOC" recursive="true" rootEntity="false" dataSource="bin" onError="skip">
        <field column="fileAbsolutePath" name="id" />

        <entity
         name="read_file"
         processor="TikaEntityProcessor"
         url="${files.fileAbsolutePath}"
         >
          <field column="text" name="content" />
        </entity>
      </entity>
  </document>
</dataConfig>

And in the Schema I basically have two fields:

<field name="Id" type="string" indexed="true" stored="true" required="true" multiValued="false"/>
<field name="text" type="text_general" indexed="true" stored="false" multiValued="true"/>

Any help is appreciated.


Martin Frank Hansen

