Posted to solr-user@lucene.apache.org by Chandan khatua <ch...@nrifintech.com> on 2014/02/24 08:21:10 UTC

Can not index raw binary data stored in Database in BLOB format.

Hi,

We have raw binary data (not Word, Excel, XML, etc. files) stored in a
database BLOB column.

We are trying to index it using TikaEntityProcessor, but nothing seems to get
indexed.

The same configuration works when XML/Word/Excel files are stored in the
BLOB field.

Below is our data-config.xml:

<?xml version="1.0" encoding="UTF-8" ?>
<dataConfig>
  <dataSource name="db" driver="oracle.jdbc.driver.OracleDriver"
              url="jdbc:oracle:thin:@//a.a.a.a:a/d11gr21" user="abc" password="abc"
              convertType="true"/>
  <dataSource name="dastream" type="FieldStreamDataSource" />
  <document>
    <entity
        name="messages" pk=" PK" transformer='DateFormatTransformer'
        query="select * from table1"
        dataSource="db">
      <field column=" PK" name="id" />
      <field column="last_modified"  dateTimeFormat="YYYY-MM-DD HH24:MI:SS" locale="en" />
      <entity
          name="message"
          dataSource="dastream"
          processor="TikaEntityProcessor"
          url="message"
          dataField="messages.MESSAGE"
          format="text">
        <field column="text" name="mxMsg" blob="true" />
      </entity>
    </entity>
  </document>
</dataConfig>

 

Please suggest the changes required to index binary data.

Thanking you,

-Chandan


RE: Can not index raw binary data stored in Database in BLOB format.

Posted by Chandan khatua <ch...@nrifintech.com>.
I have verified that the blob column is called MESSAGE.
In my data-config file, the field column named 'id' is indexed in Solr, but
the data (field column name="mxMsg") is not indexed. It comes back empty,
within quotes.

The same configuration works on XML data (stored as BLOB in the DB), but
not on binary data (stored as BLOB in the DB).

Please help.

Thanking you,

- Chandan

-----Original Message-----
From: Raymond Wiker [mailto:rwiker@gmail.com] 
Sent: Monday, February 24, 2014 5:48 PM
To: solr-user@lucene.apache.org
Subject: Re: Can not index raw binary data stored in Database in BLOB
format.

Try running the query for the outer entity ("messages") in an sql client,
and verify that your blob column is called MESSAGE.



Re: Can not index raw binary data stored in Database in BLOB format.

Posted by Raymond Wiker <rw...@gmail.com>.
Try running the query for the outer entity ("messages") in an sql client,
and verify that your blob column is called MESSAGE.


On Mon, Feb 24, 2014 at 12:22 PM, Chandan khatua <ch...@nrifintech.com>wrote:

> I've tried as per your guide. But, no data are indexing.
> The output of Query screen looks like :
>
> <doc>
>     <str name="id">2158</str>
>     <arr name="mxMsg">
>       <str><?xml version="1.0" encoding="UTF-8"?><html
> xmlns="http://www.w3.org/1999/xhtml">
> <head>
> <meta name="Content-Type" content="application/octet-stream"/>
> <title/>
> </head>
> <body/></html></str>
>     </arr>
>     <long name="_version_">1460918369230258176</long></doc>
>
>
>
> But, the indexed data should be displayed within  <body> tag. When xml
> message are stored in DB in BLOB type, then indexing is done smoothly.
> But, I am trying to index binary data which are stored in DB in BLOB type.
>
> Need help.
>
> Thanking you,
> Chandan

RE: Can not index raw binary data stored in Database in BLOB format.

Posted by Chandan khatua <ch...@nrifintech.com>.
I tried what you suggested, but no data is indexed.
The output on the Query screen looks like this:

<doc>
    <str name="id">2158</str>
    <arr name="mxMsg">
      <str><?xml version="1.0" encoding="UTF-8"?><html
xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta name="Content-Type" content="application/octet-stream"/>
<title/>
</head>
<body/></html></str>
    </arr>
    <long name="_version_">1460918369230258176</long></doc>



But the indexed data should be displayed within the <body> tag. When XML
messages are stored in the DB as BLOB, indexing works smoothly.
But I am trying to index binary data stored in the DB as BLOB.

Need help.

Thanking you,
Chandan



-----Original Message-----
From: Raymond Wiker [mailto:rwiker@gmail.com] 
Sent: Monday, February 24, 2014 4:38 PM
To: solr-user@lucene.apache.org
Subject: Re: Can not index raw binary data stored in Database in BLOB
format.

Try replacing the inner entity with something like

<entity name="message"
           dataSource="dastream"
           processor="TikaEntityProcessor"
           dataField="messages.MESSAGE"
           format="xml">
    <field column="text" name="mxMsg"/>
  </entity>

--- this assumes that you get the blob from a column named "MESSAGE" in the
outer entity ("messages").




Re: Can not index raw binary data stored in Database in BLOB format.

Posted by Raymond Wiker <rw...@gmail.com>.
Try replacing the inner entity with something like

<entity name="message"
           dataSource="dastream"
           processor="TikaEntityProcessor"
           dataField="messages.MESSAGE"
           format="xml">
    <field column="text" name="mxMsg"/>
  </entity>

--- this assumes that you get the blob from a column named "MESSAGE" in the
outer entity ("messages").
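
To spell out the naming: as far as I know, dataField takes the form
<entity name>.<column name>, so the prefix has to be the name of the outer
entity ("messages"), not the name of a dataSource ("db"). A minimal sketch of
how the pieces nest, assuming the table, column and connection placeholders
from the config quoted below (untested against a real schema):

```xml
<dataConfig>
  <!-- "db" and "dastream" are dataSource names; they only appear in dataSource="..." -->
  <dataSource name="db" driver="oracle.jdbc.driver.OracleDriver"
              url="jdbc:oracle:thin:@//x.x.x.x:x/d11gr21" user="x" password="x"/>
  <dataSource name="dastream" type="FieldStreamDataSource"/>
  <document>
    <!-- "messages" is the entity name; this is the prefix dataField needs -->
    <entity name="messages" query="select * from table1" dataSource="db">
      <entity name="message"
              dataSource="dastream"
              processor="TikaEntityProcessor"
              dataField="messages.MESSAGE"
              format="xml">
        <field column="text" name="mxMsg"/>
      </entity>
    </entity>
  </document>
</dataConfig>
```

With format="xml" the extracted text should arrive wrapped in XHTML;
format="text" should give plain text instead.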


On Mon, Feb 24, 2014 at 11:51 AM, Chandan khatua <ch...@nrifintech.com>wrote:

> Hi Raymond !
>
> I've data-config.xml like bellow:
>
> <?xml version="1.0" encoding="UTF-8" ?>
> <dataConfig>
> <dataSource name="db" driver="oracle.jdbc.driver.OracleDriver"
> url="jdbc:oracle:thin:@//x.x.x.x:x/d11gr21" user="x" password="x"/>
>  <dataSource name="dastream" type="FieldStreamDataSource" />
>  <document>
>   <entity
>       name="messages" pk=" PK" transformer='DateFormatTransformer'
>       query="select * from table1"
>       dataSource="db">
>          <field column =" PK" name ="id" />
>          <field column="last_modified"  dateTimeFormat="YYYY-MM-DD
> HH24:MI:SS" locale="en" />
>     <entity
>         name="message"
>         dataSource="dastream"
>         processor="TikaEntityProcessor"
>         url="message"
>         dataField="db.MESSAGE"
>                 format="text"
>         >
>
>         <field column="text" name="mxMsg" blob="true"/>
>       </entity>
>     </entity>
>
>
>  </document>
> </dataConfig>
>
>
>
> This is looks like similar to your configuration. But when xml data are in
> BLOB in database, indexing is done. But, when binary data are in BLOB in
> database, indexing is NOT done.
> Please help.
>
> Thanking you,
> -Chandan

RE: Can not index raw binary data stored in Database in BLOB format.

Posted by Chandan khatua <ch...@nrifintech.com>.
Hi Raymond !

My data-config.xml looks like the one below:

<?xml version="1.0" encoding="UTF-8" ?>
<dataConfig>
  <dataSource name="db" driver="oracle.jdbc.driver.OracleDriver"
              url="jdbc:oracle:thin:@//x.x.x.x:x/d11gr21" user="x" password="x"/>
  <dataSource name="dastream" type="FieldStreamDataSource" />
  <document>
    <entity
        name="messages" pk=" PK" transformer='DateFormatTransformer'
        query="select * from table1"
        dataSource="db">
      <field column=" PK" name="id" />
      <field column="last_modified"  dateTimeFormat="YYYY-MM-DD HH24:MI:SS" locale="en" />
      <entity
          name="message"
          dataSource="dastream"
          processor="TikaEntityProcessor"
          url="message"
          dataField="db.MESSAGE"
          format="text">
        <field column="text" name="mxMsg" blob="true"/>
      </entity>
    </entity>
  </document>
</dataConfig>



This looks similar to your configuration. When XML data is stored in the
BLOB column, indexing works. But when binary data is stored in the BLOB
column, nothing is indexed.
Please help.

Thanking you,
-Chandan


-----Original Message-----
From: Raymond Wiker [mailto:rwiker@gmail.com] 
Sent: Monday, February 24, 2014 4:06 PM
To: solr-user@lucene.apache.org
Subject: Re: Can not index raw binary data stored in Database in BLOB
format.

I've done something like this; the key was to use a FieldStreamDataSource to
read from the BLOB field.

Something like

<datasource name="main" ...>
<dataSource type="FieldStreamDataSource" name="fieldstream"/>

then

      <entity name="tika" processor="TikaEntityProcessor"
dataField="main.BLOB" dataSource="fieldstream" format="xml">
        <field column="Author" meta="true" name="..."/>
        <field column="title" meta="true" name="title"/>
        <field column="text" name="content"/>
        <field column="content_type" name="content_type" meta="true"/>
        <field column="last_modified" name="last_modified" meta="true"/>
    </entity>

...






Re: Can not index raw binary data stored in Database in BLOB format.

Posted by Raymond Wiker <rw...@gmail.com>.
I've done something like this; the key was to use a FieldStreamDataSource
to read from the BLOB field.

Something like

<dataSource name="main" ...>
<dataSource type="FieldStreamDataSource" name="fieldstream"/>

then

      <entity name="tika" processor="TikaEntityProcessor"
dataField="main.BLOB" dataSource="fieldstream" format="xml">
        <field column="Author" meta="true" name="..."/>
        <field column="title" meta="true" name="title"/>
        <field column="text" name="content"/>
        <field column="content_type" name="content_type" meta="true"/>
        <field column="last_modified" name="last_modified" meta="true"/>
    </entity>

...




On Mon, Feb 24, 2014 at 11:04 AM, Chandan khatua <ch...@nrifintech.com>wrote:

> Hi Gora !
>
> Your concern was "What is the type of the column used to store the binary
> data in Oracle?"
> The column type is BLOB in DB.  The column can also have rich text file.
>
> Regards,
> Chandan
>
>
> -----Original Message-----
> From: Gora Mohanty [mailto:gora@mimirtech.com]
> Sent: Monday, February 24, 2014 3:02 PM
> To: solr-user@lucene.apache.org
> Subject: Re: Can not index raw binary data stored in Database in BLOB
> format.
>
> On 24 February 2014 12:51, Chandan khatua <ch...@nrifintech.com> wrote:
> > Hi,
> >
> >
> >
> > We have raw binary data stored in database(not word,excel,xml etc
> > files) in BLOB.
> >
> > We are trying to index using TikaEntityProcessor but nothing seems to
> > get indexed.
> >
> > But the same configuration works when xml/word/excel files are stored
> > in the BLOB field.
>
> Please start by reviewing http://wiki.apache.org/solr/DataImportHandler as
> the above seems quite confused. Why are you using TikaEntityProcessor if
> the
> data in the DB are not richtext files?
>
> What is the type of the column used to store the binary data in Oracle? You
> might be able to convert it with a ClobTransformer. Please see
> http://wiki.apache.org/solr/DataImportHandler#ClobTransformer
>
> http://wiki.apache.org/solr/DataImportHandlerFaq#Blob_values_in_my_table_are
> _added_to_the_Solr_document_as_object_strings_like_B.401f23c5
>
> Regards,
> Gora
>
>

Re: Can not index raw binary data stored in Database in BLOB format.

Posted by Raymond Wiker <rw...@gmail.com>.
A few things:

1) If your database uses a BLOB, you should not use ClobTransformer;
FieldStreamDataSource should be sufficient.

2) A previous message showed that the converted/extracted document
was empty (except for an HTML boilerplate wrapper). This was using the
configuration I suggested.

I'm guessing that TikaEntityProcessor is either receiving empty strings as
source, or failing to extract the content of certain file formats. To test
the latter, you could export one of the blobs to a file, and run the
stand-alone Tika app on it.

As to the possibility that TikaEntityProcessor is receiving empty strings
as input: I had a similar issue, but with varchars. In my case, the reason
was that I was running a really old version of Oracle, which would not work
with recent versions of the Oracle support libraries.

Another thing that might be worth checking: your main query uses "select *
..." to fetch every column. Have you tried explicitly listing the columns
you're interested in? Something like "select X_MSG_PK, MESSAGE from table1".
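
A sketch of the entity section with both suggestions applied (no
ClobTransformer, explicit column list). The table and column names are taken
from the config quoted below; whether listing the columns changes anything
here is only a guess, and this has not been run against the actual schema:

```xml
<entity name="messages" pk="X_MSG_PK"
        query="select X_MSG_PK, MESSAGE from table1"
        dataSource="db">
  <field column="X_MSG_PK" name="id"/>
  <!-- No transformer and no clob="true": the BLOB is read through FieldStreamDataSource -->
  <entity name="message"
          dataSource="dastream"
          processor="TikaEntityProcessor"
          dataField="messages.MESSAGE"
          format="text">
    <field column="text" name="mxMsg"/>
  </entity>
</entity>
```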


On Tue, Feb 25, 2014 at 1:11 PM, Chandan khatua <ch...@nrifintech.com>wrote:

> Okay.
>
> Here is my data-config file:
>
> <?xml version="1.0" encoding="UTF-8" ?>
> <dataConfig>
>   <dataSource name="db" driver="oracle.jdbc.driver.OracleDriver"
>               url="jdbc:oracle:thin:@//1.2.3.4:1/d11gr21" user="aaaa" password="aaaa" />
>   <dataSource name="dastream" type="FieldStreamDataSource"/>
>   <document>
>     <entity
>         name="messages" pk="X_MSG_PK"
>         query="select * from table1"
>         dataSource="db">
>       <field column="X_MSG_PK" name="id" />
>       <entity name="message"
>               transformer="ClobTransformer"
>               dataSource="dastream"
>               processor="TikaEntityProcessor"
>               dataField="messages.MESSAGE"
>               format="text">
>         <field column="text" name="mxMsg" clob="true"/>
>       </entity>
>     </entity>
>   </document>
> </dataConfig>
>
>
> ----------------------------------------------------------------------------
>
> Solr.log file :
>
> INFO  - 2014-02-25 17:33:40.023; org.apache.solr.core.SolrCore; [CHESS_CORE] webapp=/solr path=/admin/mbeans params={cat=QUERYHANDLER&_=1393329819994&wt=json} status=0 QTime=1
> INFO  - 2014-02-25 17:33:40.094; org.apache.solr.core.SolrCore; [CHESS_CORE] webapp=/solr path=/admin/mbeans params={cat=QUERYHANDLER&_=1393329820083&wt=json} status=0 QTime=0
> INFO  - 2014-02-25 17:33:40.117; org.apache.solr.core.SolrCore; [CHESS_CORE] webapp=/solr path=/dataimport params={indent=true&command=status&_=1393329820089&wt=json} status=0 QTime=16
> INFO  - 2014-02-25 17:33:40.131; org.apache.solr.core.SolrCore; [CHESS_CORE] webapp=/solr path=/dataimport params={indent=true&command=show-config&_=1393329820084} status=0 QTime=29
> INFO  - 2014-02-25 17:33:42.026; org.apache.solr.handler.dataimport.DataImporter; Loading DIH Configuration: /dataconfig/data-config.xml
> INFO  - 2014-02-25 17:33:42.031; org.apache.solr.handler.dataimport.DataImporter; Data Configuration loaded successfully
> INFO  - 2014-02-25 17:33:42.033; org.apache.solr.core.SolrCore; [CHESS_CORE] webapp=/solr path=/dataimport params={optimize=false&indent=true&clean=true&commit=true&verbose=false&command=full-import&debug=false&wt=json} status=0 QTime=8
> INFO  - 2014-02-25 17:33:42.035; org.apache.solr.handler.dataimport.DataImporter; Starting Full Import
> INFO  - 2014-02-25 17:33:42.043; org.apache.solr.core.SolrCore; [CHESS_CORE] webapp=/solr path=/dataimport params={indent=true&command=status&_=1393329822040&wt=json} status=0 QTime=0
>
> INFO  - 2014-02-25 17:33:42.064; org.apache.solr.handler.dataimport.SimplePropertiesWriter; Read dataimport.properties
> INFO  - 2014-02-25 17:33:42.092; org.apache.solr.search.SolrIndexSearcher; Opening Searcher@2a858a73 realtime
> INFO  - 2014-02-25 17:33:42.093; org.apache.solr.handler.dataimport.JdbcDataSource$1; Creating a connection for entity messages with URL: jdbc:oracle:thin:@//172.16.29.92:1521/d11gr21
> INFO  - 2014-02-25 17:33:42.113; org.apache.solr.handler.dataimport.JdbcDataSource$1; Time taken for getConnection(): 19
> INFO  - 2014-02-25 17:33:42.564; org.apache.solr.handler.dataimport.DocBuilder; Import completed successfully
> INFO  - 2014-02-25 17:33:42.564; org.apache.solr.update.DirectUpdateHandler2; start commit{,optimize=false,openSearcher=true,waitSearcher=true,expungeDeletes=false,softCommit=false,prepareCommit=false}
> INFO  - 2014-02-25 17:33:42.867; org.apache.solr.core.SolrDeletionPolicy;
> SolrDeletionPolicy.onCommit: commits: num=2
>
> commit{dir=NRTCachingDirectory(org.apache.lucene.store.MMapDirectory@C
> :\solr
> -4.5.1\example\multicore\CHESS_CORE\data\index
> lockFactory=org.apache.lucene.store.NativeFSLockFactory@2c6d8073;
> maxCacheMB=48.0 maxMergeSizeMB=4.0),segFN=segments_l,generation=21}
>
> commit{dir=NRTCachingDirectory(org.apache.lucene.store.MMapDirectory@C
> :\solr
> -4.5.1\example\multicore\CHESS_CORE\data\index
> lockFactory=org.apache.lucene.store.NativeFSLockFactory@2c6d8073;
> maxCacheMB=48.0 maxMergeSizeMB=4.0),segFN=segments_m,generation=22}
> INFO  - 2014-02-25 17:33:42.868; org.apache.solr.core.SolrDeletionPolicy;
> newest commit generation = 22
> INFO  - 2014-02-25 17:33:42.882; org.apache.solr.search.SolrIndexSearcher;
> Opening Searcher@558ea0cc main
> INFO  - 2014-02-25 17:33:42.886; org.apache.solr.core.QuerySenderListener;
> QuerySenderListener sending requests to Searcher@558ea0cc
> main{StandardDirectoryReader(segments_m:55:nrt _d(4.5.1):C80)}
> INFO  - 2014-02-25 17:33:42.889; org.apache.solr.core.QuerySenderListener;
> QuerySenderListener done.
> INFO  - 2014-02-25 17:33:42.889; org.apache.solr.core.SolrCore;
> [CHESS_CORE]
> Registered new searcher Searcher@558ea0cc
> main{StandardDirectoryReader(segments_m:55:nrt _d(4.5.1):C80)}
> INFO  - 2014-02-25 17:33:42.893;
> org.apache.solr.update.DirectUpdateHandler2; end_commit_flush
> INFO  - 2014-02-25 17:33:42.899;
> org.apache.solr.handler.dataimport.SimplePropertiesWriter; Read
> dataimport.properties
> INFO  - 2014-02-25 17:33:42.901;
> org.apache.solr.handler.dataimport.SimplePropertiesWriter; Wrote last
> indexed time to dataimport.properties
> INFO  - 2014-02-25 17:33:42.905;
> org.apache.solr.handler.dataimport.DocBuilder; Time taken = 0:0:0.839
> INFO  - 2014-02-25 17:33:42.905;
> org.apache.solr.update.processor.LogUpdateProcessor; [CHESS_CORE]
> webapp=/solr path=/dataimport
>
> params={optimize=false&indent=true&clean=true&commit=true&verbose=false&comm
> and=full-import&debug=false&wt=json} status=0 QTime=8 {deleteByQuery=*:*
> (-1461012211508969472),add=[2158 (1461012211583418368), 2265
> (1461012211591806976), 2225 (1461012211597049856), 2241
> (1461012211602292736), 2276 (1461012211607535616), 2277
> (1461012211612778496), 2302 (1461012211619069952), 4558
> (1461012211624312832), 2144 (1461012211629555712), 2145
> (1461012211635847168), ... (80 adds)],commit=} 0 8
> INFO  - 2014-02-25 17:33:47.623; org.apache.solr.core.SolrCore;
> [CHESS_CORE]
> webapp=/solr path=/dataimport
> params={indent=true&command=status&_=1393329827620&wt=json} status=0
> QTime=1
>
>
>
> ----------------------------------------------------------------------------
>
> ----------------------------------------------------------------------------
> -------------------------
>
> Part of Query result screen :
>
> "docs": [
>       {
>         "id": "2158",
>         "mxMsg": [
>           ""
>         ],
>         "_version_": 1461012211583418400
>       },
>       {
>         "id": "2265",
>         "mxMsg": [
>           ""
>         ],
>         "_version_": 1461012211591807000
>       },
>
>
> ----------------------------------------------------------------------------
>
> ----------------------------------------------------------------------------
> ----
>
> As you see,
>
> 'id' is indexed properly, but 'mxMsg' is empty.
>
>
> ----------------------------------------------------------------------------
> -------------------------------------------------------
>
> Now, please suggest me so that I can get data in 'mxMsg' field. The binary
> data is stored inDB as BLOB type.
>
> Please note:  The same configuration is working fine ('mxMsg' displays data
> if XML data are in DB as BLOB type).
>
>
>
> Please help,
>
> Looking forward,
>
> Chandan
>
>
> -----Original Message-----
> From: Gora Mohanty [mailto:gora@mimirtech.com]
> Sent: Tuesday, February 25, 2014 4:35 PM
> To: solr-user@lucene.apache.org
> Subject: Re: Can not index raw binary data stored in Database in BLOB
> format.
>
> On 25 February 2014 14:54, Chandan khatua <ch...@nrifintech.com> wrote:
> > Hi Gora,
> >
> > The column type in DB is BLOB. It only stores binary data.
> >
> > If I do not use TikaEntityProcessor, then the following exception occurs:
> [...]
>
> It is difficult to follow what you are doing when you say one thing, and
> seem to do another. You say above that you are not using
> TikaEntityProcessor
> but your DIH data configuration file shows that you are. Please start with
> one configuration, and show us the *exact* files in use, and the error from
> the Solr logs.
>
> Regards,
> Gora
>
>

RE: Can not index raw binary data stored in Database in BLOB format.

Posted by Chandan khatua <ch...@nrifintech.com>.
Okay.

Here is my data-config file:

<?xml version="1.0" encoding="UTF-8" ?>
<dataConfig>
  <dataSource name="db" driver="oracle.jdbc.driver.OracleDriver"
              url="jdbc:oracle:thin:@//1.2.3.4:1/d11gr21" user="aaaa" password="aaaa" />
  <dataSource name="dastream" type="FieldStreamDataSource" />
  <document>
    <entity name="messages" pk="X_MSG_PK"
            query="select * from table1"
            dataSource="db">
      <field column="X_MSG_PK" name="id" />
      <entity name="message"
              transformer="ClobTransformer"
              dataSource="dastream"
              processor="TikaEntityProcessor"
              dataField="messages.MESSAGE"
              format="text">
        <field column="text" name="mxMsg" clob="true" />
      </entity>
    </entity>
  </document>
</dataConfig>

----------------------------------------------------------------------------

Solr.log file :

INFO  - 2014-02-25 17:33:40.023; org.apache.solr.core.SolrCore; [CHESS_CORE]
webapp=/solr path=/admin/mbeans
params={cat=QUERYHANDLER&_=1393329819994&wt=json} status=0 QTime=1 
INFO  - 2014-02-25 17:33:40.094; org.apache.solr.core.SolrCore; [CHESS_CORE]
webapp=/solr path=/admin/mbeans
params={cat=QUERYHANDLER&_=1393329820083&wt=json} status=0 QTime=0 
INFO  - 2014-02-25 17:33:40.117; org.apache.solr.core.SolrCore; [CHESS_CORE]
webapp=/solr path=/dataimport
params={indent=true&command=status&_=1393329820089&wt=json} status=0
QTime=16 
INFO  - 2014-02-25 17:33:40.131; org.apache.solr.core.SolrCore; [CHESS_CORE]
webapp=/solr path=/dataimport
params={indent=true&command=show-config&_=1393329820084} status=0 QTime=29 
INFO  - 2014-02-25 17:33:42.026;
org.apache.solr.handler.dataimport.DataImporter; Loading DIH Configuration:
/dataconfig/data-config.xml
INFO  - 2014-02-25 17:33:42.031;
org.apache.solr.handler.dataimport.DataImporter; Data Configuration loaded
successfully
INFO  - 2014-02-25 17:33:42.033; org.apache.solr.core.SolrCore; [CHESS_CORE]
webapp=/solr path=/dataimport
params={optimize=false&indent=true&clean=true&commit=true&verbose=false&command=full-import&debug=false&wt=json}
status=0 QTime=8
INFO  - 2014-02-25 17:33:42.035;
org.apache.solr.handler.dataimport.DataImporter; Starting Full Import
INFO  - 2014-02-25 17:33:42.043; org.apache.solr.core.SolrCore; [CHESS_CORE]
webapp=/solr path=/dataimport
params={indent=true&command=status&_=1393329822040&wt=json} status=0 QTime=0

INFO  - 2014-02-25 17:33:42.064;
org.apache.solr.handler.dataimport.SimplePropertiesWriter; Read
dataimport.properties
INFO  - 2014-02-25 17:33:42.092; org.apache.solr.search.SolrIndexSearcher;
Opening Searcher@2a858a73 realtime
INFO  - 2014-02-25 17:33:42.093;
org.apache.solr.handler.dataimport.JdbcDataSource$1; Creating a connection
for entity messages with URL: jdbc:oracle:thin:@//172.16.29.92:1521/d11gr21
INFO  - 2014-02-25 17:33:42.113;
org.apache.solr.handler.dataimport.JdbcDataSource$1; Time taken for
getConnection(): 19
INFO  - 2014-02-25 17:33:42.564;
org.apache.solr.handler.dataimport.DocBuilder; Import completed successfully
INFO  - 2014-02-25 17:33:42.564;
org.apache.solr.update.DirectUpdateHandler2; start
commit{,optimize=false,openSearcher=true,waitSearcher=true,expungeDeletes=false,softCommit=false,prepareCommit=false}
INFO  - 2014-02-25 17:33:42.867; org.apache.solr.core.SolrDeletionPolicy;
SolrDeletionPolicy.onCommit: commits: num=2
	commit{dir=NRTCachingDirectory(org.apache.lucene.store.MMapDirectory@C:\solr-4.5.1\example\multicore\CHESS_CORE\data\index
	lockFactory=org.apache.lucene.store.NativeFSLockFactory@2c6d8073;
	maxCacheMB=48.0 maxMergeSizeMB=4.0),segFN=segments_l,generation=21}
	commit{dir=NRTCachingDirectory(org.apache.lucene.store.MMapDirectory@C:\solr-4.5.1\example\multicore\CHESS_CORE\data\index
	lockFactory=org.apache.lucene.store.NativeFSLockFactory@2c6d8073;
	maxCacheMB=48.0 maxMergeSizeMB=4.0),segFN=segments_m,generation=22}
INFO  - 2014-02-25 17:33:42.868; org.apache.solr.core.SolrDeletionPolicy;
newest commit generation = 22
INFO  - 2014-02-25 17:33:42.882; org.apache.solr.search.SolrIndexSearcher;
Opening Searcher@558ea0cc main
INFO  - 2014-02-25 17:33:42.886; org.apache.solr.core.QuerySenderListener;
QuerySenderListener sending requests to Searcher@558ea0cc
main{StandardDirectoryReader(segments_m:55:nrt _d(4.5.1):C80)}
INFO  - 2014-02-25 17:33:42.889; org.apache.solr.core.QuerySenderListener;
QuerySenderListener done.
INFO  - 2014-02-25 17:33:42.889; org.apache.solr.core.SolrCore; [CHESS_CORE]
Registered new searcher Searcher@558ea0cc
main{StandardDirectoryReader(segments_m:55:nrt _d(4.5.1):C80)}
INFO  - 2014-02-25 17:33:42.893;
org.apache.solr.update.DirectUpdateHandler2; end_commit_flush
INFO  - 2014-02-25 17:33:42.899;
org.apache.solr.handler.dataimport.SimplePropertiesWriter; Read
dataimport.properties
INFO  - 2014-02-25 17:33:42.901;
org.apache.solr.handler.dataimport.SimplePropertiesWriter; Wrote last
indexed time to dataimport.properties
INFO  - 2014-02-25 17:33:42.905;
org.apache.solr.handler.dataimport.DocBuilder; Time taken = 0:0:0.839
INFO  - 2014-02-25 17:33:42.905;
org.apache.solr.update.processor.LogUpdateProcessor; [CHESS_CORE]
webapp=/solr path=/dataimport
params={optimize=false&indent=true&clean=true&commit=true&verbose=false&command=full-import&debug=false&wt=json}
status=0 QTime=8 {deleteByQuery=*:* (-1461012211508969472),add=[2158
(1461012211583418368), 2265 (1461012211591806976), 2225
(1461012211597049856), 2241 (1461012211602292736), 2276
(1461012211607535616), 2277 (1461012211612778496), 2302
(1461012211619069952), 4558 (1461012211624312832), 2144
(1461012211629555712), 2145 (1461012211635847168), ... (80 adds)],commit=} 0 8
INFO  - 2014-02-25 17:33:47.623; org.apache.solr.core.SolrCore; [CHESS_CORE]
webapp=/solr path=/dataimport
params={indent=true&command=status&_=1393329827620&wt=json} status=0 QTime=1


----------------------------------------------------------------------------

Part of Query result screen :

"docs": [
      {
        "id": "2158",
        "mxMsg": [
          ""
        ],
        "_version_": 1461012211583418400
      },
      {
        "id": "2265",
        "mxMsg": [
          ""
        ],
        "_version_": 1461012211591807000
      },

----------------------------------------------------------------------------

As you see, 

'id' is indexed properly, but 'mxMsg' is empty.

----------------------------------------------------------------------------

Now, please suggest how I can get data into the 'mxMsg' field. The binary
data is stored in the DB as BLOB type.

Please note: the same configuration works fine when the BLOB column holds
XML data ('mxMsg' is populated in that case).
 


Please help,

Looking forward,

Chandan


-----Original Message-----
From: Gora Mohanty [mailto:gora@mimirtech.com] 
Sent: Tuesday, February 25, 2014 4:35 PM
To: solr-user@lucene.apache.org
Subject: Re: Can not index raw binary data stored in Database in BLOB
format.

On 25 February 2014 14:54, Chandan khatua <ch...@nrifintech.com> wrote:
> Hi Gora,
>
> The column type in DB is BLOB. It only stores binary data.
>
> If I do not use TikaEntityProcessor, then the following exception occurs:
[...]

It is difficult to follow what you are doing when you say one thing, and
seem to do another. You say above that you are not using TikaEntityProcessor
but your DIH data configuration file shows that you are. Please start with
one configuration, and show us the *exact* files in use, and the error from
the Solr logs.

Regards,
Gora


Re: Can not index raw binary data stored in Database in BLOB format.

Posted by Gora Mohanty <go...@mimirtech.com>.
On 25 February 2014 14:54, Chandan khatua <ch...@nrifintech.com> wrote:
> Hi Gora,
>
> The column type in DB is BLOB. It only stores binary data.
>
> If I do not use TikaEntityProcessor, then the following exception occurs:
[...]

It is difficult to follow what you are doing when you say one thing, and
seem to do another. You say above that you are not using TikaEntityProcessor
but your DIH data configuration file shows that you are. Please start with
one configuration, and show us the *exact* files in use, and the error from
the Solr logs.

Regards,
Gora

RE: Can not index raw binary data stored in Database in BLOB format.

Posted by Chandan khatua <ch...@nrifintech.com>.
Hi Gora,

The column type in DB is BLOB. It only stores binary data.

If I do not use TikaEntityProcessor, then the following exception occurs:

        at org.apache.solr.handler.dataimport.DataImporter$1.run(DataImporter.java:457)
59163 [Thread-16] ERROR org.apache.solr.handler.dataimport.DocBuilder  -
Exception while processing: messages document : SolrInputDocument(fields:
[id=2158]):org.apache.solr.handler.dataimport.DataImportHandlerException:
java.lang.ClassCastException: oracle.jdbc.driver.OracleBlobInputStream
cannot be cast to java.util.Iterator
        at org.apache.solr.handler.dataimport.SqlEntityProcessor.initQuery(SqlEntityProcessor.java:65)
        at org.apache.solr.handler.dataimport.SqlEntityProcessor.nextRow(SqlEntityProcessor.java:73)
        at org.apache.solr.handler.dataimport.EntityProcessorWrapper.nextRow(EntityProcessorWrapper.java:243)
        at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:469)
        at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:495)
        at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:408)
        at org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:323)
        at org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:231)
        at org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:411)
        at org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:476)
        at org.apache.solr.handler.dataimport.DataImporter$1.run(DataImporter.java:457)
Caused by: java.lang.ClassCastException:
oracle.jdbc.driver.OracleBlobInputStream cannot be cast to
java.util.Iterator
        at org.apache.solr.handler.dataimport.SqlEntityProcessor.initQuery(SqlEntityProcessor.java:59)
        ... 10 more



I have used ClobTransformer in the data-config file as below, and even then
it is not working:

<dataConfig>
  <dataSource name="db" driver="oracle.jdbc.driver.OracleDriver"
              url="jdbc:oracle:thin:@//a.a.a.a:a/d11gr21" user="aaaa" password="aaaaa" />
  <dataSource name="dastream" type="FieldStreamDataSource" />
  <document>
    <entity name="messages" pk="x_MSG_PK"
            query="select * from table1"
            dataSource="db">
      <field column="x_MSG_PK" name="id" />
      <entity name="message"
              transformer="ClobTransformer"
              dataSource="dastream"
              processor="TikaEntityProcessor"
              dataField="messages.MESSAGE"
              format="text">
        <field column="text" name="mxMsg" clob="true" />
      </entity>
    </entity>
  </document>
</dataConfig>


So, what changes do I need?

-Chandan


-----Original Message-----
From: Gora Mohanty [mailto:gora@mimirtech.com] 
Sent: Monday, February 24, 2014 5:49 PM
To: solr-user@lucene.apache.org
Subject: Re: Can not index raw binary data stored in Database in BLOB
format.

On 24 February 2014 15:34, Chandan khatua <ch...@nrifintech.com> wrote:
> Hi Gora !
>
> Your concern was "What is the type of the column used to store the 
> binary data in Oracle?"
> The column type is BLOB in DB.  The column can also have rich text file.

Um, your original message said that it does *not* contain richtext data. How
do you tell whether it has richtext data, or not? For just a binary blob,
the ClobTransformer should work, but you need the TikaEntityProcessor for
richtext data. If you do not know whether the data in the blob is richtext
or not, you will need to roll your own solution to determine that.

Regards,
Gora


Re: Can not index raw binary data stored in Database in BLOB format.

Posted by Gora Mohanty <go...@mimirtech.com>.
On 24 February 2014 15:34, Chandan khatua <ch...@nrifintech.com> wrote:
> Hi Gora !
>
> Your concern was "What is the type of the column used to store the binary
> data in Oracle?"
> The column type is BLOB in DB.  The column can also have rich text file.

Um, your original message said that it does *not* contain richtext data. How
do you tell whether it has richtext data, or not? For just a binary blob, the
ClobTransformer should work, but you need the TikaEntityProcessor for richtext
data. If you do not know whether the data in the blob is richtext or not, you
will need to roll your own solution to determine that.

Regards,
Gora

RE: Can not index raw binary data stored in Database in BLOB format.

Posted by Chandan khatua <ch...@nrifintech.com>.
Hi Gora !

You asked, "What is the type of the column used to store the binary
data in Oracle?"
The column type in the DB is BLOB. The column can also hold rich text files.

Regards,
Chandan


-----Original Message-----
From: Gora Mohanty [mailto:gora@mimirtech.com] 
Sent: Monday, February 24, 2014 3:02 PM
To: solr-user@lucene.apache.org
Subject: Re: Can not index raw binary data stored in Database in BLOB
format.

On 24 February 2014 12:51, Chandan khatua <ch...@nrifintech.com> wrote:
> Hi,
>
>
>
> We have raw binary data stored in database(not word,excel,xml etc 
> files) in BLOB.
>
> We are trying to index using TikaEntityProcessor but nothing seems to 
> get indexed.
>
> But the same configuration works when xml/word/excel files are stored 
> in the BLOB field.

Please start by reviewing http://wiki.apache.org/solr/DataImportHandler as
the above seems quite confused. Why are you using TikaEntityProcessor if the
data in the DB are not richtext files?

What is the type of the column used to store the binary data in Oracle? You
might be able to convert it with a ClobTransformer. Please see
http://wiki.apache.org/solr/DataImportHandler#ClobTransformer
http://wiki.apache.org/solr/DataImportHandlerFaq#Blob_values_in_my_table_are_added_to_the_Solr_document_as_object_strings_like_B.401f23c5

Regards,
Gora


Re: Can not index raw binary data stored in Database in BLOB format.

Posted by Gora Mohanty <go...@mimirtech.com>.
On 24 February 2014 12:51, Chandan khatua <ch...@nrifintech.com> wrote:
> Hi,
>
>
>
> We have raw binary data stored in database(not word,excel,xml etc files) in
> BLOB.
>
> We are trying to index using TikaEntityProcessor but nothing seems to get
> indexed.
>
> But the same configuration works when xml/word/excel files are stored in the
> BLOB field.

Please start by reviewing http://wiki.apache.org/solr/DataImportHandler as the
above seems quite confused. Why are you using TikaEntityProcessor if the data
in the DB are not richtext files?

What is the type of the column used to store the binary data in Oracle? You
might be able to convert it with a ClobTransformer. Please see
http://wiki.apache.org/solr/DataImportHandler#ClobTransformer
http://wiki.apache.org/solr/DataImportHandlerFaq#Blob_values_in_my_table_are_added_to_the_Solr_document_as_object_strings_like_B.401f23c5
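
As a rough illustration of what the first link describes, a ClobTransformer
is declared on the entity and switched on per field with clob="true". The
sketch below borrows the entity, column and field names (messages, X_MSG_PK,
MESSAGE, mxMsg) from the configs posted elsewhere in this thread. It assumes
the driver hands the column back as a character LOB; whether it does
anything useful for a true BLOB column would need to be checked.

```xml
<entity name="messages" pk="X_MSG_PK"
        transformer="ClobTransformer"
        query="select X_MSG_PK, MESSAGE from table1"
        dataSource="db">
  <field column="X_MSG_PK" name="id" />
  <!-- clob="true" marks the column that ClobTransformer should convert to a String -->
  <field column="MESSAGE" name="mxMsg" clob="true" />
</entity>
```

Note that the transformer sits on the SQL entity that actually returns the
column, with no nested TikaEntityProcessor entity involved.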

Regards,
Gora