Posted to solr-user@lucene.apache.org by Sascha Szott <sz...@zib.de> on 2009/08/11 16:35:14 UTC

Building documents using content residing both in database tables and text files

Hello,

Is it possible (and if so, how) to configure DIH to build index
documents from content that resides in different data sources?

Here is an example scenario:
Let's assume we have a table T with two columns, ID (which is the
primary key of T) and TITLE. Furthermore, each record in T is assigned a
directory containing text files that were generated from PDF documents
using Tika. A directory name is built from the ID of the associated
record in T, e.g. all text files associated with the record with
id = 101 are stored in directory 101.

Is there a way to configure DIH such that it uses ID, TITLE and the 
content of all related text files when building a document (the 
documents should have three fields: id, title, and text)?
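
For reference, the three target fields could be declared in schema.xml roughly as follows. This is only a sketch: the field types string and text are assumed to be the ones from the example schema shipped with Solr, and text is marked multi-valued on the assumption that a record's directory holds several text files.

```xml
<!-- Sketch of the target fields; the type names are assumed from Solr's example schema -->
<field name="id"    type="string" indexed="true" stored="true" required="true" />
<field name="title" type="text"   indexed="true" stored="true" />
<!-- one value per text file in the record's directory, hence multiValued -->
<field name="text"  type="text"   indexed="true" stored="false" multiValued="true" />
```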

Furthermore, as you may have noticed, a second question arises 
naturally: Will there be any integration of Solr Cell and DIH in an 
upcoming release, so that it would be possible to directly use the PDF 
documents instead of the extracted text files that were generated 
outside of Solr?

Best,
Sascha

Re: Building documents using content residing both in database tables and text files

Posted by Noble Paul നോബിള്‍ नोब्ळ् <no...@corp.aol.com>.

Isn't it better to make a jar of PlainTextEntityProcessor and drop it into
solr.home/lib?
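
With PlainTextEntityProcessor on the classpath, the innermost entity of Sascha's configuration in this thread could probably be reduced to something like the following. This is a sketch only, assuming the processor behaves as it does in the Solr 1.4 DIH code, where the whole file content is exposed in a column named plainText. The entity and data source names are taken from that configuration.

```xml
<!-- Sketch: replaces the XPathEntityProcessor-based "file" entity,
     so the text files no longer need to be wrapped in a root element -->
<entity name="file"
        dataSource="ds.filesystem"
        processor="PlainTextEntityProcessor"
        url="${dir.fileAbsolutePath}" >
  <!-- "plainText" is the column the processor fills with the file content -->
  <field column="plainText" name="text" />
</entity>
```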




-- 
-----------------------------------------------------
Noble Paul | Principal Engineer| AOL | http://aol.com

Re: Building documents using content residing both in database tables and text files

Posted by Sascha Szott <sz...@zib.de>.

Hi Noble,

Noble Paul wrote:
> Isn't it possible to do this by having two data sources (one JDBC and
> another File) and two entities? The outer entity can read from a DB
> and the inner entity can read from a file.
Yes, it is. Here's my db-data-config.xml file:

<!-- definition of data sources -->
<dataSource name="ds.database"
             driver="..."
             url="..."
             user="..."
             password="..." />
<dataSource name="ds.filesystem"
             type="FileDataSource" />


<!-- building the document using both db and file content
      (files are stored in /tmp/<recordId>)
-->
<document name="doc">
   <entity name="t" query="select * from t" dataSource="ds.database">
     <field column="id" name="id" />
     <field column="title" name="title" />
     <entity name="dir"
             processor="FileListEntityProcessor"
             baseDir="/tmp/${id}"
             fileName=".*"
             dataSource="null"
             rootEntity="false" >
       <entity name="file"
               dataSource="ds.filesystem"
               processor="XPathEntityProcessor"
               forEach="/root"
               url="${dir.fileAbsolutePath}"
               stream="false" >
         <field column="text" xpath="/root" />
       </entity>
     </entity>
   </entity>
</document>


Only one additional adjustment has to be made: Since I'm using Solr 1.3
and it comes without PlainTextEntityProcessor, I have to transform my
plain text files into XML files by surrounding the content with a root
element. That's all!
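
To illustrate, a wrapped file might look like the following. The file name, the encoding and the content are made up for this example. Note that wrapping alone is not enough when the extracted text contains characters such as & or <: they have to be escaped, otherwise the file is not well-formed XML.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- hypothetical content of /tmp/101/chapter1.xml; the root element matches forEach="/root" -->
<root>
Text extracted by Tika goes here. Markup characters in the original
text must be escaped, e.g. R&amp;D and a &lt; b.
</root>
```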

> On Tue, Aug 11, 2009 at 8:05 PM, Sascha Szott<sz...@zib.de> wrote:
>> Hello,
>>
>> is it possible (and if it is, how can I accomplish it) to configure DIH to
>> build up index documents by using content that resides in different data
>> sources?
>>
>> Here is an example scenario:
>> Let's assume we have a table T with two columns, ID (which is the primary
>> key of T) and TITLE. Furthermore, each record in T is assigned a directory
>> containing text files that were generated from PDF documents using
>> Tika. A directory name is built from the ID of the record in T
>> associated with that directory, e.g. all text files associated with a
>> record with id = 101 are stored in directory 101.
>>
>> Is there a way to configure DIH such that it uses ID, TITLE and the content
>> of all related text files when building a document (the documents should
>> have three fields: id, title, and text)?
>>
>> Furthermore, as you may have noticed, a second question arises naturally:
>> Will there be any integration of Solr Cell and DIH in an upcoming release,
>> so that it would be possible to directly use the pdf documents instead of
>> the extracted text files that were generated outside of Solr?
> 
> This is something I wish to see. But there has been no user request
> yet. You can raise an issue and it can be looked into.
I've raised issue SOLR-1358.

Best,
Sascha

Re: Building documents using content residing both in database tables and text files

Posted by Noble Paul നോബിള്‍ नोब्ळ् <no...@corp.aol.com>.

Isn't it possible to do this by having two data sources (one JDBC and
another File) and two entities? The outer entity can read from a DB
and the inner entity can read from a file.


On Tue, Aug 11, 2009 at 8:05 PM, Sascha Szott<sz...@zib.de> wrote:
> Hello,
>
> is it possible (and if it is, how can I accomplish it) to configure DIH to
> build up index documents by using content that resides in different data
> sources?
>
> Here is an example scenario:
> Let's assume we have a table T with two columns, ID (which is the primary
> key of T) and TITLE. Furthermore, each record in T is assigned a directory
> containing text files that were generated from PDF documents using
> Tika. A directory name is built from the ID of the record in T
> associated with that directory, e.g. all text files associated with a
> record with id = 101 are stored in directory 101.
>
> Is there a way to configure DIH such that it uses ID, TITLE and the content
> of all related text files when building a document (the documents should
> have three fields: id, title, and text)?
>
> Furthermore, as you may have noticed, a second question arises naturally:
> Will there be any integration of Solr Cell and DIH in an upcoming release,
> so that it would be possible to directly use the pdf documents instead of
> the extracted text files that were generated outside of Solr?

This is something I wish to see. But there has been no user request
yet. You can raise an issue and it can be looked into.
>
> Best,
> Sascha
>
>



-- 
-----------------------------------------------------
Noble Paul | Principal Engineer| AOL | http://aol.com