Posted to solr-user@lucene.apache.org by Noble Paul നോബിള്‍ नोब्ळ् <no...@corp.aol.com> on 2009/08/11 17:00:52 UTC

Re: Building documents using content residing both in database tables and text files

Isn't it possible to do this with two data sources (one Jdbc and one
File) and two entities? The outer entity can read from the DB
and the inner entity can read from a file.


On Tue, Aug 11, 2009 at 8:05 PM, Sascha Szott<sz...@zib.de> wrote:
> Hello,
>
> is it possible (and if it is, how can I accomplish it) to configure DIH to
> build up index documents by using content that resides in different data
> sources?
>
> Here is an example scenario:
> Let's assume we have a table T with two columns, ID (which is the primary
> key of T) and TITLE. Furthermore, each record in T is assigned a directory
> containing text files that were generated out of pdf documents by using
> Tika. A directory name is built from the ID of the record in T
> associated with that directory, e.g. all text files associated with a record
> with id = 101 are stored in directory 101.
>
> Is there a way to configure DIH such that it uses ID, TITLE and the content
> of all related text files when building a document (the documents should
> have three fields: id, title, and text)?
>
> Furthermore, as you may have noticed, a second question arises naturally:
> Will there be any integration of Solr Cell and DIH in an upcoming release,
> so that it would be possible to directly use the pdf documents instead of
> the extracted text files that were generated outside of Solr?

This is something I would like to see, but there has been no user request
yet. You can raise an issue and it can be looked into.
>
> Best,
> Sascha
>
>



-- 
-----------------------------------------------------
Noble Paul | Principal Engineer| AOL | http://aol.com

Re: Building documents using content residing both in database tables and text files

Posted by Noble Paul നോബിള്‍ नोब्ळ् <no...@corp.aol.com>.
Isn't it better to build a jar containing PlainTextEntityProcessor and drop
it into solr.home/lib?

On Tue, Aug 11, 2009 at 11:05 PM, Sascha Szott<sz...@zib.de> wrote:
> Hi Noble,
>
> Noble Paul wrote:
>>
>> Isn't it possible to do this with two data sources (one Jdbc and one
>> File) and two entities? The outer entity can read from the DB
>> and the inner entity can read from a file.
>
> Yes, it is. Here's my db-data-config.xml file:
>
> <!-- definition of data sources -->
> <dataSource name="ds.database"
>            driver="..."
>            url="..."
>            user="..."
>            password="..." />
> <dataSource name="ds.filesystem"
>            type="FileDataSource" />
>
>
> <!-- building the document using both db and file content
>     (files are stored in /tmp/<recordId>)
> -->
> <document name="doc">
>  <entity name="t" query="select * from t" dataSource="ds.database">
>    <field column="id" name="id" />
>    <field column="title" name="title" />
>    <entity name="dir"
>            processor="FileListEntityProcessor"
>            baseDir="/tmp/${id}"
>            fileName=".*"
>            dataSource="null"
>            rootEntity="false" >
>      <entity name="file"
>              dataSource="ds.filesystem"
>              processor="XPathEntityProcessor"
>              forEach="/root"
>              url="${dir.fileAbsolutePath}"
>              stream="false" >
>        <field column="text" xpath="/root" />
>      </entity>
>    </entity>
>  </entity>
> </document>
>
>
> Only one additional adjustment has to be made: since I'm using Solr 1.3, which
> comes without PlainTextEntityProcessor, I have to transform my plain text
> files into XML files by wrapping the content in a root element. That's
> all!
>
>> On Tue, Aug 11, 2009 at 8:05 PM, Sascha Szott<sz...@zib.de> wrote:
>>>
>>> Furthermore, as you may have noticed, a second question arises naturally:
>>> Will there be any integration of Solr Cell and DIH in an upcoming
>>> release,
>>> so that it would be possible to directly use the pdf documents instead of
>>> the extracted text files that were generated outside of Solr?
>>
>> This is something I would like to see, but there has been no user request
>> yet. You can raise an issue and it can be looked into.
>
> I've raised issue SOLR-1358.
>
> Best,
> Sascha
>
>




Re: Building documents using content residing both in database tables and text files

Posted by Sascha Szott <sz...@zib.de>.
Hi Noble,

Noble Paul wrote:
> Isn't it possible to do this with two data sources (one Jdbc and one
> File) and two entities? The outer entity can read from the DB
> and the inner entity can read from a file.
Yes, it is. Here's my db-data-config.xml file:

<!-- definition of data sources -->
<dataSource name="ds.database"
             driver="..."
             url="..."
             user="..."
             password="..." />
<dataSource name="ds.filesystem"
             type="FileDataSource" />


<!-- building the document using both db and file content
      (files are stored in /tmp/<recordId>)
-->
<document name="doc">
   <entity name="t" query="select * from t" dataSource="ds.database">
     <field column="id" name="id" />
     <field column="title" name="title" />
     <entity name="dir"
             processor="FileListEntityProcessor"
             baseDir="/tmp/${id}"
             fileName=".*"
             dataSource="null"
             rootEntity="false" >
       <entity name="file"
               dataSource="ds.filesystem"
               processor="XPathEntityProcessor"
               forEach="/root"
               url="${dir.fileAbsolutePath}"
               stream="false" >
         <field column="text" xpath="/root" />
       </entity>
     </entity>
   </entity>
</document>


Only one additional adjustment has to be made: since I'm using Solr 1.3,
which comes without PlainTextEntityProcessor, I have to transform my
plain text files into XML files by wrapping the content in a root
element. That's all!
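
The wrapping step itself is not shown in the thread. A minimal sketch of how
it could be done, assuming the extracted files are UTF-8 encoded and end in
.txt (both assumptions, as are the function names and paths, none of which
come from the original setup): the text is escaped so that &, < and > do not
break the XML, and control characters that XML 1.0 forbids (for example the
form feeds that PDF extraction often leaves behind) are replaced by spaces.

```python
import re
from pathlib import Path
from xml.sax.saxutils import escape

# Control characters that XML 1.0 does not allow in character data
# (tab, line feed and carriage return are permitted and left untouched).
_INVALID_XML_CHARS = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f]")


def wrap_text_as_xml(text: str, root: str = "root") -> str:
    """Wrap plain text in a single root element, escaping &, < and >."""
    cleaned = _INVALID_XML_CHARS.sub(" ", text)
    return (
        '<?xml version="1.0" encoding="UTF-8"?>\n'
        f"<{root}>{escape(cleaned)}</{root}>\n"
    )


def wrap_directory(src_dir: Path, dst_dir: Path) -> int:
    """Write an XML copy of every *.txt file in src_dir into dst_dir.

    Returns the number of files written. The input encoding (UTF-8)
    is an assumption of this sketch.
    """
    dst_dir.mkdir(parents=True, exist_ok=True)
    count = 0
    for txt in sorted(src_dir.glob("*.txt")):
        xml = wrap_text_as_xml(txt.read_text(encoding="utf-8"))
        (dst_dir / (txt.stem + ".xml")).write_text(xml, encoding="utf-8")
        count += 1
    return count
```

A call such as wrap_directory(Path("/data/txt/101"), Path("/tmp/101")) (the
source path is hypothetical) would fill the directory that baseDir points at.
Writing the XML into a separate directory matters here because fileName=".*"
matches every file in baseDir, so leftover .txt files would also be handed to
the XPathEntityProcessor.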

> On Tue, Aug 11, 2009 at 8:05 PM, Sascha Szott<sz...@zib.de> wrote:
>> Furthermore, as you may have noticed, a second question arises naturally:
>> Will there be any integration of Solr Cell and DIH in an upcoming release,
>> so that it would be possible to directly use the pdf documents instead of
>> the extracted text files that were generated outside of Solr?
> 
> This is something I would like to see, but there has been no user request
> yet. You can raise an issue and it can be looked into.
I've raised issue SOLR-1358.

Best,
Sascha