Posted to solr-user@lucene.apache.org by Noble Paul നോബിള് नोब्ळ् <no...@corp.aol.com> on 2009/08/11 17:00:52 UTC
Re: Building documents using content residing both in database tables
and text files
Isn't it possible to do this by having two data sources (one Jdbc and
another File) and two entities? The outer entity can read from a DB
and the inner entity can read from a file.
On Tue, Aug 11, 2009 at 8:05 PM, Sascha Szott<sz...@zib.de> wrote:
> Hello,
>
> is it possible (and if it is, how can I accomplish it) to configure DIH to
> build up index documents by using content that resides in different data
> sources?
>
> Here is an example scenario:
> Let's assume we have a table T with two columns, ID (which is the primary
> key of T) and TITLE. Furthermore, each record in T is assigned a directory
> containing text files that were generated out of pdf documents by using
> Tika. A directory name is built from the ID of the record in T
> associated with that directory, e.g. all text files associated with a record
> with id = 101 are stored in directory 101.
>
> Is there a way to configure DIH such that it uses ID, TITLE and the content
> of all related text files when building a document (the documents should
> have three fields: id, title, and text)?
>
> Furthermore, as you may have noticed, a second question arises naturally:
> Will there be any integration of Solr Cell and DIH in an upcoming release,
> so that it would be possible to directly use the pdf documents instead of
> the extracted text files that were generated outside of Solr?
This is something I wish to see, but there has been no user request
yet. You can raise an issue and it can be looked into.
>
> Best,
> Sascha
>
>
--
-----------------------------------------------------
Noble Paul | Principal Engineer| AOL | http://aol.com
Re: Building documents using content residing both in database tables
and text files
Posted by Noble Paul നോബിള് नोब्ळ् <no...@corp.aol.com>.
Isn't it better to make a jar of PlainTextEntityProcessor and drop it into
solr.home/lib?
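As a rough sketch of what that would buy you: with PlainTextEntityProcessor on the classpath, the innermost entity could read the text files directly, with no XML wrapping. The entity and data source names below are taken from the config quoted further down. This assumes the class resolves by its short name and exposes the file content in a column called plainText, so treat it as an illustration rather than a tested config.

```xml
<!-- Sketch only: assumes PlainTextEntityProcessor is available to DIH
     (for example from a jar in solr.home/lib). Replaces the inner
     XPathEntityProcessor entity; names come from the quoted config. -->
<entity name="file"
        dataSource="ds.filesystem"
        processor="PlainTextEntityProcessor"
        url="${dir.fileAbsolutePath}">
  <!-- the whole file content is expected in the "plainText" column -->
  <field column="plainText" name="text" />
</entity>
```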
On Tue, Aug 11, 2009 at 11:05 PM, Sascha Szott<sz...@zib.de> wrote:
> Hi Noble,
>
> Noble Paul wrote:
>>
>> Isn't it possible to do this by having two data sources (one Jdbc and
>> another File) and two entities? The outer entity can read from a DB
>> and the inner entity can read from a file.
>
> Yes, it is. Here's my db-data-config.xml file:
>
> <!-- definition of data sources -->
> <dataSource name="ds.database"
>             driver="..."
>             url="..."
>             user="..."
>             password="..." />
> <dataSource name="ds.filesystem"
>             type="FileDataSource" />
>
>
> <!-- building the document using both db and file content
>      (files are stored in /tmp/<recordId>)
> -->
> <document name="doc">
>   <entity name="t" query="select * from t" dataSource="ds.database">
>     <field column="id" name="id" />
>     <field column="title" name="title" />
>     <entity name="dir"
>             processor="FileListEntityProcessor"
>             baseDir="/tmp/${id}"
>             fileName=".*"
>             dataSource="null"
>             rootEntity="false" >
>       <entity name="file"
>               dataSource="ds.filesystem"
>               processor="XPathEntityProcessor"
>               forEach="/root"
>               url="${dir.fileAbsolutePath}"
>               stream="false" >
>         <field column="text" xpath="/root" />
>       </entity>
>     </entity>
>   </entity>
> </document>
>
>
> Only one additional adjustment has to be made: Since I'm using Solr 1.3 and
> it comes without PlainTextEntityProcessor, I have to transform my plain text
> files into XML files by surrounding the content with a root element. That's
> all!
>
>> On Tue, Aug 11, 2009 at 8:05 PM, Sascha Szott<sz...@zib.de> wrote:
>>>
>>> Hello,
>>>
>>> is it possible (and if it is, how can I accomplish it) to configure DIH
>>> to
>>> build up index documents by using content that resides in different data
>>> sources?
>>>
>>> Here is an example scenario:
>>> Let's assume we have a table T with two columns, ID (which is the primary
>>> key of T) and TITLE. Furthermore, each record in T is assigned a
>>> directory
>>> containing text files that were generated out of pdf documents by using
>>> Tika. A directory name is built from the ID of the record in T
>>> associated with that directory, e.g. all text files associated with a record
>>> with id = 101 are stored in directory 101.
>>>
>>> Is there a way to configure DIH such that it uses ID, TITLE and the
>>> content
>>> of all related text files when building a document (the documents should
>>> have three fields: id, title, and text)?
>>>
>>> Furthermore, as you may have noticed, a second question arises naturally:
>>> Will there be any integration of Solr Cell and DIH in an upcoming
>>> release,
>>> so that it would be possible to directly use the pdf documents instead of
>>> the extracted text files that were generated outside of Solr?
>>
>> This is something I wish to see, but there has been no user request
>> yet. You can raise an issue and it can be looked into.
>
> I've raised issue SOLR-1358.
>
> Best,
> Sascha
>
>
--
-----------------------------------------------------
Noble Paul | Principal Engineer| AOL | http://aol.com
Re: Building documents using content residing both in database tables
and text files
Posted by Sascha Szott <sz...@zib.de>.
Hi Noble,
Noble Paul wrote:
> Isn't it possible to do this by having two data sources (one Jdbc and
> another File) and two entities? The outer entity can read from a DB
> and the inner entity can read from a file.
Yes, it is. Here's my db-data-config.xml file:
<!-- definition of data sources -->
<dataSource name="ds.database"
            driver="..."
            url="..."
            user="..."
            password="..." />
<dataSource name="ds.filesystem"
            type="FileDataSource" />
<!-- building the document using both db and file content
     (files are stored in /tmp/<recordId>)
-->
<document name="doc">
  <entity name="t" query="select * from t" dataSource="ds.database">
    <field column="id" name="id" />
    <field column="title" name="title" />
    <entity name="dir"
            processor="FileListEntityProcessor"
            baseDir="/tmp/${id}"
            fileName=".*"
            dataSource="null"
            rootEntity="false" >
      <entity name="file"
              dataSource="ds.filesystem"
              processor="XPathEntityProcessor"
              forEach="/root"
              url="${dir.fileAbsolutePath}"
              stream="false" >
        <field column="text" xpath="/root" />
      </entity>
    </entity>
  </entity>
</document>
Only one additional adjustment has to be made: Since I'm using Solr 1.3
and it comes without PlainTextEntityProcessor, I have to transform my
plain text files into XML files by surrounding the content with a root
element. That's all!
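For illustration, a wrapped file might look roughly like the sketch below. The text content is made up, and the XML declaration is optional. Characters such as & and < in the extracted text have to be escaped (or put in a CDATA section), otherwise the file is no longer well-formed and XPathEntityProcessor cannot parse it.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Hypothetical example of /tmp/101/some-file.xml; the element name
     "root" matches forEach="/root" and xpath="/root" in the config. -->
<root>
Plain text extracted from the PDF goes here. Special characters such as
&amp; and &lt; are escaped so the file stays well-formed.
</root>
```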
> On Tue, Aug 11, 2009 at 8:05 PM, Sascha Szott<sz...@zib.de> wrote:
>> Hello,
>>
>> is it possible (and if it is, how can I accomplish it) to configure DIH to
>> build up index documents by using content that resides in different data
>> sources?
>>
>> Here is an example scenario:
>> Let's assume we have a table T with two columns, ID (which is the primary
>> key of T) and TITLE. Furthermore, each record in T is assigned a directory
>> containing text files that were generated out of pdf documents by using
>> Tika. A directory name is built from the ID of the record in T
>> associated with that directory, e.g. all text files associated with a record
>> with id = 101 are stored in directory 101.
>>
>> Is there a way to configure DIH such that it uses ID, TITLE and the content
>> of all related text files when building a document (the documents should
>> have three fields: id, title, and text)?
>>
>> Furthermore, as you may have noticed, a second question arises naturally:
>> Will there be any integration of Solr Cell and DIH in an upcoming release,
>> so that it would be possible to directly use the pdf documents instead of
>> the extracted text files that were generated outside of Solr?
>
> This is something I wish to see, but there has been no user request
> yet. You can raise an issue and it can be looked into.
I've raised issue SOLR-1358.
Best,
Sascha