Posted to solr-user@lucene.apache.org by Sascha Szott <sz...@zib.de> on 2009/08/11 16:35:14 UTC
Building documents using content residing both in database tables
and text files
Hello,
is it possible (and if it is, how can I accomplish it) to configure DIH
to build up index documents by using content that resides in different
data sources?
Here is an example scenario:
Let's assume we have a table T with two columns, ID (which is the
primary key of T) and TITLE. Furthermore, each record in T is assigned a
directory containing text files that were generated from PDF documents
by using Tika. A directory name is built by using the ID of the record
in T associated with that directory, e.g. all text files associated with a
record with id = 101 are stored in directory 101.
Is there a way to configure DIH such that it uses ID, TITLE and the
content of all related text files when building a document (the
documents should have three fields: id, title, and text)?
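For concreteness, the three target fields could be declared in schema.xml
roughly as follows. This is only a sketch: the field types "string" and
"text" are assumed to be defined as in the Solr example schema, and "text"
is marked multi-valued on the assumption that a directory may hold several
text files per record.

```xml
<!-- schema.xml fragment (sketch); assumes the field types "string" and "text"
     are defined as in the Solr example schema -->
<fields>
  <field name="id"    type="string" indexed="true" stored="true" required="true" />
  <field name="title" type="text"   indexed="true" stored="true" />
  <!-- multiValued is an assumption: one value per text file in the directory -->
  <field name="text"  type="text"   indexed="true" stored="false" multiValued="true" />
</fields>
<uniqueKey>id</uniqueKey>
```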
Furthermore, as you may have noticed, a second question arises
naturally: Will there be any integration of Solr Cell and DIH in an
upcoming release, so that it would be possible to directly use the PDF
documents instead of the extracted text files that were generated
outside of Solr?
Best,
Sascha
Re: Building documents using content residing both in database tables
and text files
Posted by Noble Paul നോബിള് नोब्ळ् <no...@corp.aol.com>.
Isn't it better to make a jar of PlainTextEntityProcessor and drop it into
solr.home/lib?
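With PlainTextEntityProcessor available, the innermost entity would
presumably no longer need the XML wrapping. A sketch, assuming the processor
behaves as documented for Solr 1.4, where it reads the whole file into an
implicit column named plainText; the entity and data source names are taken
from the config quoted below:

```xml
<entity name="file"
        dataSource="ds.filesystem"
        processor="PlainTextEntityProcessor"
        url="${dir.fileAbsolutePath}" >
  <!-- plainText is the implicit column holding the file content
       (assumed, per the Solr 1.4 documentation) -->
  <field column="plainText" name="text" />
</entity>
```

This entity would replace the XPathEntityProcessor-based "file" entity; the
outer "t" and "dir" entities would stay unchanged.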
On Tue, Aug 11, 2009 at 11:05 PM, Sascha Szott<sz...@zib.de> wrote:
> Hi Noble,
>
> Noble Paul wrote:
>>
>> Isn't it possible to do this by having two datasources (one JDBC and
>> another File) and two entities? The outer entity can read from a DB
>> and the inner entity can read from a file.
>
> Yes, it is. Here's my db-data-config.xml file:
>
> <!-- definition of data sources -->
> <dataSource name="ds.database"
>             driver="..."
>             url="..."
>             user="..."
>             password="..." />
> <dataSource name="ds.filesystem"
>             type="FileDataSource" />
>
>
> <!-- building the document using both db and file content
>      (files are stored in /tmp/<recordId>)
> -->
> <document name="doc">
>   <entity name="t" query="select * from t" dataSource="ds.database">
>     <field column="id" name="id" />
>     <field column="title" name="title" />
>     <entity name="dir"
>             processor="FileListEntityProcessor"
>             baseDir="/tmp/${id}"
>             fileName=".*"
>             dataSource="null"
>             rootEntity="false" >
>       <entity name="file"
>               dataSource="ds.filesystem"
>               processor="XPathEntityProcessor"
>               forEach="/root"
>               url="${dir.fileAbsolutePath}"
>               stream="false" >
>         <field column="text" xpath="/root" />
>       </entity>
>     </entity>
>   </entity>
> </document>
>
>
> Only one additional adjustment has to be made: Since I'm using Solr 1.3 and
> it comes without PlainTextEntityProcessor, I have to transform my plain text
> files into XML files by surrounding the content with a root element. That's
> all!
>
>> On Tue, Aug 11, 2009 at 8:05 PM, Sascha Szott<sz...@zib.de> wrote:
>>>
>>> Hello,
>>>
>>> is it possible (and if it is, how can I accomplish it) to configure DIH to
>>> build up index documents by using content that resides in different data
>>> sources?
>>>
>>> Here is an example scenario:
>>> Let's assume we have a table T with two columns, ID (which is the primary
>>> key of T) and TITLE. Furthermore, each record in T is assigned a directory
>>> containing text files that were generated from PDF documents by using
>>> Tika. A directory name is built by using the ID of the record in T
>>> associated with that directory, e.g. all text files associated with a record
>>> with id = 101 are stored in directory 101.
>>>
>>> Is there a way to configure DIH such that it uses ID, TITLE and the content
>>> of all related text files when building a document (the documents should
>>> have three fields: id, title, and text)?
>>>
>>> Furthermore, as you may have noticed, a second question arises naturally:
>>> Will there be any integration of Solr Cell and DIH in an upcoming release,
>>> so that it would be possible to directly use the PDF documents instead of
>>> the extracted text files that were generated outside of Solr?
>>
>> This is something I would like to see, but there has been no user request
>> yet. You can raise an issue and it can be looked into.
>
> I've raised issue SOLR-1358.
>
> Best,
> Sascha
>
>
--
-----------------------------------------------------
Noble Paul | Principal Engineer| AOL | http://aol.com
Re: Building documents using content residing both in database tables
and text files
Posted by Sascha Szott <sz...@zib.de>.
Hi Noble,
Noble Paul wrote:
> Isn't it possible to do this by having two datasources (one JDBC and
> another File) and two entities? The outer entity can read from a DB
> and the inner entity can read from a file.
Yes, it is. Here's my db-data-config.xml file:
<!-- definition of data sources -->
<dataSource name="ds.database"
            driver="..."
            url="..."
            user="..."
            password="..." />
<dataSource name="ds.filesystem"
            type="FileDataSource" />
<!-- building the document using both db and file content
     (files are stored in /tmp/<recordId>)
-->
<document name="doc">
  <entity name="t" query="select * from t" dataSource="ds.database">
    <field column="id" name="id" />
    <field column="title" name="title" />
    <entity name="dir"
            processor="FileListEntityProcessor"
            baseDir="/tmp/${id}"
            fileName=".*"
            dataSource="null"
            rootEntity="false" >
      <entity name="file"
              dataSource="ds.filesystem"
              processor="XPathEntityProcessor"
              forEach="/root"
              url="${dir.fileAbsolutePath}"
              stream="false" >
        <field column="text" xpath="/root" />
      </entity>
    </entity>
  </entity>
</document>
Only one additional adjustment has to be made: Since I'm using Solr 1.3
and it comes without PlainTextEntityProcessor, I have to transform my
plain text files into XML files by surrounding the content with a root
element. That's all!
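For illustration, a wrapped file could look like the following. The file
name and content are made up, and the CDATA section is an assumption for
avoiding the escaping of characters such as < and &, assuming the parser
hands CDATA content to XPathEntityProcessor as ordinary text. It would
break if the text itself contained the sequence ]]>.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- /tmp/101/part1.xml (hypothetical file name and content) -->
<root><![CDATA[
Plain text extracted from the PDF document goes here.
Characters such as < and & need no escaping inside a CDATA section.
]]></root>
```

The root element matches forEach="/root" and xpath="/root" in the config
above, so the whole file body ends up in the text column.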
> On Tue, Aug 11, 2009 at 8:05 PM, Sascha Szott<sz...@zib.de> wrote:
>> Hello,
>>
>> is it possible (and if it is, how can I accomplish it) to configure DIH to
>> build up index documents by using content that resides in different data
>> sources?
>>
>> Here is an example scenario:
>> Let's assume we have a table T with two columns, ID (which is the primary
>> key of T) and TITLE. Furthermore, each record in T is assigned a directory
>> containing text files that were generated from PDF documents by using
>> Tika. A directory name is built by using the ID of the record in T
>> associated with that directory, e.g. all text files associated with a record
>> with id = 101 are stored in directory 101.
>>
>> Is there a way to configure DIH such that it uses ID, TITLE and the content
>> of all related text files when building a document (the documents should
>> have three fields: id, title, and text)?
>>
>> Furthermore, as you may have noticed, a second question arises naturally:
>> Will there be any integration of Solr Cell and DIH in an upcoming release,
>> so that it would be possible to directly use the PDF documents instead of
>> the extracted text files that were generated outside of Solr?
>
> This is something I would like to see, but there has been no user request
> yet. You can raise an issue and it can be looked into.
I've raised issue SOLR-1358.
Best,
Sascha
Re: Building documents using content residing both in database tables
and text files
Posted by Noble Paul നോബിള് नोब्ळ् <no...@corp.aol.com>.
Isn't it possible to do this by having two datasources (one JDBC and
another File) and two entities? The outer entity can read from a DB
and the inner entity can read from a file.
On Tue, Aug 11, 2009 at 8:05 PM, Sascha Szott<sz...@zib.de> wrote:
> Hello,
>
> is it possible (and if it is, how can I accomplish it) to configure DIH to
> build up index documents by using content that resides in different data
> sources?
>
> Here is an example scenario:
> Let's assume we have a table T with two columns, ID (which is the primary
> key of T) and TITLE. Furthermore, each record in T is assigned a directory
> containing text files that were generated from PDF documents by using
> Tika. A directory name is built by using the ID of the record in T
> associated with that directory, e.g. all text files associated with a record
> with id = 101 are stored in directory 101.
>
> Is there a way to configure DIH such that it uses ID, TITLE and the content
> of all related text files when building a document (the documents should
> have three fields: id, title, and text)?
>
> Furthermore, as you may have noticed, a second question arises naturally:
> Will there be any integration of Solr Cell and DIH in an upcoming release,
> so that it would be possible to directly use the PDF documents instead of
> the extracted text files that were generated outside of Solr?
This is something I would like to see, but there has been no user request
yet. You can raise an issue and it can be looked into.
>
> Best,
> Sascha
>
>
--
-----------------------------------------------------
Noble Paul | Principal Engineer| AOL | http://aol.com