Posted to solr-user@lucene.apache.org by ZHANG Liang F <Li...@alcatel-sbell.com.cn> on 2012/03/27 07:33:04 UTC
how to store file path in Solr when using TikaEntityProcessor
Hi,
I am using DIH to index a local file system, but the file path, size and last-modified fields are not being stored. In schema.xml I defined:
<fields>
  <field name="title" type="string" indexed="true" stored="true"/>
  <field name="author" type="string" indexed="true" stored="true" />
  <!--<field name="text" type="text" indexed="true" stored="true" />
  liang added-->
  <field name="path" type="string" indexed="true" stored="true" />
  <field name="size" type="long" indexed="true" stored="true" />
  <field name="lastmodified" type="date" indexed="true" stored="true" />
</fields>
I also defined tika-data-config.xml:
<dataConfig>
  <dataSource name="bin" type="BinFileDataSource" />
  <document>
    <entity name="f" dataSource="null" rootEntity="false"
            processor="FileListEntityProcessor"
            baseDir="E:/my_project/ecmkit/infotouch"
            fileName=".*\.(DOC)|(PDF)|(pdf)|(doc)|(docx)|(ppt)" onError="skip"
            recursive="true">
      <entity name="tika-test" dataSource="bin" processor="TikaEntityProcessor"
              url="${f.fileAbsolutePath}" format="text" onError="skip">
        <field column="Author" name="author" meta="true"/>
        <field column="title" name="title" meta="true"/>
        <!--
        <field column="text" name="text"/> -->
        <field column="fileAbsolutePath" name="path" />
        <field column="fileSize" name="size" />
        <field column="fileLastModified" name="lastmodified" />
      </entity>
    </entity>
  </document>
</dataConfig>
The Solr version is 3.5. Any ideas?
Thanks in advance.
RE: how to store file path in Solr when using TikaEntityProcessor
Posted by ZHANG Liang F <Li...@alcatel-sbell.com.cn>.
Hi, that did the trick! Thanks a lot!
I did notice that the transformer attribute was added there but is never referenced, so I suppose it is not needed.
Thanks again!
-----Original Message-----
From: Luca Cavanna [mailto:cavannaluca@gmail.com]
Sent: March 28, 2012 23:16
To: solr-user@lucene.apache.org
Cc: Ahmet Arslan
Subject: Re: how to store file path in Solr when using TikaEntityProcessor
Hi,
You should change your data-config, moving the fields that come from FileListEntityProcessor into its own entity, one level up. Try this configuration:
<dataConfig>
  <dataSource name="bin" type="BinFileDataSource" />
  <document>
    <entity name="f" dataSource="null" rootEntity="false"
            processor="FileListEntityProcessor"
            transformer="TemplateTransformer"
            baseDir="/home/luca/Documents"
            fileName=".*\.(DOC)|(PDF)|(pdf)|(doc)|(docx)|(ppt)" onError="skip"
            recursive="true">
      <field column="fileAbsolutePath" name="path" />
      <field column="fileSize" name="size" />
      <field column="fileLastModified" name="lastmodified" />
      <entity name="tika-test" dataSource="bin"
              processor="TikaEntityProcessor"
              url="${f.fileAbsolutePath}" format="text" onError="skip">
        <field column="Author" name="author" meta="true"/>
        <field column="title" name="title" meta="true"/>
        <!--<field column="text" />-->
      </entity>
    </entity>
  </document>
</dataConfig>
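A side note on the fileName pattern used in the configuration above: in a regular expression, "|" binds more loosely than concatenation, so the pattern reads as ".*\.(DOC)" or "PDF" or "pdf" and so on, and only the first branch is tied to a dot. If FileListEntityProcessor applies the pattern as a substring search (which I believe it does, but please verify against your Solr version), a name such as pdf_notes.txt would be picked up as well. A tighter variant might look like the sketch below. The baseDir value is a placeholder, and the (?i) flag is an addition of this sketch, so it also accepts mixed-case extensions that the original pattern did not list.

```xml
<!-- Sketch only: baseDir is a placeholder path, not taken from the thread. -->
<entity name="f" dataSource="null" rootEntity="false"
        processor="FileListEntityProcessor"
        baseDir="/path/to/documents"
        fileName="(?i).*\.(doc|docx|pdf|ppt)$" onError="skip"
        recursive="true">
  <!-- Implicit file attributes stay on this outer entity. -->
  <field column="fileAbsolutePath" name="path" />
  <field column="fileSize" name="size" />
  <field column="fileLastModified" name="lastmodified" />
  <!-- The nested TikaEntityProcessor entity goes here, unchanged. -->
</entity>
```

Grouping the extensions in one alternation and anchoring with "$" keeps the match tied to the end of the file name, whether the pattern is applied as a search or as a full match.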
On Wed, Mar 28, 2012 at 3:50 AM, ZHANG Liang F < Liang.f.Zhang@alcatel-sbell.com.cn> wrote:
> Could you please show me how to get those values inside
> TikaEntityProcessor?
>
> -----Original Message-----
> From: Ahmet Arslan [mailto:iorixxx@yahoo.com]
> Sent: March 27, 2012 22:43
> To: solr-user@lucene.apache.org
> Subject: Re: how to store file path in Solr when using
> TikaEntityProcessor
>
> The implicit fields fileDir, file, fileAbsolutePath, fileSize,
> fileLastModified are generated by the FileListEntityProcessor. They
> should be defined above the TikaEntityProcessor.
>
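To illustrate the point about implicit fields, here is a minimal sketch that maps all five of them on the FileListEntityProcessor entity rather than on the nested Tika entity. The schema field names "dir" and "filename" are hypothetical: they are not in the schema.xml posted in this thread and would have to be added there. The baseDir value is a placeholder, and the transformer attribute is left out because no field uses a template.

```xml
<!-- Sketch only. "dir" and "filename" are hypothetical schema fields; baseDir is a placeholder. -->
<dataConfig>
  <dataSource name="bin" type="BinFileDataSource" />
  <document>
    <entity name="f" dataSource="null" rootEntity="false"
            processor="FileListEntityProcessor"
            baseDir="/path/to/documents"
            fileName=".*\.(DOC)|(PDF)|(pdf)|(doc)|(docx)|(ppt)" onError="skip"
            recursive="true">
      <!-- Implicit fields generated by FileListEntityProcessor for each file. -->
      <field column="fileDir" name="dir" />
      <field column="file" name="filename" />
      <field column="fileAbsolutePath" name="path" />
      <field column="fileSize" name="size" />
      <field column="fileLastModified" name="lastmodified" />
      <!-- The nested entity only extracts content and metadata through Tika. -->
      <entity name="tika-test" dataSource="bin"
              processor="TikaEntityProcessor"
              url="${f.fileAbsolutePath}" format="text" onError="skip">
        <field column="Author" name="author" meta="true"/>
        <field column="title" name="title" meta="true"/>
      </entity>
    </entity>
  </document>
</dataConfig>
```

The nested entity can still read the outer values through variables such as ${f.fileAbsolutePath}, which is how the url attribute above is resolved.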