Posted to solr-user@lucene.apache.org by Nathan Adams <na...@umich.edu> on 2009/01/24 01:26:00 UTC
DataImport TXT file entity processor
Is there a way to use Data Import Handler to index non-XML (i.e. simple
text) files (either via HTTP or FileSystem)? I need to put the entire
contents of a text file into a single field of a document and the other
fields are being pulled out of Oracle...
-Nathan
RE: DIH handling of missing files
Posted by Nathan Adams <na...@umich.edu>.
That example appears to be v1.3, which explains the problem. Thanks!
________________________________
From: Nathan Adams [mailto:natad@umich.edu]
Sent: Thu 01/29/2009 8:28 AM
To: solr-user@lucene.apache.org
Subject: RE: DIH handling of missing files
I'm running the example from the DIH wiki page:
http://wiki.apache.org/solr-data/attachments/DataImportHandler/attachments/example-solr-home.jar
-Nathan
________________________________
From: Noble Paul നോബിള് नोब्ळ् [mailto:noble.paul@gmail.com]
Sent: Wed 01/28/2009 11:32 PM
To: solr-user@lucene.apache.org
Subject: Re: DIH handling of missing files
onError="continue" should help.
Which version of DIH are you using? onError is a Solr 1.4 feature.
--Noble
On Thu, Jan 29, 2009 at 5:04 AM, Nathan Adams <na...@umich.edu> wrote:
> I am constructing documents from a JDBC datasource and a HTTP datasource
> (see data-config file below.) My problem is that I cannot know if a
> particular HTTP URL is available at index time, so I need DIH to
> continue processing even if the HTTP location returns a 404.
> onError="continue" does not appear to help in this case. Should it?
>
> <dataConfig>
>
> <dataSource type="JdbcDataSource" name="db"
> driver="oracle.jdbc.driver.OracleDriver" url="jdbc:oracle:thin:@?????"
> user="???" password="???"/>
>
> <dataSource type="HttpDataSource" name="http"/>
>
> <document name="resources">
>
> <entity name="metadata" dataSource="db" pk="RESOURCEID"
> query="select * from ????" onError="continue">
>
> <entity name="xmltext"
> url="http://???.com/${metadata.RESOURCEID}.xml" forEach="/content"
> dataSource="http" processor="XPathEntityProcessor" onError="continue">
>
> <field column="FULLTEXT" xpath="/content"/>
>
> </entity>
>
> </entity>
>
> </document>
>
> </dataConfig>
>
> Thanks,
> Nathan
>
--
--Noble Paul
RE: DIH handling of missing files
Posted by Nathan Adams <na...@umich.edu>.
I'm running the example from the DIH wiki page:
http://wiki.apache.org/solr-data/attachments/DataImportHandler/attachments/example-solr-home.jar
-Nathan
________________________________
From: Noble Paul നോബിള് नोब्ळ् [mailto:noble.paul@gmail.com]
Sent: Wed 01/28/2009 11:32 PM
To: solr-user@lucene.apache.org
Subject: Re: DIH handling of missing files
onError="continue" should help.
Which version of DIH are you using? onError is a Solr 1.4 feature.
--Noble
On Thu, Jan 29, 2009 at 5:04 AM, Nathan Adams <na...@umich.edu> wrote:
> I am constructing documents from a JDBC datasource and a HTTP datasource
> (see data-config file below.) My problem is that I cannot know if a
> particular HTTP URL is available at index time, so I need DIH to
> continue processing even if the HTTP location returns a 404.
> onError="continue" does not appear to help in this case. Should it?
>
> <dataConfig>
>
> <dataSource type="JdbcDataSource" name="db"
> driver="oracle.jdbc.driver.OracleDriver" url="jdbc:oracle:thin:@?????"
> user="???" password="???"/>
>
> <dataSource type="HttpDataSource" name="http"/>
>
> <document name="resources">
>
> <entity name="metadata" dataSource="db" pk="RESOURCEID"
> query="select * from ????" onError="continue">
>
> <entity name="xmltext"
> url="http://???.com/${metadata.RESOURCEID}.xml" forEach="/content"
> dataSource="http" processor="XPathEntityProcessor" onError="continue">
>
> <field column="FULLTEXT" xpath="/content"/>
>
> </entity>
>
> </entity>
>
> </document>
>
> </dataConfig>
>
> Thanks,
> Nathan
>
--
--Noble Paul
Re: DIH handling of missing files
Posted by Noble Paul നോബിള് नोब्ळ् <no...@gmail.com>.
onError="continue" should help.
Which version of DIH are you using? onError is a Solr 1.4 feature.
--Noble
On Thu, Jan 29, 2009 at 5:04 AM, Nathan Adams <na...@umich.edu> wrote:
> I am constructing documents from a JDBC datasource and a HTTP datasource
> (see data-config file below.) My problem is that I cannot know if a
> particular HTTP URL is available at index time, so I need DIH to
> continue processing even if the HTTP location returns a 404.
> onError="continue" does not appear to help in this case. Should it?
>
> <dataConfig>
>
> <dataSource type="JdbcDataSource" name="db"
> driver="oracle.jdbc.driver.OracleDriver" url="jdbc:oracle:thin:@?????"
> user="???" password="???"/>
>
> <dataSource type="HttpDataSource" name="http"/>
>
> <document name="resources">
>
> <entity name="metadata" dataSource="db" pk="RESOURCEID"
> query="select * from ????" onError="continue">
>
> <entity name="xmltext"
> url="http://???.com/${metadata.RESOURCEID}.xml" forEach="/content"
> dataSource="http" processor="XPathEntityProcessor" onError="continue">
>
> <field column="FULLTEXT" xpath="/content"/>
>
> </entity>
>
> </entity>
>
> </document>
>
> </dataConfig>
>
> Thanks,
> Nathan
>
--
--Noble Paul
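
As a rough sketch of how this might look on Solr 1.4 or later, the config
from the question can be trimmed so that only the inner HTTP entity
tolerates failures. Per the DIH documentation, onError accepts abort (the
default), skip (drop the current document) and continue (carry on as if
the error had not happened). The host example.com, the table name
resources and the connection details are placeholders, not values from
this thread.

```xml
<dataConfig>
  <!-- Connection details are placeholders; substitute your own. -->
  <dataSource type="JdbcDataSource" name="db"
              driver="oracle.jdbc.driver.OracleDriver"
              url="jdbc:oracle:thin:@//dbhost:1521/service"
              user="solr" password="secret"/>
  <dataSource type="HttpDataSource" name="http"/>
  <document name="resources">
    <!-- Outer entity: onError is left at its default (abort),
         so a database failure still stops the import. -->
    <entity name="metadata" dataSource="db" pk="RESOURCEID"
            query="select * from resources">
      <!-- Inner entity: an unavailable URL (for example a 404)
           is tolerated on Solr 1.4 and later. -->
      <entity name="xmltext" dataSource="http"
              processor="XPathEntityProcessor"
              url="http://example.com/${metadata.RESOURCEID}.xml"
              forEach="/content" onError="continue">
        <field column="FULLTEXT" xpath="/content"/>
      </entity>
    </entity>
  </document>
</dataConfig>
```

With onError="continue" on the inner entity, a missing file should leave
FULLTEXT empty for that row while the Oracle columns are still indexed.
With onError="skip" the whole document would be dropped instead. On Solr
1.3 the attribute is not recognised, so neither setting has any effect.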
DIH handling of missing files
Posted by Nathan Adams <na...@umich.edu>.
I am constructing documents from a JDBC datasource and an HTTP datasource
(see the data-config file below). My problem is that I cannot know if a
particular HTTP URL is available at index time, so I need DIH to
continue processing even if the HTTP location returns a 404.
onError="continue" does not appear to help in this case. Should it?
<dataConfig>
  <dataSource type="JdbcDataSource" name="db"
      driver="oracle.jdbc.driver.OracleDriver" url="jdbc:oracle:thin:@?????"
      user="???" password="???"/>
  <dataSource type="HttpDataSource" name="http"/>
  <document name="resources">
    <entity name="metadata" dataSource="db" pk="RESOURCEID"
        query="select * from ????" onError="continue">
      <entity name="xmltext"
          url="http://???.com/${metadata.RESOURCEID}.xml" forEach="/content"
          dataSource="http" processor="XPathEntityProcessor" onError="continue">
        <field column="FULLTEXT" xpath="/content"/>
      </entity>
    </entity>
  </document>
</dataConfig>
Thanks,
Nathan
Re: DataImport TXT file entity processor
Posted by Noble Paul നോബിള് नोब्ळ् <no...@gmail.com>.
An EntityProcessor looks right to me. It may help us add more
attributes if needed.
PlainTextEntityProcessor looks like a good name. It can also be used
to read HTML, etc.
--Noble
On Sat, Jan 24, 2009 at 12:37 PM, Shalin Shekhar Mangar
<sh...@gmail.com> wrote:
> On Sat, Jan 24, 2009 at 5:56 AM, Nathan Adams <na...@umich.edu> wrote:
>
>> Is there a way to use Data Import Handler to index non-XML (i.e. simple
>> text) files (either via HTTP or FileSystem)? I need to put the entire
>> contents of a text file into a single field of a document and the other
>> fields are being pulled out of Oracle...
>
>
> Not yet. But I think it will be nice to have. Can you open an issue in Jira?
>
> I think importing from HTTP was something another user had asked for
> recently. How do you get the url/path of this text file? That would help
> decide if we need a Transformer or EntityProcessor for these tasks.
> --
> Regards,
> Shalin Shekhar Mangar.
>
--
--Noble Paul
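
At the time of this thread the processor was only a proposal. A
PlainTextEntityProcessor did later ship with Solr 1.4, and the sketch
below assumes that version's behaviour, where the whole file is read
unparsed into a single implicit column named plainText. The host
example.com, the table name resources, the Solr field name fulltext and
the connection details are hypothetical; URLDataSource is the Solr 1.4
name for the HTTP data source.

```xml
<dataConfig>
  <!-- Connection details are placeholders; substitute your own. -->
  <dataSource type="JdbcDataSource" name="db"
              driver="oracle.jdbc.driver.OracleDriver"
              url="jdbc:oracle:thin:@//dbhost:1521/service"
              user="solr" password="secret"/>
  <!-- The text processor needs a data source that returns a Reader,
       such as URLDataSource or FileDataSource. -->
  <dataSource type="URLDataSource" name="http"/>
  <document name="resources">
    <!-- Metadata fields come from Oracle, one document per row. -->
    <entity name="metadata" dataSource="db" pk="RESOURCEID"
            query="select * from resources">
      <!-- One text file per row; its full contents go into one field. -->
      <entity name="rawtext" dataSource="http"
              processor="PlainTextEntityProcessor"
              url="http://example.com/${metadata.RESOURCEID}.txt">
        <field column="plainText" name="fulltext"/>
      </entity>
    </entity>
  </document>
</dataConfig>
```

Because the URL is built from a column of the parent entity, this covers
the case raised in the thread where the path of the text file comes from
the database row.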
Re: DataImport TXT file entity processor
Posted by Shalin Shekhar Mangar <sh...@gmail.com>.
On Sat, Jan 24, 2009 at 5:56 AM, Nathan Adams <na...@umich.edu> wrote:
> Is there a way to use Data Import Handler to index non-XML (i.e. simple
> text) files (either via HTTP or FileSystem)? I need to put the entire
> contents of a text file into a single field of a document and the other
> fields are being pulled out of Oracle...
Not yet. But I think it will be nice to have. Can you open an issue in Jira?
I think importing from HTTP was something another user had asked for
recently. How do you get the url/path of this text file? That would help
decide if we need a Transformer or EntityProcessor for these tasks.
--
Regards,
Shalin Shekhar Mangar.