Posted to solr-user@lucene.apache.org by Nathan Adams <na...@umich.edu> on 2009/01/24 01:26:00 UTC

DataImport TXT file entity processor

Is there a way to use Data Import Handler to index non-XML (i.e. simple
text) files (either via HTTP or FileSystem)?  I need to put the entire
contents of a text file into a single field of a document and the other
fields are being pulled out of Oracle...

 

-Nathan


RE: DIH handling of missing files

Posted by Nathan Adams <na...@umich.edu>.
Which appears to be v1.3, which explains the problem.  Thanks!

________________________________

From: Nathan Adams [mailto:natad@umich.edu]
Sent: Thu 01/29/2009 8:28 AM
To: solr-user@lucene.apache.org
Subject: RE: DIH handling of missing files



I'm running the example from the DIH wiki page:

http://wiki.apache.org/solr-data/attachments/DataImportHandler/attachments/example-solr-home.jar

-Nathan


________________________________

From: Noble Paul നോബിള്‍ नोब्ळ् [mailto:noble.paul@gmail.com]
Sent: Wed 01/28/2009 11:32 PM
To: solr-user@lucene.apache.org
Subject: Re: DIH handling of missing files



onError="continue" should help.

Which version of DIH are you using? onError is a Solr 1.4 feature.
--Noble

On Thu, Jan 29, 2009 at 5:04 AM, Nathan Adams <na...@umich.edu> wrote:
> I am constructing documents from a JDBC datasource and a HTTP datasource
> (see data-config file below.)  My problem is that I cannot know if a
> particular HTTP URL is available at index time, so I need DIH to
> continue processing even if the HTTP location returns a 404.
> onError="continue" does not appear to help in this case.  Should it?
>
> <dataConfig>
>
>    <dataSource type="JdbcDataSource" name="db"
> driver="oracle.jdbc.driver.OracleDriver" url="jdbc:oracle:thin:@?????"
> user="???" password="???"/>
>
>    <dataSource type="HttpDataSource" name="http"/>
>
>    <document name="resources">
>
>        <entity name="metadata" dataSource="db" pk="RESOURCEID"
> query="select * from ????" onError="continue">
>
>        <entity name="xmltext"
> url="http://???.com/${metadata.RESOURCEID}.xml" forEach="/content"
> dataSource="http" processor="XPathEntityProcessor" onError="continue">
>
>            <field column="FULLTEXT" xpath="/content"/>
>
>        </entity>
>
>        </entity>
>
>    </document>
>
> </dataConfig>
>
> Thanks,
> Nathan
>



--
--Noble Paul





RE: DIH handling of missing files

Posted by Nathan Adams <na...@umich.edu>.
I'm running the example from the DIH wiki page:
 
http://wiki.apache.org/solr-data/attachments/DataImportHandler/attachments/example-solr-home.jar
 
-Nathan

 
________________________________

From: Noble Paul നോബിള്‍ नोब्ळ् [mailto:noble.paul@gmail.com]
Sent: Wed 01/28/2009 11:32 PM
To: solr-user@lucene.apache.org
Subject: Re: DIH handling of missing files



onError="continue" should help.

Which version of DIH are you using? onError is a Solr 1.4 feature.
--Noble

On Thu, Jan 29, 2009 at 5:04 AM, Nathan Adams <na...@umich.edu> wrote:
> I am constructing documents from a JDBC datasource and a HTTP datasource
> (see data-config file below.)  My problem is that I cannot know if a
> particular HTTP URL is available at index time, so I need DIH to
> continue processing even if the HTTP location returns a 404.
> onError="continue" does not appear to help in this case.  Should it?
>
> <dataConfig>
>
>    <dataSource type="JdbcDataSource" name="db"
> driver="oracle.jdbc.driver.OracleDriver" url="jdbc:oracle:thin:@?????"
> user="???" password="???"/>
>
>    <dataSource type="HttpDataSource" name="http"/>
>
>    <document name="resources">
>
>        <entity name="metadata" dataSource="db" pk="RESOURCEID"
> query="select * from ????" onError="continue">
>
>        <entity name="xmltext"
> url="http://???.com/${metadata.RESOURCEID}.xml" forEach="/content"
> dataSource="http" processor="XPathEntityProcessor" onError="continue">
>
>            <field column="FULLTEXT" xpath="/content"/>
>
>        </entity>
>
>        </entity>
>
>    </document>
>
> </dataConfig>
>
> Thanks,
> Nathan
>



--
--Noble Paul



Re: DIH handling of missing files

Posted by Noble Paul നോബിള്‍ नोब्ळ् <no...@gmail.com>.
onError="continue" should help.

Which version of DIH are you using? onError is a Solr 1.4 feature.
--Noble

On Thu, Jan 29, 2009 at 5:04 AM, Nathan Adams <na...@umich.edu> wrote:
> I am constructing documents from a JDBC datasource and a HTTP datasource
> (see data-config file below.)  My problem is that I cannot know if a
> particular HTTP URL is available at index time, so I need DIH to
> continue processing even if the HTTP location returns a 404.
> onError="continue" does not appear to help in this case.  Should it?
>
> <dataConfig>
>
>    <dataSource type="JdbcDataSource" name="db"
> driver="oracle.jdbc.driver.OracleDriver" url="jdbc:oracle:thin:@?????"
> user="???" password="???"/>
>
>    <dataSource type="HttpDataSource" name="http"/>
>
>    <document name="resources">
>
>        <entity name="metadata" dataSource="db" pk="RESOURCEID"
> query="select * from ????" onError="continue">
>
>        <entity name="xmltext"
> url="http://???.com/${metadata.RESOURCEID}.xml" forEach="/content"
> dataSource="http" processor="XPathEntityProcessor" onError="continue">
>
>            <field column="FULLTEXT" xpath="/content"/>
>
>        </entity>
>
>        </entity>
>
>    </document>
>
> </dataConfig>
>
> Thanks,
> Nathan
>



-- 
--Noble Paul
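
As a rough illustration of the onError attribute discussed in this reply: in
Solr 1.4 it is set per entity and accepts abort (the default), skip, or
continue. The sketch below reuses the entity names and the ??? placeholders
from the quoted config. It puts skip on the inner entity purely to show an
alternative to continue; whether a 404 from the HTTP source is actually
handled this way would need to be checked against the running version.

```xml
<!-- Sketch only: needs a DIH build with onError support (Solr 1.4 per the reply above). -->
<entity name="metadata" dataSource="db" pk="RESOURCEID"
        query="select * from ????" onError="abort">
    <!-- onError="skip" drops the current document when this entity fails;
         onError="continue" would keep the document and carry on as if no error had occurred. -->
    <entity name="xmltext" dataSource="http" processor="XPathEntityProcessor"
            url="http://???.com/${metadata.RESOURCEID}.xml"
            forEach="/content" onError="skip">
        <field column="FULLTEXT" xpath="/content"/>
    </entity>
</entity>
```

With skip, a resource whose XML cannot be fetched would be left out of the
index entirely; with continue, it would presumably be indexed with only the
database fields.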

DIH handling of missing files

Posted by Nathan Adams <na...@umich.edu>.
I am constructing documents from a JDBC data source and an HTTP data source
(see the data-config file below).  My problem is that I cannot know if a
particular HTTP URL is available at index time, so I need DIH to
continue processing even if the HTTP location returns a 404.
onError="continue" does not appear to help in this case.  Should it?

<dataConfig>
    <dataSource type="JdbcDataSource" name="db"
                driver="oracle.jdbc.driver.OracleDriver"
                url="jdbc:oracle:thin:@?????"
                user="???" password="???"/>
    <dataSource type="HttpDataSource" name="http"/>

    <document name="resources">
        <entity name="metadata" dataSource="db" pk="RESOURCEID"
                query="select * from ????" onError="continue">
            <entity name="xmltext"
                    url="http://???.com/${metadata.RESOURCEID}.xml"
                    forEach="/content" dataSource="http"
                    processor="XPathEntityProcessor" onError="continue">
                <field column="FULLTEXT" xpath="/content"/>
            </entity>
        </entity>
    </document>
</dataConfig>

Thanks,
Nathan

Re: DataImport TXT file entity processor

Posted by Noble Paul നോബിള്‍ नोब्ळ् <no...@gmail.com>.
An EntityProcessor looks right to me. It may help us add more
attributes if needed.

PlainTextEntityProcessor looks like a good name. It can also be used
to read HTML etc.
--Noble

On Sat, Jan 24, 2009 at 12:37 PM, Shalin Shekhar Mangar
<sh...@gmail.com> wrote:
> On Sat, Jan 24, 2009 at 5:56 AM, Nathan Adams <na...@umich.edu> wrote:
>
>> Is there a way to use Data Import Handler to index non-XML (i.e. simple
>> text) files (either via HTTP or FileSystem)?  I need to put the entire
>> contents of a text file into a single field of a document and the other
>> fields are being pulled out of Oracle...
>
>
> Not yet. But I think it will be nice to have. Can you open an issue in Jira?
>
> I think importing from HTTP was something another user had asked for
> recently. How do you get the url/path of this text file? That would help
> decide if we need a Transformer or EntityProcessor for these tasks.
> --
> Regards,
> Shalin Shekhar Mangar.
>



-- 
--Noble Paul
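
A sketch of how the processor proposed above might be configured if it were
added. Everything specific here is an assumption for illustration only: the
processor name is just the one suggested in this thread, and the implicit
plainText column, the TEXT_URL column returned by the Oracle query, and the
FULLTEXT target field are hypothetical. The ??? placeholders are kept from the
original poster's config.

```xml
<dataConfig>
    <dataSource type="JdbcDataSource" name="db"
                driver="oracle.jdbc.driver.OracleDriver"
                url="jdbc:oracle:thin:@?????"
                user="???" password="???"/>
    <dataSource type="HttpDataSource" name="http"/>

    <document>
        <!-- Outer entity: one row per document, pulled from Oracle. -->
        <entity name="metadata" dataSource="db" query="select * from ????">
            <!-- Inner entity (hypothetical): reads the whole text file at the
                 URL held in the assumed TEXT_URL column and exposes it as one value. -->
            <entity name="rawtext" dataSource="http"
                    processor="PlainTextEntityProcessor"
                    url="${metadata.TEXT_URL}">
                <!-- Assumed implicit column "plainText" mapped to a single Solr field. -->
                <field column="plainText" name="FULLTEXT"/>
            </entity>
        </entity>
    </document>
</dataConfig>
```

Making this an EntityProcessor rather than a Transformer keeps the text source
as its own nested entity, so attributes such as url and dataSource can be set
on it independently of the database query.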

Re: DataImport TXT file entity processor

Posted by Shalin Shekhar Mangar <sh...@gmail.com>.
On Sat, Jan 24, 2009 at 5:56 AM, Nathan Adams <na...@umich.edu> wrote:

> Is there a way to use Data Import Handler to index non-XML (i.e. simple
> text) files (either via HTTP or FileSystem)?  I need to put the entire
> contents of a text file into a single field of a document and the other
> fields are being pulled out of Oracle...


Not yet. But I think it will be nice to have. Can you open an issue in Jira?

I think importing from HTTP was something another user had asked for
recently. How do you get the url/path of this text file? That would help
decide if we need a Transformer or EntityProcessor for these tasks.
-- 
Regards,
Shalin Shekhar Mangar.