Posted to solr-user@lucene.apache.org by "O. Klein" <kl...@octoweb.nl> on 2011/07/29 13:54:27 UTC

Combine XML data with DIH

I have a folder with XML files.

1.xml contains:
<id>http://www.site.com/1.html</id>
<link>http://www.othersite.com/2.html</link>
<content>bla1</content>

2.xml contains:
<id>http://www.othersite.com/2.html</id>
<content>bla2</content>

I want to create a document in Solr:

<id>http://www.site.com/1.html</id>
<content>bla2</content>

Can this be done with DIH? And how?

--
View this message in context: http://lucene.472066.n3.nabble.com/Combine-XML-data-with-DIH-tp3209413p3209413.html
Sent from the Solr - User mailing list archive at Nabble.com.

Re: Combine XML data with DIH

Posted by "O. Klein" <kl...@octoweb.nl>.
O. Klein wrote:
> 
> 
> O. Klein wrote:
>> 
>> I have a folder with XML files.
>> 
>> 1.xml contains:
>> <id>http://www.site.com/1.html</id>
>> <content>blacontent</content>
>> <title>blatitle</title>
>> 
>> 2.xml contains:
>> <id>http://www.site.com/1.html</id>
>> <title>blatitle2</title>
>> 
>> I want to create a document in Solr:
>> 
>> <id>http://www.site.com/1.html</id>
>> <content>blacontent</content>
>> <title>blatitle2</title>
>> 
>> 
> 
> I changed the problem in the quoted text above; it is now a little
> different and hopefully easier to solve.
> 
> Can this be done with DIH? And how?
> 

Hmm, I tried indexing all the docs and JOINing them on id. This didn't work,
as the result only shows the fields of the linked document.

Is there some way to show all the fields of the combined documents?


--
View this message in context: http://lucene.472066.n3.nabble.com/Combine-XML-data-with-DIH-tp3209413p3425844.html

Re: Combine XML data with DIH

Posted by "O. Klein" <kl...@octoweb.nl>.
O. Klein wrote:
> 
> I have a folder with XML files.
> 
> 1.xml contains:
> <id>http://www.site.com/1.html</id>
> <content>blacontent</content>
> 
> 2.xml contains:
> <id>http://www.site.com/1.html</id>
> <title>blatitle</title>
> 
> I want to create a document in Solr:
> 
> <id>http://www.site.com/1.html</id>
> <content>blacontent</content>
> <title>blatitle</title>
> 
> 

I changed the problem in the quoted text above; it is now a little different
and hopefully easier to solve.

Can this be done with DIH? And how?

--
View this message in context: http://lucene.472066.n3.nabble.com/Combine-XML-data-with-DIH-tp3209413p3423888.html

Re: Combine XML data with DIH

Posted by "O. Klein" <kl...@octoweb.nl>.
Yeah, but how do I combine the two based on the value in <link>?

--
View this message in context: http://lucene.472066.n3.nabble.com/Combine-XML-data-with-DIH-tp3209413p3209983.html

Re: Combine XML data with DIH

Posted by abhayd <aj...@hotmail.com>.
Hi,

I have never done this with XML files, but you can have multiple data sources
in the DIH config:

http://wiki.apache.org/solr/DataImportHandler#multipleds
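
For illustration, here is a minimal, untested sketch of what the multiple-data-source layout from that wiki section could look like with XML files. The data source names (ds-1, ds-2), the basePath values, and the file names are placeholders and are not taken from the setup in this thread. Note that this indexes two separate documents: it only shows how each entity selects a named data source, and it does not by itself merge fields across files.

```xml
<dataConfig>
  <!-- Two named file data sources; names and basePath values are hypothetical -->
  <dataSource name="ds-1" type="FileDataSource" basePath="/path/to/first" encoding="UTF-8" />
  <dataSource name="ds-2" type="FileDataSource" basePath="/path/to/second" encoding="UTF-8" />
  <document>
    <!-- Each entity picks its data source by name via the dataSource attribute -->
    <entity name="one" dataSource="ds-1" processor="XPathEntityProcessor"
            url="1.xml" forEach="/doc">
      <field column="id" xpath="/doc/id" />
      <field column="link" xpath="/doc/link" />
      <field column="content" xpath="/doc/content" />
    </entity>
    <entity name="two" dataSource="ds-2" processor="XPathEntityProcessor"
            url="2.xml" forEach="/doc">
      <field column="id" xpath="/doc/id" />
      <field column="content" xpath="/doc/content" />
    </entity>
  </document>
</dataConfig>
```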

abhay


--
View this message in context: http://lucene.472066.n3.nabble.com/Combine-XML-data-with-DIH-tp3209413p3209933.html

Re: Combine XML data with DIH

Posted by "O. Klein" <kl...@octoweb.nl>.
To make it easier, I have included an example config:

<dataConfig>
  <dataSource type="FileDataSource" />
  <document>
    <entity name="file" rootEntity="false" dataSource="null"
            processor="FileListEntityProcessor" fileName="^.*\.xml$" recursive="false"
            baseDir="/srv/www/servers/crawler/files">
      <entity name="crawl" pk="id" datasource="file"
              url="${file.fileAbsolutePath}" processor="XPathEntityProcessor"
              forEach="/doc" transformer="RegexTransformer">
        <field column="id" xpath="/doc/id" />
        <field column="link" xpath="/doc/link" />
        <field column="content" xpath="/doc/content" />
      </entity>
    </entity>
  </document>
</dataConfig>


O. Klein wrote:
> 
> I have a folder with XML files.
> 
> 1.xml contains:
> <id>http://www.site.com/1.html</id>
> <link>http://www.othersite.com/2.html</link>
> <content>bla1</content>
> 
> 2.xml contains:
> <id>http://www.othersite.com/2.html</id>
> <content>bla2</content>
> 
> I want to create a document in Solr:
> 
> <id>http://www.site.com/1.html</id>
> <content>bla2</content>
> 
> Can this be done with DIH? And how?
> 


--
View this message in context: http://lucene.472066.n3.nabble.com/Combine-XML-data-with-DIH-tp3209413p3209664.html