Posted to solr-user@lucene.apache.org by Kerwin <ke...@gmail.com> on 2009/11/16 09:12:24 UTC

Indexing multiple documents in Solr/SolrCell

Hi,

I am new to this forum and would like to know whether the functionality
described below already exists in Solr. If it does not, is it a good idea,
and can I contribute it?

We need to index multiple documents in different formats, so we use Solr
with Tika (Solr Cell).

Question:
Can Solr index both metadata and content for multiple documents in a single
pass?
For example, I have an XML file with metadata and links to the documents'
content. The file lists many documents, and I would like to index them
all without sending a separate request for each one.

Example of XML
<add>
<doc>
<field name="id">34122</field>
<field name="author">Michael</field>
<field name="size">3MB</field>
<field name="URL">URL of the document</field>
</doc>
</add>
<doc2>.....</doc2>...</docN>

I need to index all these documents by sending this XML in a single request.
The collection of documents to be indexed could be on a file system.

I have altered the Solr code to do this, but is there an existing feature
that already covers it?
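
As an illustration of the input side of this request, here is a minimal sketch in plain Python (not Solr code) that reads an XML file shaped like the example above and collects the metadata and document link for each entry. It assumes the attribute values are quoted and that every <doc> sits inside one <add> root. The second document, the file:// links, and the names SAMPLE and parse_add_xml are hypothetical and exist only for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: same shape as the example XML, but well-formed
# (quoted attributes, all <doc> elements inside a single <add> root).
SAMPLE = """
<add>
  <doc>
    <field name="id">34122</field>
    <field name="author">Michael</field>
    <field name="size">3MB</field>
    <field name="URL">file:///data/docs/34122.pdf</field>
  </doc>
  <doc>
    <field name="id">34123</field>
    <field name="author">Anna</field>
    <field name="size">1MB</field>
    <field name="URL">file:///data/docs/34123.doc</field>
  </doc>
</add>
"""


def parse_add_xml(xml_text):
    """Return one dict per <doc>, mapping each field name to its text."""
    root = ET.fromstring(xml_text.strip())
    docs = []
    for doc in root.iter("doc"):
        fields = {
            field.get("name"): (field.text or "").strip()
            for field in doc.findall("field")
        }
        docs.append(fields)
    return docs


docs = parse_add_xml(SAMPLE)
for d in docs:
    # Each entry carries its metadata plus the link to the content.
    print(d["id"], d["URL"])
```

The dict keeps only the last value when a field name repeats, so a multi-valued field would need a list instead.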

Re: Indexing multiple documents in Solr/SolrCell

Posted by Sascha Szott <sz...@zib.de>.
Kerwin,

Kerwin wrote:
> Our approach is similar to what you have mentioned in the JIRA issue, except
> that we have all metadata in the XML and not in the database. I am therefore
> using a custom XmlUpdateRequestHandler to parse the XML and then calling
> Tika from within the XML loader to parse the content. So far this seems
> to work.
> When, and in which Solr version, do you expect the JIRA issue to be
> addressed?
That's a good question. Since I'm not a Solr committer, I cannot give 
any estimate on when it will be released (hopefully in Solr 1.5).

-Sascha



Re: Indexing multiple documents in Solr/SolrCell

Posted by Kerwin <ke...@gmail.com>.
Hi Sascha,

Thanks for your reply.
Our approach is similar to what you have mentioned in the JIRA issue, except
that we have all metadata in the XML and not in the database. I am therefore
using a custom XmlUpdateRequestHandler to parse the XML and then calling
Tika from within the XML loader to parse the content. So far this seems
to work.
When, and in which Solr version, do you expect the JIRA issue to be
addressed?
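
The two-step flow described here (parse the XML, then hand each linked document to an extractor) could be sketched roughly as follows. This is plain Python, not the custom XmlUpdateRequestHandler itself, and it does not call Tika: the extract_text argument is a stand-in callable that the caller supplies. The function enrich_docs, the field names "content" and "extraction_error", and the in-memory FAKE_STORE are all hypothetical and exist only so the sketch runs on its own.

```python
import xml.etree.ElementTree as ET


def enrich_docs(xml_text, extract_text, url_field="URL", content_field="content"):
    """Parse <add><doc>... XML and attach extracted content to each document.

    extract_text is a caller-supplied callable (a stand-in for Tika) that
    takes a URL string and returns the plain text of that document.
    """
    root = ET.fromstring(xml_text.strip())
    enriched = []
    for doc in root.iter("doc"):
        fields = {
            field.get("name"): (field.text or "").strip()
            for field in doc.findall("field")
        }
        url = fields.get(url_field)
        if url:
            try:
                fields[content_field] = extract_text(url)
            except Exception as exc:
                # Keep the metadata even when content extraction fails.
                fields["extraction_error"] = str(exc)
        enriched.append(fields)
    return enriched


# Hypothetical in-memory "file system" so the sketch runs without Tika.
FAKE_STORE = {"file:///data/a.pdf": "text of a"}


def fake_extract(url):
    return FAKE_STORE[url]  # raises KeyError for unknown URLs


XML = """<add>
<doc><field name="id">1</field><field name="URL">file:///data/a.pdf</field></doc>
<doc><field name="id">2</field><field name="URL">file:///data/missing.pdf</field></doc>
</add>"""

result = enrich_docs(XML, fake_extract)
for fields in result:
    print(fields["id"], "content" in fields)
```

Recording the failure on the document instead of aborting is one possible design choice: a single unreadable file does not stop the rest of the batch from being indexed.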


On Mon, Nov 16, 2009 at 5:02 PM, Sascha Szott <sz...@zib.de> wrote:

> Hi,
>
> the problem you've described -- an integration of DataImportHandler (to
> traverse the XML file and get the document URLs) and Solr Cell (to extract
> content afterwards) -- is already addressed in issue SOLR-1358
> (https://issues.apache.org/jira/browse/SOLR-1358).
>
> Best,
> Sascha
>

Re: Indexing multiple documents in Solr/SolrCell

Posted by Sascha Szott <sz...@zib.de>.
Hi,

the problem you've described -- an integration of DataImportHandler (to
traverse the XML file and get the document URLs) and Solr Cell (to
extract content afterwards) -- is already addressed in issue SOLR-1358
(https://issues.apache.org/jira/browse/SOLR-1358).

Best,
Sascha
