Posted to solr-user@lucene.apache.org by Ma...@t-systems.com on 2013/05/23 12:14:13 UTC

index multiple files into one index entity

Hello solr team,

I want to index multiple files into one Solr index entity with the same id. We are using Solr 4.1.


I tried it with the following source fragment:

    public void addContentSet(ContentSet contentSet) throws SearchProviderException {

                                ...

            ContentStreamUpdateRequest csur = generateCSURequest(contentSet.getIndexId(), contentSet);
            String indexId = contentSet.getIndexId();

            ConcurrentUpdateSolrServer server = serverPool.getUpdateServer(indexId);
            server.request(csur);

                                ...
    }

    private ContentStreamUpdateRequest generateCSURequest(String indexId, ContentSet contentSet)
            throws IOException {
        ContentStreamUpdateRequest csur = new ContentStreamUpdateRequest(confStore.getExtractUrl());

        ModifiableSolrParams parameters = csur.getParams();
        if (parameters == null) {
            parameters = new ModifiableSolrParams();
        }

        parameters.set("literalsOverride", "false");

        // maps the Tika default 'content' attribute to the attribute named 'fulltext'
        parameters.set("fmap.content", SearchSystemAttributeDef.FULLTEXT.getName());
        // add an empty content stream; this seems to be necessary for ContentStreamUpdateRequest
        csur.addContentStream(new ImaContentStream());

        for (Content content : contentSet.getContentList()) {
            csur.addContentStream(new ImaContentStream(content));
            // for each content stream add additional attributes
            parameters.add("literal." + SearchSystemAttributeDef.CONTENT_ID.getName(), content.getBinaryObjectId().toString());
            parameters.add("literal." + SearchSystemAttributeDef.CONTENT_KEY.getName(), content.getContentKey());
            parameters.add("literal." + SearchSystemAttributeDef.FILE_NAME.getName(), content.getContentName());
            parameters.add("literal." + SearchSystemAttributeDef.MIME_TYPE.getName(), content.getMimeType());
        }

        parameters.set("literal.id ", indexId);

        // adding some other attributes
        ...

        csur.setParams(parameters);

        return csur;
    }

While debugging I can see that the method 'server.request(csur)' reads the buffer of each ImaContentStream.
When I look at the Solr Catalina log I can see that the attached files reach the Solr servlet.

INFO: Releasing directory:/data/V-4-1/master0/data/index
Apr 25, 2013 5:48:07 AM org.apache.solr.update.processor.LogUpdateProcessor finish
INFO: [master0] webapp=/solr-4-1 path=/update/extract params={literal.searchconnectortest15_c8150e41_cc49_4a ...... &literal.id=26afa5dc-40ad-442a-ac79-0e7880c06aa1& .....
{add=[26afa5dc-40ad-442a-ac79-0e7880c06aa1 (1433265910940958720), 26afa5dc-40ad-442a-ac79-0e7880c06aa1 (1433265910971367424), 26afa5dc-40ad-442a-ac79-0e7880c06aa1 (1433265910976610304), 26afa5dc-40ad-442a-ac79-0e7880c06aa1 (1433265910983950336), 26afa5dc-40ad-442a-ac79-0e7880c06aa1 (1433265910989193216), 26afa5dc-40ad-442a-ac79-0e7880c06aa1 (1433265910995484672)]} 0 58


But only the last file in the content list is indexed.


My schema.xml has the following field definitions:

    <field name="id" type="string" indexed="true" stored="true" required="true" />
    <field name="content" type="text_general" indexed="false" stored="true" multiValued="true"/>

    <field name="contentkey" type="string" indexed="true" stored="true" multiValued="true"/>
    <field name="contentid" type="string" indexed="true" stored="true" multiValued="true"/>
    <field name="contentfilename " type="string" indexed="true" stored="true" multiValued="true"/>
    <field name="contentmimetype" type="string" indexed="true" stored="true" multiValued="true"/>

    <field name="fulltext" type="text_general" indexed="true" stored="true" multiValued="true"/>


I'm using the Tika-based ExtractingRequestHandler, which can extract content from binary files.



  <requestHandler name="/update/extract"
                  startup="lazy"
                  class="solr.extraction.ExtractingRequestHandler" >
    <lst name="defaults">
      <str name="lowernames">true</str>
      <str name="uprefix">ignored_</str>

      <!-- capture link hrefs but ignore div attributes -->
      <str name="captureAttr">true</str>
      <str name="fmap.a">links</str>
      <str name="fmap.div">ignored_</str>

    </lst>
  </requestHandler>

Is it possible to index multiple files with the same id?
Is it necessary to implement my own RequestHandler?

With best regards Mark




Re: index multiple files into one index entity

Posted by Yury Kats <yu...@yahoo.com>.
No, the implementation was very specific to my needs.

On 5/27/2013 8:28 AM, Alexandre Rafalovitch wrote:
> You did not open source it by any chance? :-)
> 
> Regards,
>    Alex.


Re: index multiple files into one index entity

Posted by Alexandre Rafalovitch <ar...@gmail.com>.
You did not open source it by any chance? :-)

Regards,
   Alex.
Personal blog: http://blog.outerthoughts.com/
LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch
- Time is the quality of nature that keeps events from happening all
at once. Lately, it doesn't seem to be working.  (Anonymous  - via GTD
book)


On Sun, May 26, 2013 at 8:23 PM, Yury Kats <yu...@yahoo.com> wrote:
> That's exactly what happens. Each stream goes into a separate document.
> If all streams share the same unique id parameter, the last stream
> will overwrite everything.
>
> I asked this same question last year, got no responses, and ended up
> writing my own UpdateRequestProcessor.
>
> See http://tinyurl.com/phhqsb4
>
> [snip]

Re: index multiple files into one index entity

Posted by Yury Kats <yu...@yahoo.com>.
That's exactly what happens. Each stream goes into a separate document.
If all streams share the same unique id parameter, the last stream
will overwrite everything.

I asked this same question last year, got no responses, and ended up
writing my own UpdateRequestProcessor.

See http://tinyurl.com/phhqsb4
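
For anyone with the same need, a minimal sketch of such a processor (not Yury's
actual implementation; it assumes the Solr 4.x update-processor API and a
uniqueKey field named "id"). It buffers the document that the extract handler
creates for each content stream and forwards one merged document when the
request finishes:

    import java.io.IOException;

    import org.apache.solr.common.SolrInputDocument;
    import org.apache.solr.common.SolrInputField;
    import org.apache.solr.request.SolrQueryRequest;
    import org.apache.solr.response.SolrQueryResponse;
    import org.apache.solr.update.AddUpdateCommand;
    import org.apache.solr.update.processor.UpdateRequestProcessor;
    import org.apache.solr.update.processor.UpdateRequestProcessorFactory;

    public class MergingUpdateProcessorFactory extends UpdateRequestProcessorFactory {
        @Override
        public UpdateRequestProcessor getInstance(SolrQueryRequest req,
                SolrQueryResponse rsp, UpdateRequestProcessor next) {
            return new MergingUpdateProcessor(req, next);
        }

        static class MergingUpdateProcessor extends UpdateRequestProcessor {
            private final SolrQueryRequest req;
            private SolrInputDocument merged; // accumulates fields across all streams

            MergingUpdateProcessor(SolrQueryRequest req, UpdateRequestProcessor next) {
                super(next);
                this.req = req;
            }

            @Override
            public void processAdd(AddUpdateCommand cmd) throws IOException {
                // called once per content stream; buffer instead of forwarding
                SolrInputDocument doc = cmd.getSolrInputDocument();
                if (merged == null) {
                    merged = doc; // the first stream becomes the base document
                } else {
                    for (SolrInputField field : doc) {
                        if ("id".equals(field.getName())) {
                            continue; // keep a single id value
                        }
                        for (Object value : field.getValues()) {
                            // append later streams' values to the multiValued fields
                            merged.addField(field.getName(), value);
                        }
                    }
                }
            }

            @Override
            public void finish() throws IOException {
                if (merged != null) {
                    // forward the single merged document once all streams are seen
                    AddUpdateCommand cmd = new AddUpdateCommand(req);
                    cmd.solrDoc = merged;
                    super.processAdd(cmd);
                }
                super.finish();
            }
        }
    }

The factory would be registered in an updateRequestProcessorChain in
solrconfig.xml and selected on /update/extract via the update.chain parameter.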

On 5/26/2013 11:15 AM, Alexandre Rafalovitch wrote:
> [snip]


Re: index multiple files into one index entity

Posted by Alexandre Rafalovitch <ar...@gmail.com>.
If I understand correctly, the issue is:
1) The client provides multiple content streams and expects Tika to
parse all of them and stick all the extracted content into one big
SolrDoc.
2) The Tika handler (looking at the load() method of
ExtractingDocumentLoader.java, GitHub link: http://bit.ly/12GsDl9 ) does
not actually expect that its load() method may be called multiple times,
and therefore happily submits a document at the end of each call. It
probably submits a new document for each content source, which means it
just overwrites the same doc over and over again.

If I am right, then we have a bug in the Tika handler's expectations (of
a single load() call). The next step would be to put together a very
simple use case and open a Jira issue with it.

Regards,
   Alex.
P.s. I am not a Solr code wrangler, so this MAY be completely wrong.

Personal blog: http://blog.outerthoughts.com/
LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch
- Time is the quality of nature that keeps events from happening all
at once. Lately, it doesn't seem to be working.  (Anonymous  - via GTD
book)


On Sun, May 26, 2013 at 10:46 AM, Erick Erickson
<er...@gmail.com> wrote:
> [snip]

Re: index multiple files into one index entity

Posted by Erick Erickson <er...@gmail.com>.
I'm still not quite getting the issue. Each separate request (i.e. each
addition of a SolrInputDocument) is treated as a separate document.
There's no notion of "append the contents of one doc to another based
on ID", unless you're doing atomic updates.

And Tika takes some care to index separate files as separate documents.

Now, if you don't need these to share the same uniqueKey, you might
index them as separate documents and include a field that lets you
associate these documents somehow (see the group/field collapsing wiki
page); a sketch of that variant follows.
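
A minimal sketch (the "contentsetid" field is hypothetical; group and
group.field are standard Solr result-grouping parameters):

    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.SolrServer;
    import org.apache.solr.client.solrj.response.QueryResponse;

    public class GroupedSearch {
        // collapses all per-file documents of one content set into a single group
        public static QueryResponse searchGrouped(SolrServer server, String term)
                throws Exception {
            SolrQuery query = new SolrQuery("fulltext:" + term);
            query.set("group", true);                 // enable field collapsing
            query.set("group.field", "contentsetid"); // hypothetical shared field
            return server.query(query);
        }
    }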

But otherwise, I think I need a higher-level view of what you're
trying to accomplish to make an intelligent comment.

Best
Erick

On Thu, May 23, 2013 at 9:05 AM,  <Ma...@t-systems.com> wrote:
> [snip]

AW: index multiple files into one index entity

Posted by Ma...@t-systems.com.
Hello Erick,
Thank you for your quick answer.

Maybe I didn't express my question clearly.
I want to index many files into one index entity. I would like the same behavior as any other multiValued field, which can be indexed under one unique id.
So I think every ContentStreamUpdateRequest represents one index entity, doesn't it? And with each addContentStream I would add one file to this entity.
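
A minimal sketch of one way to approximate that behavior with the stock
handler: call /update/extract once per file with extractOnly=true (extractOnly
and extractFormat are real ExtractingRequestHandler parameters), collect the
extracted text on the client, and index a single document whose multiValued
fields hold one value per file. The key under which the text comes back is
assumed here to be the content stream's name:

    import java.io.File;

    import org.apache.solr.client.solrj.SolrServer;
    import org.apache.solr.client.solrj.request.ContentStreamUpdateRequest;
    import org.apache.solr.common.SolrInputDocument;
    import org.apache.solr.common.util.NamedList;

    public class OneDocPerContentSet {
        // extracts each file separately, then indexes one document for the set
        public static void indexFilesAsOneDoc(SolrServer server, String indexId,
                File[] files) throws Exception {
            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", indexId);
            for (File file : files) {
                ContentStreamUpdateRequest extract =
                        new ContentStreamUpdateRequest("/update/extract");
                extract.setParam("extractOnly", "true");   // extract, don't index
                extract.setParam("extractFormat", "text"); // plain text, not XHTML
                extract.addFile(file, "application/octet-stream");
                NamedList<Object> result = server.request(extract);
                // assumption: the extracted text is returned under the stream's name
                doc.addField("fulltext", (String) result.get(file.getName()));
                doc.addField("contentfilename", file.getName());
            }
            server.add(doc); // one document; multiValued fields hold one value per file
        }
    }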

Thank you and best regards,
Mark




-----Original Message-----
From: Erick Erickson [mailto:erickerickson@gmail.com]
Sent: Thursday, 23 May 2013 14:11
To: solr-user@lucene.apache.org
Subject: Re: index multiple files into one index entity

I just skimmed your post, but I'm responding to the last bit.

If you have <uniqueKey> defined as "id" in schema.xml then no, you cannot have multiple documents with the same ID.
Whenever a new doc comes in it replaces the old doc with that ID.

You can remove the <uniqueKey> definition and do what you want, but there are very few Solr installations with no <uniqueKey>, and it's probably a better idea to make your ids truly unique.

Best
Erick

On Thu, May 23, 2013 at 6:14 AM,  <Ma...@t-systems.com> wrote:
> [snip]

Re: index multiple files into one index entity

Posted by Erick Erickson <er...@gmail.com>.
I just skimmed your post, but I'm responding to the last bit.

If you have <uniqueKey> defined as "id" in schema.xml then
no, you cannot have multiple documents with the same ID.
Whenever a new doc comes in it replaces the old doc with that ID.

You can remove the <uniqueKey> definition and do what you want,
but there are very few Solr installations with no <uniqueKey>, and
it's probably a better idea to make your ids truly unique.

Best
Erick
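
A minimal sketch of that advice (hypothetical names): make the uniqueKey
unique per file and carry the shared set id in a separate field so the files
can still be queried together:

    import org.apache.solr.client.solrj.SolrServer;
    import org.apache.solr.common.SolrInputDocument;

    public class PerFileDocuments {
        // one Solr document per file; the uniqueKey stays truly unique
        public static void addFileDocument(SolrServer server, String setId,
                String fileId, String fileName, String text) throws Exception {
            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", setId + "_" + fileId); // unique per file
            doc.addField("contentsetid", setId);      // hypothetical linking field
            doc.addField("contentfilename", fileName);
            doc.addField("fulltext", text);
            server.add(doc);
        }
    }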

On Thu, May 23, 2013 at 6:14 AM,  <Ma...@t-systems.com> wrote:
> [snip]