Posted to dev@manifoldcf.apache.org by Dileepa Jayakody <dj...@zaizi.com> on 2015/08/10 15:05:20 UTC

Indexing Solr documents with atomic updates using manifoldcf solr connector

Hi All,

We have a requirement to extract some meta-data from content documents and
index those meta-data as separate documents into a Solr index.
I'm writing a transformation connector where I construct a new repository
document adding the meta-data extracted by the connector and hand it over
to mcf-solr-connector to index in Solr.
Currently I face some difficulties with indexing these new documents in
Solr properly using solr-connector.

The new Solr document should contain atomic updates for certain fields. So
in my connector I create a JSON string representing the Solr atomic update
request and set it as the binary stream of the repository document. The
JSON string for the new Solr document is as follows:

String jsonString = "[{\"id\":\"http://dbpedia.org/resource/Africa\","
    + "\"label\":\"Africa\",\"documents\":{\"add\":\"sample2.txt\"}}]";


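Then I add an id and set the above jsonString as the binary input stream of
the repository document as follows:

repoDoc.addField( "id", idString );
// Needs java.io.ByteArrayInputStream and java.nio.charset.StandardCharsets.
// Encode once as UTF-8 so the stream and the declared length always agree.
byte[] jsonBytes = jsonString.getBytes( StandardCharsets.UTF_8 );
InputStream inputStream = new ByteArrayInputStream( jsonBytes );
repoDoc.setBinary( inputStream, jsonBytes.length );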
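
A minimal sketch of how the atomic-update JSON shown above could be built
with proper escaping instead of a hand-written literal. The class and
method names are hypothetical and not part of ManifoldCF or Solr, and the
small escaper only covers quotes, backslashes and control characters; a
JSON library would be the more robust choice.

```java
final class AtomicUpdateJson {
    /** Wraps s in double quotes and escapes characters that are illegal inside a JSON string. */
    static String quote(String s) {
        StringBuilder sb = new StringBuilder("\"");
        for (char c : s.toCharArray()) {
            switch (c) {
                case '"':  sb.append("\\\""); break;
                case '\\': sb.append("\\\\"); break;
                case '\n': sb.append("\\n"); break;
                case '\r': sb.append("\\r"); break;
                case '\t': sb.append("\\t"); break;
                default:
                    if (c < 0x20) sb.append(String.format("\\u%04x", (int) c));
                    else sb.append(c);
            }
        }
        return sb.append('"').toString();
    }

    /** Builds a one-document update with a plain "label" value and an "add" modifier on "documents". */
    static String build(String id, String label, String documentToAdd) {
        return "[{" + quote("id") + ":" + quote(id) + ","
             + quote("label") + ":" + quote(label) + ","
             + quote("documents") + ":{" + quote("add") + ":" + quote(documentToAdd) + "}}]";
    }

    public static void main(String[] args) {
        System.out.println(build("http://dbpedia.org/resource/Africa", "Africa", "sample2.txt"));
    }
}
```

The result of build(...) can then be encoded as UTF-8 and handed to
setBinary in the same way as the literal string.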

I expected the Solr connector to send a SolrInputDocument built from the
input stream I set on the repository document. Instead, it puts the JSON
string into the 'content' field of the Solr document and sends that to
Solr.

When I monitored the HTTP request from ManifoldCF to Solr, I saw the
following:

POST /solr/core1/update?wt=xml&version=2.2 HTTP/1.1
<add>
   <doc boost="1.0">
      <field name="id">http://dbpedia.org/resource/Africa</field>
      <field name="_root_">[{"id":"http://dbpedia.org/resource/Africa","label":"Africa","documents":{"add":"sample2.txt"}}]</field>
      <field name="lcf_metadata_id">http://dbpedia.org/resource/Africa</field>
   </doc>
</add>

Please note that the 'content' field configured in manifoldcf is *_root_*.

But the Solr update request I expected from the solr-connector is as
follows:
<add>
   <doc boost="1.0">
      <field name="id">http://dbpedia.org/resource/Africa</field>
      <field name="label">Africa</field>
      <field name="documents" update="add">sample2.txt</field>
      <field name="lcf_metadata_id">http://dbpedia.org/resource/Africa</field>
   </doc>
</add>

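To make the difference between the two requests concrete, here is a small
stand-alone sketch that renders the XML atomic-update form from field
values. It is not ManifoldCF or SolrJ code: the class name and the
convention of using a Map value {modifier -> value} to mark an atomic
update are assumptions chosen for illustration, and the boost attribute and
the lcf_metadata_id field are left out for brevity.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;

final class AtomicUpdateXml {
    /** Escapes the characters that are special in XML text and attribute values. */
    static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;")
                .replace(">", "&gt;").replace("\"", "&quot;");
    }

    /** A String value becomes a plain field; a Map value becomes a field with update="modifier". */
    static String render(Map<String, Object> fields) {
        StringBuilder sb = new StringBuilder("<add><doc>");
        for (Map.Entry<String, Object> e : fields.entrySet()) {
            String name = escape(e.getKey());
            if (e.getValue() instanceof Map) {
                for (Map.Entry<?, ?> m : ((Map<?, ?>) e.getValue()).entrySet()) {
                    sb.append("<field name=\"").append(name).append("\" update=\"")
                      .append(escape(String.valueOf(m.getKey()))).append("\">")
                      .append(escape(String.valueOf(m.getValue()))).append("</field>");
                }
            } else {
                sb.append("<field name=\"").append(name).append("\">")
                  .append(escape(String.valueOf(e.getValue()))).append("</field>");
            }
        }
        return sb.append("</doc></add>").toString();
    }

    public static void main(String[] args) {
        Map<String, Object> doc = new LinkedHashMap<>();
        doc.put("id", "http://dbpedia.org/resource/Africa");
        doc.put("label", "Africa");
        doc.put("documents", Collections.singletonMap("add", "sample2.txt"));
        System.out.println(render(doc));
    }
}
```

The point of the sketch is that the modifier has to reach Solr as structure
(an update attribute here), not as text inside a single field value.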

Can someone please give some advice on how to use solr atomic updates with
manifoldcf solr-connector? Have I missed some configurations/arguments?

Thanks,
Dileepa

-- 

------------------------------
This message should be regarded as confidential. If you have received this 
email in error please notify the sender and destroy it immediately. 
Statements of intent shall only become binding when confirmed in hard copy 
by an authorised signatory.

Zaizi Ltd is registered in England and Wales with the registration number 
6440931. The Registered Office is Brook House, 229 Shepherds Bush Road, 
London W6 7AN. 

Re: Indexing Solr documents with atomic updates using manifoldcf solr connector

Posted by Karl Wright <da...@gmail.com>.
That is definitely a back-door approach, in that your transformation
connector will work ONLY with the solr output connector, and may indeed
stop working if the details of the solr connector change in the future.
But I will take it to mean that any changes MCF makes to its APIs are not
urgent.

Thanks,
Karl


On Tue, Aug 11, 2015 at 3:14 AM, Dileepa Jayakody <dj...@zaizi.com>
wrote:

> Hi Karl,
>
> Thanks a lot for the detailed explanation.
>
> I was able to get my usecase working by configuring the solr-connector to
> create solrj ContentStreamUpdateRequest where it constructs the solr update
> request using the RepositoryDocumentStream.
>
> I just had to tick "Use the Extract Update Handler" option in
> solr-connector's schema configuration section and use /update/json handler
> to index the content in Solr.
>
> So I could create multiple child repository documents in my transformation
> connector, set the content of it to a JSON and use solrj
> ContentStreamUpdateRequest to send it to Solr and index the child documents
> separately using /update/json handler.
>
> Thank you very much for all the help.
>
> Regards,
> Dileepa
>
>

Re: Indexing Solr documents with atomic updates using manifoldcf solr connector

Posted by Dileepa Jayakody <dj...@zaizi.com>.
Hi Karl,

Thanks a lot for the detailed explanation.

I was able to get my use case working by configuring the solr-connector to
create a SolrJ ContentStreamUpdateRequest, which builds the Solr update
request from the RepositoryDocumentStream.

I just had to tick the "Use the Extract Update Handler" option in the
solr-connector's schema configuration section and use the /update/json
handler to index the content in Solr.

So I can create multiple child repository documents in my transformation
connector, set the content of each to a JSON string, and have the
solr-connector send it to Solr through a SolrJ ContentStreamUpdateRequest,
so that the child documents are indexed separately by the /update/json
handler.
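
For readers who want to see what this amounts to on the wire, here is a
rough stand-alone sketch that posts a JSON body to the /update/json handler
using only the JDK. It is not what the solr-connector does internally (that
goes through SolrJ); the class name is hypothetical, the core URL passed in
is an assumption, and commits and error handling are left out.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;

final class JsonUpdatePoster {
    /** Joins a core URL and a handler path without doubling or dropping the slash. */
    static String handlerUrl(String coreUrl, String handler) {
        String base = coreUrl.endsWith("/") ? coreUrl.substring(0, coreUrl.length() - 1) : coreUrl;
        return base + (handler.startsWith("/") ? handler : "/" + handler);
    }

    /** POSTs the JSON body unchanged to the /update/json handler and returns the HTTP status code. */
    static int post(String coreUrl, String json) throws IOException {
        byte[] body = json.getBytes(StandardCharsets.UTF_8);
        HttpURLConnection conn =
            (HttpURLConnection) new URL(handlerUrl(coreUrl, "/update/json")).openConnection();
        try {
            conn.setRequestMethod("POST");
            conn.setDoOutput(true);
            conn.setRequestProperty("Content-Type", "application/json; charset=UTF-8");
            conn.setFixedLengthStreamingMode(body.length);
            try (OutputStream out = conn.getOutputStream()) {
                out.write(body);
            }
            return conn.getResponseCode();
        } finally {
            conn.disconnect();
        }
    }
}
```

A call such as JsonUpdatePoster.post("http://localhost:8983/solr/core1",
jsonString) would send the atomic update document as the request body, so
Solr parses it as JSON rather than storing it as a field value. The host
and port in that call are placeholders.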

Thank you very much for all the help.

Regards,
Dileepa


On Tue, Aug 11, 2015 at 11:36 AM, Karl Wright <da...@gmail.com> wrote:

> Hi Dileepa,
>
> The only current way for ManifoldCF to track documents that are related in
> a parent-child relationship is using the "document component" mechanism.
> This is appropriate when a repository is structured so that a single
> document being processed results in multiple documents being indexed.
> Document components are determined and managed by the repository connector,
> NOT by a transformation connector or output connector.  Each
> RepositoryDocument still represents a single document, never multiple
> documents, and even if you could create document components in a
> transformation connector, they would be tracked individually in MCF and
> indexed completely independently in Solr.
>
> So, to do what you want sounds like it would require a different approach,
> specifically the extension of RepositoryDocument to handle multiple
> atomically-related logical documents at one time.  This is something that
> would require API changes.  However, if such a thing were attempted, the
> entire set of related documents represented by a single RepositoryDocument
> would all be indexed at one time, atomically, which sounds like it also
> might not be what you want.  It sounds to me like you are still trying to
> pursue your idea of indexing individual fields independently, is that
> correct?
>
> Karl
>
>

Re: Indexing Solr documents with atomic updates using manifoldcf solr connector

Posted by Karl Wright <da...@gmail.com>.
Hi Dileepa,

The only current way for ManifoldCF to track documents that are related in
a parent-child relationship is using the "document component" mechanism.
This is appropriate when a repository is structured so that a single
document being processed results in multiple documents being indexed.
Document components are determined and managed by the repository connector,
NOT by a transformation connector or output connector.  Each
RepositoryDocument still represents a single document, never multiple
documents, and even if you could create document components in a
transformation connector, they would be tracked individually in MCF and
indexed completely independently in Solr.

So doing what you want sounds like it would require a different approach,
specifically the extension of RepositoryDocument to handle multiple
atomically-related logical documents at one time.  This is something that
would require API changes.  However, if such a thing were attempted, the
entire set of related documents represented by a single RepositoryDocument
would all be indexed at one time, atomically, which sounds like it also
might not be what you want.  It sounds to me like you are still trying to
pursue your idea of indexing individual fields independently.  Is that
correct?

Karl


On Tue, Aug 11, 2015 at 12:44 AM, Dileepa Jayakody <dj...@zaizi.com>
wrote:

> Hi Karl,
>
> Thanks for your response. My requirement is indexing child documents
> constructed from the content repo.document as separate Solr documents. So
> adding meta-data fields to the original repository document wouldn't help
> my scenario AFAIU.
>
> My transformation connector is somewhat similar to the Stanbol
> transformation connector proposed in manifoldcf jira [1].
> What I referred as meta-data are the Named Entity Recognition data (NER)
> extracted from the repository document. So each content repository document
> will have multiple NER child documents. These NERs are expected to be
> indexed as separate Solr documents having a mapping to the parent content
> repository document which the NERs were extracted from.
> So apart from indexing the content repository document in Solr, I need to
> index all NER child documents with their attributes as separate documents
> in Solr.
>
> Above example is how I create a child repo document for NER. I set the
> entire NER document as the binary stream of the child repository document
> which is then sent to mcf-solr connector.
>
> In the mcf-solr connector (In HttpPoster class) when building the solr
> document from the repository document's input stream, it adds the
> inputStream String as a field to the content field of the Solr document
> configured by solr-connector as below;
>
> buildSorDocument(long length, InputStream is){
>
> if (contentAttributeName != null)
>       {
>         Reader r = new InputStreamReader(is, Consts.UTF_8);
>         StringBuilder sb = new StringBuilder((int)length);
>         char[] buffer = new char[65536];
>         while (true)
>         {
>           int amt = r.read(buffer,0,buffer.length);
>           if (amt == -1)
>             break;
>           sb.append(buffer,0,amt);
>         }
>
>         outputDoc.addField( contentAttributeName, sb.toString() );
>       }
> ....
> }
>
> Therefore the solr-connector sends the JSON update request I constructed in
> my connector as a field value of the  Solr document, not as the whole Solr
> document.
>
> Can you please give me some advice on how to index nested child documents
> in Solr using Manifold?
>
> Thanks,
> Dileepa
>
> [1] https://issues.apache.org/jira/browse/CONNECTORS-1181
>
> On Mon, Aug 10, 2015 at 6:47 PM, Karl Wright <da...@gmail.com> wrote:
>
> > Hi Dileepa,
> >
> > In order for ManifoldCF to index metadata, you need to set metadata field
> > values in the RepositoryDocument object, not send Solr JSON as the
> > document's content.  In fact from your example it looks like you want
> zero
> > content.
> >
> > Please read the RepositoryDocument java doc to see how you set metadata.
> >
> > Karl
> >
> >

Re: Indexing Solr documents with atomic updates using manifoldcf solr connector

Posted by Dileepa Jayakody <dj...@zaizi.com>.
Hi Karl,

Thanks for your response. My requirement is to index child documents
constructed from the content repository document as separate Solr documents.
So adding meta-data fields to the original repository document wouldn't help
in my scenario, as far as I understand.

My transformation connector is somewhat similar to the Stanbol
transformation connector proposed in the ManifoldCF JIRA [1].
What I referred to as meta-data is the Named Entity Recognition (NER) data
extracted from the repository document. So each content repository document
will have multiple NER child documents. These NERs are expected to be
indexed as separate Solr documents, each with a mapping to the parent content
repository document from which it was extracted.
So apart from indexing the content repository document in Solr, I need to
index all NER child documents, with their attributes, as separate documents
in Solr.

The example in my first message shows how I create a child repository
document for an NER. I set the entire NER document as the binary stream of
the child repository document, which is then sent to the mcf-solr connector.

In the mcf-solr connector (in the HttpPoster class), when the Solr document
is built from the repository document's input stream, the stream is read
into a String and added as the value of the content field configured in the
solr-connector, as below:

buildSolrDocument(long length, InputStream is){

  if (contentAttributeName != null)
  {
    // Read the entire input stream into a String.
    Reader r = new InputStreamReader(is, Consts.UTF_8);
    StringBuilder sb = new StringBuilder((int)length);
    char[] buffer = new char[65536];
    while (true)
    {
      int amt = r.read(buffer,0,buffer.length);
      if (amt == -1)
        break;
      sb.append(buffer,0,amt);
    }

    // The stream text is added as the value of the configured content field.
    outputDoc.addField( contentAttributeName, sb.toString() );
  }
  ....
}

Therefore the solr-connector sends the JSON update request I constructed in
my connector as a field value of the Solr document, not as the whole Solr
document.
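
The behaviour described above can be reproduced in isolation with a small
sketch. It uses neither ManifoldCF nor SolrJ: a plain Map stands in for the
Solr document, and the names ContentFieldSketch and buildDocument are
hypothetical, chosen only for this illustration. Under those assumptions, it
shows that a stream read this way ends up as one opaque field value and is
never parsed as an update request.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

public class ContentFieldSketch {

  // Hypothetical stand-in for the connector logic quoted above: the whole
  // stream is read into a String and stored under the content field name.
  static Map<String, Object> buildDocument(String id, String contentFieldName, InputStream is) {
    Map<String, Object> doc = new LinkedHashMap<>();
    doc.put("id", id);
    if (contentFieldName != null) {
      try {
        Reader r = new InputStreamReader(is, StandardCharsets.UTF_8);
        StringBuilder sb = new StringBuilder();
        char[] buffer = new char[65536];
        int amt;
        while ((amt = r.read(buffer, 0, buffer.length)) != -1) {
          sb.append(buffer, 0, amt);
        }
        // The JSON text is stored verbatim; nothing interprets it.
        doc.put(contentFieldName, sb.toString());
      } catch (IOException e) {
        throw new UncheckedIOException(e);
      }
    }
    return doc;
  }

  public static void main(String[] args) {
    String json = "[{\"id\":\"http://dbpedia.org/resource/Africa\","
        + "\"label\":\"Africa\",\"documents\":{\"add\":\"sample2.txt\"}}]";
    InputStream is = new ByteArrayInputStream(json.getBytes(StandardCharsets.UTF_8));
    Map<String, Object> doc =
        buildDocument("http://dbpedia.org/resource/Africa", "_root_", is);
    // Only "id" and the content field exist; "label" and "documents" do not.
    System.out.println(doc.keySet());
    System.out.println(doc.containsKey("label"));
  }
}
```

With the content field configured as _root_, the resulting document has only
the id and _root_ entries, which matches the request captured on the wire.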

Could you please give me some advice on how to index nested child documents
in Solr using ManifoldCF?

Thanks,
Dileepa

[1] https://issues.apache.org/jira/browse/CONNECTORS-1181

On Mon, Aug 10, 2015 at 6:47 PM, Karl Wright <da...@gmail.com> wrote:

> Hi Dileepa,
>
> In order for ManifoldCF to index metadata, you need to set metadata field
> values in the RepositoryDocument object, not send Solr JSON as the
> document's content.  In fact from your example it looks like you want zero
> content.
>
> Please read the RepositoryDocument java doc to see how you set metadata.
>
> Karl


Re: Indexing Solr documents with atomic updates using manifoldcf solr connector

Posted by Karl Wright <da...@gmail.com>.
Hi Dileepa,

In order for ManifoldCF to index metadata, you need to set metadata field
values in the RepositoryDocument object, not send Solr JSON as the
document's content.  In fact from your example it looks like you want zero
content.

Please read the RepositoryDocument Javadoc to see how to set metadata.

Karl
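
A minimal sketch of what setting metadata fields, rather than sending a JSON
body, might look like. RepoDocStandIn is a hypothetical stand-in for
RepositoryDocument, reduced to the addField and setBinary calls already used
earlier in this thread so that the example compiles without ManifoldCF on the
classpath; the real class has more methods, and its exact signatures and
exceptions should be checked against the Javadoc. The field names come from
the example in the original question.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.util.LinkedHashMap;
import java.util.Map;

public class MetadataFieldsSketch {

  // Hypothetical stand-in for ManifoldCF's RepositoryDocument (assumption:
  // only the two calls shown in this thread are modelled here).
  static class RepoDocStandIn {
    final Map<String, String> fields = new LinkedHashMap<>();
    InputStream binary;
    long binaryLength;

    void addField(String name, String value) {
      fields.put(name, value);
    }

    void setBinary(InputStream stream, long length) {
      binary = stream;
      binaryLength = length;
    }
  }

  // Each attribute of the entity becomes its own metadata field; the binary
  // content is left empty, as the reply suggests zero content is wanted.
  static RepoDocStandIn entityDocument(String id, String label) {
    RepoDocStandIn doc = new RepoDocStandIn();
    doc.addField("id", id);
    doc.addField("label", label);
    doc.setBinary(new ByteArrayInputStream(new byte[0]), 0L);
    return doc;
  }

  public static void main(String[] args) {
    RepoDocStandIn doc = entityDocument("http://dbpedia.org/resource/Africa", "Africa");
    System.out.println(doc.fields);
    System.out.println(doc.binaryLength);
  }
}
```

This covers plain field values only. It does not by itself produce the
update="add" attribute that the original question asks about.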

