Posted to user@manifoldcf.apache.org by Anupam Bhattacharya <an...@gmail.com> on 2012/03/27 19:08:36 UTC

Running 2 jobs to update same document Index but different fields

I want to configure two jobs to index into Solr using ManifoldCF, via the
/update/extract request handler:
the 1st to synchronize only the XML files and the 2nd to synchronize the PDF files.
If both documents share a unique id, can I combine the indexes for
both in one Solr schema without overwriting the details added by the previous job?

suppose,
      xmldoc indexes field0(id), field1, field2, field3
&    pdfdoc indexes field0(id), field4, field5, field6.

Output docindex ==> (xml+pdf doc), field0(id), field1, field2, field3,
field4, field5, field6

Regards
Anupam
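
The question above assumes that two indexing requests with the same unique id would merge their fields. As the replies later in this thread explain, Lucene replaces a document rather than modifying it. A minimal sketch of that behaviour, using a plain Python dict as a stand-in for the index (the `ToyIndex` class and the field values are hypothetical and are not Solr or Lucene API):

```python
class ToyIndex:
    """Toy index keyed by a unique id; NOT the Solr or Lucene API."""

    def __init__(self):
        self.docs = {}

    def add(self, doc):
        # Re-adding an existing id replaces the whole stored document;
        # fields from the earlier version are not carried over.
        self.docs[doc["id"]] = dict(doc)


index = ToyIndex()
# "XML job": indexes field1..field3 under the shared id.
index.add({"id": "doc-1", "field1": "a", "field2": "b", "field3": "c"})
# "PDF job": indexes field4..field6 under the same id.
index.add({"id": "doc-1", "field4": "d", "field5": "e", "field6": "f"})

result = index.docs["doc-1"]
print(sorted(result))
```

After the second `add`, only the fields sent by the second job remain under `doc-1`, which matches the symptom reported later in the thread.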

Re: Running 2 jobs to update same document Index but different fields

Posted by Anupam Bhattacharya <an...@gmail.com>.

I ran a single job with the XML and PDF document types together, and again I
lost all the indexes when the job finished.

The REJECTED result was for one document, which might have confused you.
Fetching works for many documents with SUCCESS status.

While the job was running I could see the indexes for both PDF and XML
in the Solr admin. But the moment it finished, all the indexes were gone.

Start Time  Activity  Identifier  Result Code  Bytes  Time  Result Description
03-29-2012 14:41:29.053 document deletion (Solr_Test_QA)
http://example.domain.com:8088/webtop/component/drl?versio...
nLabel=CURRENT&objectId=09d905e78004f63f
200 0 1
03-29-2012 14:33:40.741 document ingest (Solr_Test_QA)
http://example.domain.com:8088/webtop/component/drl?versio...
nLabel=CURRENT&objectId=09d905e78004f63f
200 1491156
03-29-2012 14:33:38.758 fetch 09d905e78004f63f
Success 14911780

On Thu, Mar 29, 2012 at 2:01 PM, Karl Wright <da...@gmail.com> wrote:

>
> The "REJECTED" result is because the document has the wrong mime type or
> is too long, according to your length restriction.  Do you have just one
> job, or do you still have two?  If you have two jobs covering the same
> overall documents with different document criteria, this is the kind of
> thing that happens when you run one job after the other: the documents
> belonging to the first get deleted.  You will only need one job if you try the plan I
> was talking about, but it should include the PDFs as well as the XML
> documents.
>
> If you only have one job, then I can't explain it unless you changed the
> document criteria and ran the job a second time.
>
> Karl
>
>
>
>
> On Thu, Mar 29, 2012 at 3:39 AM, Anupam Bhattacharya <an...@gmail.com> wrote:
>
>> Okay. I tried to use the id which is formed by the ManifoldCF Documentum
>> connector. I ran the job and, while it was running, I could see from the Solr
>> admin screen that documents were getting indexed. But just after the end of
>> the job I see that all my created indexes get deleted.
>>
>> A snippet from the Simple History is given below.
>>
>> Why does this document deletion activity get added and delete all my created
>> indexes when I keep the unique id as "id" in the schema.xml file of Solr?
>>
>> Start Time  Activity  Identifier  Result Code  Bytes  Time  Result Description
>> 03-29-2012 13:00:26.837 document deletion (Solr_TEST_QA)
>> http://example.domain.com:8088/webtop/component/drl?versio...
>> nLabel=CURRENT&objectId=09d905e78000676d
>> 200 0 110
>> 03-29-2012 12:55:37.869 fetch 09d905e78000676d
>> REJECTED 86823 4184
>> 03-29-2012 12:55:34.934 document ingest (Solr_TEST_QA)
>> http://example.domain.com:8088/webtop/component/drl?versio...
>> nLabel=CURRENT&objectId=09d905e78000676d
>> 200 8158 235
>>
>> On Thu, Mar 29, 2012 at 12:41 AM, Karl Wright <da...@gmail.com> wrote:
>>
>>> "So do you find this design appropriate and feasible ?"  It sounds
>>> like you are still trying to merge records in Solr but this time using
>>> Solr Cell to somehow do this.  Since SolrCell is a pipeline, I don't
>>> think you will find it easy to keep data from one job aligned with
>>> data from another.  That's why I suggested just allowing both kinds of
>>> documents to be indexed as-is, and just making sure that you include a
>>> metadata reference to the main document in each.
>>>
>>> Karl
>>>
>>>
>>> On Wed, Mar 28, 2012 at 2:43 PM, Anupam Bhattacharya
>>> <an...@gmail.com> wrote:
>>> > The second option seems more useful, as it will allow me to add any
>>> > business logic.
>>> > So, similar to Solr Cell (/update/extract), my new RequestHandler will be
>>> > added in solrconfig.xml and will do all the manipulations.
>>> > Later, I need to get all field values into a temp variable by first
>>> > searching by id in the Lucene indexes, and then add these values to the
>>> > incoming new field values list.
>>> >
>>> > So do you find this design appropriate and feasible ?
>>> >
>>> > Anupam
>>> >
>>> > On Wed, Mar 28, 2012 at 11:46 PM, Karl Wright <da...@gmail.com> wrote:
>>> >>
>>> >> Thanks - now I understand what you are trying to do more clearly.
>>> >>
>>> >> The Documentum connector is going to pick up the XML document and the
>>> >> PDF document as separate entities.  Thus, they'd also be indexed in
>>> >> Solr separately.  So if we use that as a starting point, let's see
>>> >> where it might lead.
>>> >>
>>> >> First, you'd want each PDF document to have metadata that refers back
>>> >> to the XML parent document.  I'm not sure how easy it is to set up
>>> >> such a metadata reference in Documentum, but I vaguely recall there
>>> >> was indeed some such field.  So let's presume you can get that.  Then,
>>> >> you'd want to make sure your Solr schema included an "XML document"
>>> >> field, which had the URL of the parent XML document (or, for XML
>>> >> documents, the document's own URL) as content.  That would be the ID
>>> >> you'd use to present a result item to a user.
>>> >>
>>> >> Does this sound reasonable so far?
>>> >>
>>> >> The only other piece you might need is manipulation of either the
>>> >> PDF's metadata, or the XML document's metadata, or both.  For that,
>>> >> I'd use Solr Cell to perform whatever mappings and manipulations made
>>> >> sense before the documents actually get indexed.
>>> >>
>>> >> Karl
>>> >>
>>> >> On Wed, Mar 28, 2012 at 2:03 PM, Anupam Bhattacharya
>>> >> <an...@gmail.com> wrote:
>>> >> > I would have been happy if I only had to index PDF and XML separately.
>>> >> > But in my use case, the XML is the main document containing bibliographic
>>> >> > information (which needs to be presented as the search result), and it
>>> >> > contains a reference to a child/supporting document, which is the actual
>>> >> > PDF file. I need to index the PDF text, and if any search matches the PDF
>>> >> > content, then the parent XML bibliographic information needs to be presented.
>>> >> >
>>> >> > I am trying to call the Solr search engine with one single query to show
>>> >> > the corresponding XML detail for a search term present in the PDF. I checked
>>> >> > and found that the Solr Join plugin is introduced from Solr 4.x
>>> >> > (http://wiki.apache.org/solr/Join), but it works like an inner query.
>>> >> >
>>> >> > Again, the main requirement is that the PDF should be searchable, but only
>>> >> > its master details from the XML should be presented, to request the actual
>>> >> > PDF.
>>> >> >
>>> >> > -Anupam
>>> >> >
>>> >> > On Wed, Mar 28, 2012 at 11:06 PM, Karl Wright <da...@gmail.com>
>>> >> > wrote:
>>> >> >>
>>> >> >> This doesn't sound like a problem a connector can solve.  The problem
>>> >> >> sounds like severe misuse of Solr/Lucene to me.  You are using the
>>> >> >> wrong document key and Lucene does not let you modify a document index
>>> >> >> once it is created, and no matter what you do to ManifoldCF it can't
>>> >> >> get around that restriction.  So it sounds like you need to
>>> >> >> fundamentally rethink your design.
>>> >> >>
>>> >> >> If all you want to do is index XML and PDF as separate documents, just
>>> >> >> change your Solr output connection specification to change the
>>> >> >> selected "id" field appropriately.  Then, BOTH documents will be
>>> >> >> indexed by Solr, each with different metadata as you originally
>>> >> >> specified.  I'm frankly having a really hard time seeing why this is
>>> >> >> so hard.
>>> >> >>
>>> >> >> Karl
>>> >> >>
>>> >> >>
>>> >> >> On Wed, Mar 28, 2012 at 1:26 PM, Anupam Bhattacharya
>>> >> >> <an...@gmail.com> wrote:
>>> >> >> > Should I write a new Documentum connector for our specific use case
>>> >> >> > to go forward?
>>> >> >> > I guess your book will be helpful for understanding the connector
>>> >> >> > framework in ManifoldCF.
>>> >> >> >
>>> >> >> > On Wed, Mar 28, 2012 at 7:02 PM, Karl Wright <da...@gmail.com>
>>> >> >> > wrote:
>>> >> >> >>
>>> >> >> >> Right, LUCENE never did allow you to modify a document's indexes, only
>>> >> >> >> replace them.  What I'm trying to tell you is that there is no reason
>>> >> >> >> to have the same document ID for both documents.  ManifoldCF will
>>> >> >> >> support treating the XML document and PDF document as different
>>> >> >> >> documents in Solr.  But if you want them to in fact be the same
>>> >> >> >> document, just combined in some way, neither ManifoldCF nor Lucene
>>> >> >> >> will support that at this time.
>>> >> >> >>
>>> >> >> >> Karl
>>> >> >> >>
>>> >> >> >>
>>> >> >> >> On Wed, Mar 28, 2012 at 9:09 AM, Anupam Bhattacharya
>>> >> >> >> <an...@gmail.com> wrote:
>>> >> >> >> > I saw the index getting created by the 1st (PDF) indexing job, which
>>> >> >> >> > worked perfectly well for a particular id. Later, when I ran the 2nd
>>> >> >> >> > (XML) indexing job for the same id, I lost all the fields indexed by the
>>> >> >> >> > 1st job and was left with only the field indexes created by this 2nd job.
>>> >> >> >> >
>>> >> >> >> > I thought that it would combine field values for a specified doc id.
>>> >> >> >> >
>>> >> >> >> > The Lucene developers mention that, by design, Lucene doesn't support
>>> >> >> >> > this.
>>> >> >> >> >
>>> >> >> >> > Please see the following URL:
>>> >> >> >> > https://issues.apache.org/jira/browse/LUCENE-3837
>>> >> >> >> >
>>> >> >> >> > -Anupam
>>> >> >> >> >
>>> >> >> >> >
>>> >> >> >> > On Wed, Mar 28, 2012 at 6:15 PM, Karl Wright <da...@gmail.com>
>>> >> >> >> > wrote:
>>> >> >> >> >>
>>> >> >> >> >> The Solr handler that you are using should not matter here.
>>> >> >> >> >>
>>> >> >> >> >> Can you look at the Simple History report, and do the following:
>>> >> >> >> >>
>>> >> >> >> >> - Look for a document that is being indexed in both PDF and XML.
>>> >> >> >> >> - Find the "ingestion" activity for that document for both PDF and XML.
>>> >> >> >> >> - Compare the IDs (which for the ingestion activity are the URLs of
>>> >> >> >> >>   the documents in Webtop).
>>> >> >> >> >>
>>> >> >> >> >> If the URLs are in fact different, then you should be able to make
>>> >> >> >> >> this work.  You need to look at how you configured your Solr instance,
>>> >> >> >> >> and which fields you are specifying in your Solr output connection.
>>> >> >> >> >> You want those Webtop URLs to be indexed as the unique document
>>> >> >> >> >> identifier in Solr, not some other ID.
>>> >> >> >> >>
>>> >> >> >> >> Thanks,
>>> >> >> >> >> Karl
>>> >> >> >> >>
>>> >> >> >> >>
>>> >> >> >> >> On Wed, Mar 28, 2012 at 7:19 AM, Anupam Bhattacharya
>>> >> >> >> >> <an...@gmail.com> wrote:
>>> >> >> >> >> > Today I ran the 2 jobs one by one, but it seems that, since we are
>>> >> >> >> >> > using the /update/extract request handler, the field values for a
>>> >> >> >> >> > common id get overwritten by the latest job. I want to update certain
>>> >> >> >> >> > fields in the Lucene index for the doc rather than completely replace
>>> >> >> >> >> > it with new values and lose the other field value entries.
>>> >> >> >> >> >
>>> >> >> >> >> > On Tue, Mar 27, 2012 at 11:13 PM, Karl Wright
>>> >> >> >> >> > <da...@gmail.com>
>>> >> >> >> >> > wrote:
>>> >> >> >> >> >> For Documentum, content length is in bytes, I believe.  It does not
>>> >> >> >> >> >> set the length, it filters out all documents greater than the
>>> >> >> >> >> >> specified length.  Leaving the field blank will perform no filtering.
>>> >> >> >> >> >>
>>> >> >> >> >> >> Document types in Documentum are specified by mime type, so you'd want
>>> >> >> >> >> >> to select all that apply.  The actual one used will depend on how your
>>> >> >> >> >> >> particular instance of Documentum is configured, but if you pick them
>>> >> >> >> >> >> all you should have no problem.
>>> >> >> >> >> >>
>>> >> >> >> >> >> Karl
>>> >> >> >> >> >>
>>> >> >> >> >> >>
>>> >> >> >> >> >> On Tue, Mar 27, 2012 at 1:39 PM, Anupam Bhattacharya
>>> >> >> >> >> >> <an...@gmail.com> wrote:
>>> >> >> >> >> >>> Thanks!! It seems from your explanation that I can update other field
>>> >> >> >> >> >>> values of the same document. I inquired about this because I have two
>>> >> >> >> >> >>> different documents with a parent-child relationship which need to be
>>> >> >> >> >> >>> indexed as one document in the Lucene index.
>>> >> >> >> >> >>>
>>> >> >> >> >> >>> As you must have understood by now, I am trying to do this for
>>> >> >> >> >> >>> Documentum CMS. I have seen the configuration screen for setting the
>>> >> >> >> >> >>> content length and for filtering by document type. So my question is:
>>> >> >> >> >> >>> in what unit does the content length accept values (bits, bytes, KB,
>>> >> >> >> >> >>> MB, etc.), and does this configuration set the length for the
>>> >> >> >> >> >>> documents' full-text indexing?
>>> >> >> >> >> >>>
>>> >> >> >> >> >>> Additionally, to scan only one kind of document, e.g. PDF, what should
>>> >> >> >> >> >>> be added to filter those documents? Is it application/pdf or PDF?
>>> >> >> >> >> >>>
>>> >> >> >> >> >>> Regards
>>> >> >> >> >> >>> Anupam
>>> >> >> >> >> >>>
>>> >> >> >> >> >>>
>>> >> >> >> >> >>> On Tue, Mar 27, 2012 at 10:55 PM, Karl Wright
>>> >> >> >> >> >>> <da...@gmail.com>
>>> >> >> >> >> >>> wrote:
>>> >> >> >> >> >>>>
>>> >> >> >> >> >>>> The document key in Solr is the url of the document, as constructed by
>>> >> >> >> >> >>>> the connector you are using.  If you are using the same document to
>>> >> >> >> >> >>>> construct two different Solr documents, ManifoldCF by definition
>>> >> >> >> >> >>>> cannot be aware of this.  But if these are different files from the
>>> >> >> >> >> >>>> point of view of ManifoldCF they will have different URLs and be
>>> >> >> >> >> >>>> treated differently.  The jobs can overlap in this case with no
>>> >> >> >> >> >>>> difficulty.
>>> >> >> >> >> >>>>
>>> >> >> >> >> >>>> Karl
>>> >> >> >> >> >>>>
>>> >> >> >> >> >>>> On Tue, Mar 27, 2012 at 1:08 PM, Anupam Bhattacharya
>>> >> >> >> >> >>>> <an...@gmail.com> wrote:
>>> >> >> >> >> >>>> > I want to configure two jobs to index into Solr using ManifoldCF, via
>>> >> >> >> >> >>>> > the /update/extract request handler:
>>> >> >> >> >> >>>> > the 1st to synchronize only the XML files and the 2nd to synchronize
>>> >> >> >> >> >>>> > the PDF files.
>>> >> >> >> >> >>>> > If both documents share a unique id, can I combine the indexes for
>>> >> >> >> >> >>>> > both in one Solr schema without overwriting the details added by the
>>> >> >> >> >> >>>> > previous job?
>>> >> >> >> >> >>>> >
>>> >> >> >> >> >>>> > suppose,
>>> >> >> >> >> >>>> >       xmldoc indexes field0(id), field1, field2, field3
>>> >> >> >> >> >>>> > &    pdfdoc indexes field0(id), field4, field5, field6.
>>> >> >> >> >> >>>> >
>>> >> >> >> >> >>>> > Output docindex ==> (xml+pdf doc), field0(id), field1, field2, field3,
>>> >> >> >> >> >>>> > field4, field5, field6
>>> >> >> >> >> >>>> >
>>> >> >> >> >> >>>> > Regards
>>> >> >> >> >> >>>> > Anupam


-- 
Thanks & Regards
Anupam Bhattacharya

Re: Running 2 jobs to update same document Index but different fields

Posted by Karl Wright <da...@gmail.com>.

The "REJECTED" result is because the document has the wrong mime type or is
too long, according to your length restriction.  Do you have just one job,
or do you still have two?  If you have two jobs covering the same overall
documents with different document criteria, this is the kind of thing that
happens when you run one job after the other: the documents belonging to
the first get deleted.  You will only need one job if you try the plan I was talking
about, but it should include the PDFs as well as the XML documents.
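
As a rough illustration of that plan, the sketch below models an index in which every document, PDF or XML, carries a field holding the URL of its parent XML document (for XML documents, their own URL), so that a hit in PDF text can be presented as the parent record. It is a toy in-memory model, not Solr or ManifoldCF code; the field names `id`, `xml_document` and `text`, the example URLs, and the helper `parents_for` are all hypothetical.

```python
# Toy in-memory stand-in for the Solr index; field names and URLs are hypothetical.
DOCS = [
    # XML parent: its "xml_document" field holds its own URL.
    {"id": "http://example.invalid/xml/1",
     "xml_document": "http://example.invalid/xml/1",
     "text": "title author journal"},
    # PDF child: its "xml_document" field points back at the XML parent.
    {"id": "http://example.invalid/pdf/1",
     "xml_document": "http://example.invalid/xml/1",
     "text": "full text about crawling"},
    {"id": "http://example.invalid/xml/2",
     "xml_document": "http://example.invalid/xml/2",
     "text": "another title"},
]


def parents_for(term, docs=DOCS):
    """Return the parent XML URLs of all documents whose text contains term."""
    parents = []
    for doc in docs:
        if term in doc["text"].split():
            parent = doc["xml_document"]
            if parent not in parents:  # one result per parent, in first-hit order
                parents.append(parent)
    return parents


pdf_hit = parents_for("crawling")  # term occurs only in the PDF text
xml_hit = parents_for("title")     # term occurs in two XML records
```

Here a match inside the PDF resolves to the URL of its XML parent, and each document keeps its own unique id, so neither job overwrites the other's entry.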

If you only have one job, then I can't explain it unless you changed the
document criteria and ran the job a second time.
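
A simplified model of the single-job case described above, assuming that a job which runs again with narrower document criteria removes from the index whatever it indexed earlier but no longer accepts. This is only a toy sketch of that assumption, not ManifoldCF code; `run_job`, the document ids and the mime types are hypothetical.

```python
def run_job(index, previously_indexed, repository, accepts):
    """Toy crawl: ingest accepted documents, then delete the ones this job
    indexed on an earlier run but did not accept this time."""
    seen = set()
    for doc_id, mime_type in repository.items():
        if accepts(mime_type):
            index.add(doc_id)          # "document ingest"
            seen.add(doc_id)
    for doc_id in previously_indexed - seen:
        index.discard(doc_id)          # "document deletion"
    return seen


repository = {"doc-xml": "text/xml", "doc-pdf": "application/pdf"}
index = set()

# First run: the job accepts every document type.
state = run_job(index, set(), repository, lambda mime: True)
after_first_run = set(index)

# Second run: criteria changed to PDF only, so the XML entry is cleaned up.
state = run_job(index, state, repository, lambda mime: mime == "application/pdf")
after_second_run = set(index)
```

In this model the deletion shows up only at the end of the second run, after the ingests, which is the order seen in the Simple History snippet quoted below.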

Karl



On Thu, Mar 29, 2012 at 3:39 AM, Anupam Bhattacharya <an...@gmail.com> wrote:

> Okay. I tried to use the id which is formed by the ManifoldCF Documentum
> connector. I ran the job and, while it was running, I could see from the Solr
> admin screen that documents were getting indexed. But just after the end of
> the job I see that all my created indexes get deleted.
>
> A snippet from the Simple History is given below.
>
> Why does this document deletion activity get added and delete all my created
> indexes when I keep the unique id as "id" in the schema.xml file of Solr?
>
> Start Time  Activity  Identifier  Result Code  Bytes  Time  Result Description
> 03-29-2012 13:00:26.837 document deletion (Solr_TEST_QA)
> http://example.domain.com:8088/webtop/component/drl?versio...
> nLabel=CURRENT&objectId=09d905e78000676d
> 200 0 110
> 03-29-2012 12:55:37.869 fetch 09d905e78000676d
> REJECTED 86823 4184
> 03-29-2012 12:55:34.934 document ingest (Solr_TEST_QA)
> http://example.domain.com:8088/webtop/component/drl?versio...
> nLabel=CURRENT&objectId=09d905e78000676d
> 200 8158 235
>
> On Thu, Mar 29, 2012 at 12:41 AM, Karl Wright <da...@gmail.com> wrote:
>
>> "So do you find this design appropriate and feasible ?"  It sounds
>> like you are still trying to merge records in Solr but this time using
>> Solr Cell to somehow do this.  Since SolrCell is a pipeline, I don't
>> think you will find it easy to keep data from one job aligned with
>> data from another.  That's why I suggested just allowing both kinds of
>> documents to be indexed as-is, and just making sure that you include a
>> metadata reference to the main document in each.
>>
>> Karl
>>
>>
>> On Wed, Mar 28, 2012 at 2:43 PM, Anupam Bhattacharya
>> <an...@gmail.com> wrote:
>> > The second option seems to be more useful as it will allow me add to any
>> > business logic.
>> > So similar to SOLR Cell (/update/extract) my new RequestHandler will be
>> > added in solrconfig.xml which will do all the manipulations.
>> > Later, I need to get all field values into a temp variable by first
>> > searching by id in the lucene indexes and then add these values into the
>> > incoming new field values list.
>> >
>> > So do you find this design appropriate and feasible ?
>> >
>> > Anupam
>> >
>> > On Wed, Mar 28, 2012 at 11:46 PM, Karl Wright <da...@gmail.com>
>> wrote:
>> >>
>> >> Thanks - now I understand what you are trying to do more clearly.
>> >>
>> >> The Documentum connector is going to pick up the XML document and the
>> >> PDF document as separate entities.  Thus, they'd also be indexed in
>> >> Solr separately.  So if we use that as a starting point, let's see
>> >> where it might lead.
>> >>
>> >> First, you'd want each PDF document to have metadata that refers back
>> >> to the XML parent document.  I'm not sure how easy it is to set up
>> >> such a metadata reference in Documentum, but I vaguely recall there
>> >> was indeed some such field.  So let's presume you can get that.  Then,
>> >> you'd want to make sure your Solr schema included an "XML document"
>> >> field, which had the URL of the parent XML document (or, for XML
>> >> documents, the document's own URL) as content.  That would be the ID
>> >> you'd use to present a result item to a user.
>> >>
>> >> Does this sound reasonable so far?
>> >>
>> >> The only other piece you might need is manipulation of either the
>> >> PDF's metadata, or the XML document's metadata, or both.  For that,
>> >> I'd use Solr Cell to perform whatever mappings and manipulations made
>> >> sense before the documents actually get indexed.
>> >>
>> >> Karl
>> >>
>> >> On Wed, Mar 28, 2012 at 2:03 PM, Anupam Bhattacharya
>> >> <an...@gmail.com> wrote:
>> >> > I would have been happy if  I had to index PDF and XML separately.
>> >> > But for my use-case. XML is the main document containing
>> bibliographic
>> >> > information (which needs to presented as search result) and consists
>> a
>> >> > reference to a child/supporting document which is a actual PDF file.
>> I
>> >> > need
>> >> > to index the PDF text and if any search matches with the PDF content
>> >> > then
>> >> > the parent/XML bibliographic information needs to be presented.
>> >> >
>> >> > I am trying to call the SOLR search engine with one single query to
>> show
>> >> > corresponding XML detail for a search term present in the PDF. I
>> checked
>> >> > that from SOLR 4.x version SOLR-Join Plugin is introduced.
>> >> > (http://wiki.apache.org/solr/Join) but work like inner query.
>> >> >
>> >> > Again the main requirement is that the PDF should be searchable but
>> it
>> >> > master details from XML should only be presented to request the
>> actual
>> >> > PDF.
>> >> >
>> >> > -Anupam
>> >> >
>> >> > On Wed, Mar 28, 2012 at 11:06 PM, Karl Wright <da...@gmail.com>
>> >> > wrote:
>> >> >>
>> >> >> This doesn't sound like a problem a connector can solve.  The
>> problem
>> >> >> sounds like severe misuse of Solr/Lucene to me.  You are using the
>> >> >> wrong document key and Lucene does not let you modify a document
>> index
>> >> >> once it is created, and no matter what you do to ManifoldCF it can't
>> >> >> get around that restriction.  So it sounds like you need to
>> >> >> fundamentally rethink your design.
>> >> >>
>> >> >> If all you want to do is index XML and PDF as separate documents,
>> just
>> >> >> change your Solr output connection specification to change the
>> >> >> selected "id" field appropriately.  Then, BOTH documents will be
>> >> >> indexed by Solr, each with different metadata as you originally
>> >> >> specified.  I'm frankly having a really hard time seeing why this is
>> >> >> so hard.
>> >> >>
>> >> >> Karl
>> >> >>
>> >> >>
>> >> >> On Wed, Mar 28, 2012 at 1:26 PM, Anupam Bhattacharya
>> >> >> <an...@gmail.com> wrote:
>> >> >> > Should I write a new Documentum Connector with our specific
>> use-case
>> >> >> > to
>> >> >> > go
>> >> >> > forward ?
>> >> >> > I guess your book will be helpful to understand connector
>> framework
>> >> >> > in
>> >> >> > manifoldcf.
>> >> >> >
>> >> >> > On Wed, Mar 28, 2012 at 7:02 PM, Karl Wright <da...@gmail.com>
>> >> >> > wrote:
>> >> >> >>
>> >> >> >> Right, LUCENE never did allow you to modify a document's indexes,
>> >> >> >> only
>> >> >> >> replace them.  What I'm trying to tell you is that there is no
>> >> >> >> reason
>> >> >> >> to have the same document ID for both documents.  ManifoldCF will
>> >> >> >> support treating the XML document and PDF document as different
>> >> >> >> documents in Solr.  But if you want them to in fact be the same
>> >> >> >> document, just combined in some way, neither ManifoldCF nor
>> Lucene
>> >> >> >> will support that at this time.
>> >> >> >>
>> >> >> >> Karl
>> >> >> >>
>> >> >> >>
>> >> >> >> On Wed, Mar 28, 2012 at 9:09 AM, Anupam Bhattacharya
>> >> >> >> <an...@gmail.com> wrote:
>> >> >> >> > I saw that the index getting created by 1st PDF indexing job
>> which
>> >> >> >> > worked
>> >> >> >> > perfectly well for a particular id. Later when i ran the 2nd
>> XML
>> >> >> >> > indexing
>> >> >> >> > Job for the same id. I lost all field indexed by the 1st job
>> and i
>> >> >> >> > was
>> >> >> >> > left
>> >> >> >> > out with field indexes created my this 2nd job.
>> >> >> >> >
>> >> >> >> > I thought that it would combine field values for a specified
>> doc
>> >> >> >> > id.
>> >> >> >> >
>> >> >> >> > As per Lucene developers they mention that by design Lucene
>> >> >> >> > doesn't
>> >> >> >> > support
>> >> >> >> > this.
>> >> >> >> >
>> >> >> >> > Pls. see following url ::
>> >> >> >> > https://issues.apache.org/jira/browse/LUCENE-3837
>> >> >> >> >
>> >> >> >> > -Anupam
>> >> >> >> >
>> >> >> >> >
>> >> >> >> > On Wed, Mar 28, 2012 at 6:15 PM, Karl Wright <
>> daddywri@gmail.com>
>> >> >> >> > wrote:
>> >> >> >> >>
>> >> >> >> >> The Solr handler that you are using should not matter here.
>> >> >> >> >>
>> >> >> >> >> Can you look at the Simple History report, and do the
>> following:
>> >> >> >> >>
>> >> >> >> >> - Look for a document that is being indexed in both PDF and
>> XML.
>> >> >> >> >> - Find the "ingestion" activity for that document for both PDF
>> >> >> >> >> and
>> >> >> >> >> XML
>> >> >> >> >> - Compare the ID's (which for the ingestion activity are the
>> >> >> >> >> URL's
>> >> >> >> >> of
>> >> >> >> >> the documents in Webtop)
>> >> >> >> >>
>> >> >> >> >> If the URLs are in fact different, then you should be able to
>> >> >> >> >> make
>> >> >> >> >> this work.  You need to look at how you configured your Solr
>> >> >> >> >> instance,
>> >> >> >> >> and which fields you are specifying in your Solr output
>> >> >> >> >> connection.
>> >> >> >> >> You want those Webtop urls to be indexed as the unique
>> document
>> >> >> >> >> identifier in Solr, not some other ID.
>> >> >> >> >>
>> >> >> >> >> Thanks,
>> >> >> >> >> Karl
>> >> >> >> >>
>> >> >> >> >>
>> >> >> >> >> On Wed, Mar 28, 2012 at 7:19 AM, Anupam Bhattacharya
>> >> >> >> >> <an...@gmail.com> wrote:
>> >> >> >> >> > Today I ran 2 job one by one but it seems since we are using
>> >> >> >> >> > /update/extract Request Handler the field values for common
>> id
>> >> >> >> >> > gets
>> >> >> >> >> > overridden by the latest job. I want to update certain
>> field in
>> >> >> >> >> > the
>> >> >> >> >> > lucene indexes for the doc rather than completely update
>> with
>> >> >> >> >> > new
>> >> >> >> >> > values and by loosing other field value entries.
>> >> >> >> >> >
>> >> >> >> >> > On Tue, Mar 27, 2012 at 11:13 PM, Karl Wright
>> >> >> >> >> > <da...@gmail.com>
>> >> >> >> >> > wrote:
>> >> >> >> >> >> For Documentum, content length is in bytes, I believe.  It
>> >> >> >> >> >> does
>> >> >> >> >> >> not
>> >> >> >> >> >> set the length, it filters out all documents greater than
>> the
>> >> >> >> >> >> specified length.  Leaving the field blank will perform no
>> >> >> >> >> >> filtering.
>> >> >> >> >> >>
>> >> >> >> >> >> Document types in Documentum are specified by mime type, so
>> >> >> >> >> >> you'd
>> >> >> >> >> >> want
>> >> >> >> >> >> to select all that apply.  The actual one used will depend
>> on
>> >> >> >> >> >> how
>> >> >> >> >> >> your
>> >> >> >> >> >> particular instance of Documentum is configured, but if you
>> >> >> >> >> >> pick
>> >> >> >> >> >> them
>> >> >> >> >> >> all you should have no problem.
>> >> >> >> >> >>
>> >> >> >> >> >> Karl
>> >> >> >> >> >>

Re: Running 2 jobs to update same document Index but different fields

Posted by Anupam Bhattacharya <an...@gmail.com>.

Okay. I tried to use the id that is formed by the ManifoldCF Documentum
connector. While the job was running, I could see from the Solr admin screen
that documents were getting indexed. But right after the job finished, all
the index entries I had created were deleted.

Snippet from Simple History is given below.

Why does this document deletion activity get added and delete all the index
entries I created when I keep the unique id as "id" in Solr's schema.xml file?

Start Time | Activity | Identifier | Result Code | Bytes | Time | Result Description
03-29-2012 13:00:26.837 | document deletion (Solr_TEST_QA) | http://example.domain.com:8088/webtop/component/drl?versio...nLabel=CURRENT&objectId=09d905e78000676d | 200 | 0 | 110 |
03-29-2012 12:55:37.869 | fetch | 09d905e78000676d | REJECTED | 86823 | 4184 |
03-29-2012 12:55:34.934 | document ingest (Solr_TEST_QA) | http://example.domain.com:8088/webtop/component/drl?versio...nLabel=CURRENT&objectId=09d905e78000676d | 200 | 8158 | 235 |

On Thu, Mar 29, 2012 at 12:41 AM, Karl Wright <da...@gmail.com> wrote:

> "So do you find this design appropriate and feasible ?"  It sounds
> like you are still trying to merge records in Solr but this time using
> Solr Cell to somehow do this.  Since SolrCell is a pipeline, I don't
> think you will find it easy to keep data from one job aligned with
> data from another.  That's why I suggested just allowing both kinds of
> documents to be indexed as-is, and just making sure that you include a
> metadata reference to the main document in each.
>
> Karl



-- 
Thanks & Regards
Anupam Bhattacharya

Re: Running 2 jobs to update same document Index but different fields

Posted by Karl Wright <da...@gmail.com>.

"So do you find this design appropriate and feasible ?"  It sounds
like you are still trying to merge records in Solr but this time using
Solr Cell to somehow do this.  Since SolrCell is a pipeline, I don't
think you will find it easy to keep data from one job aligned with
data from another.  That's why I suggested just allowing both kinds of
documents to be indexed as-is, and just making sure that you include a
metadata reference to the main document in each.
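
As a rough sketch of what "index both kinds of documents as-is, with a
metadata reference to the main document" could look like, here is a toy
in-memory model in plain Python. No Solr or ManifoldCF is involved; the field
names parent_url, title and text and the example.invalid URLs are made up
purely for illustration.

```python
# Toy in-memory stand-in for an index keyed by document URL.
# All field names and URLs here are hypothetical.
index = {}

def add_document(url, fields, parent_url=None):
    """Index one document as-is; an XML record simply points at itself."""
    doc = dict(fields)
    doc["parent_url"] = parent_url or url
    index[url] = doc

def search_parents(term):
    """Match any document, but present each parent XML record only once."""
    seen = []
    for url, doc in index.items():
        if term.lower() in doc.get("text", "").lower():
            parent = doc["parent_url"]
            if parent not in seen:
                seen.append(parent)
    return [(p, index.get(p, {}).get("title")) for p in seen]

add_document("http://example.invalid/xml/1",
             {"title": "Bibliographic record 1", "text": "author, journal, year"})
add_document("http://example.invalid/pdf/1",
             {"text": "full text mentioning turbines"},
             parent_url="http://example.invalid/xml/1")

pdf_hit = search_parents("turbines")   # term occurs only in the PDF text
xml_hit = search_parents("journal")    # term occurs only in the XML record
print(pdf_hit)
print(xml_hit)
```

Both searches come back with the same parent record, which is the behaviour
being asked for: the PDF text is searchable, but the XML details are what get
presented.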

Karl


On Wed, Mar 28, 2012 at 2:43 PM, Anupam Bhattacharya
<an...@gmail.com> wrote:
> The second option seems more useful, as it will allow me to add any
> business logic.
> So, similar to Solr Cell (/update/extract), my new RequestHandler will be
> added in solrconfig.xml and will do all the manipulations.
> Then I need to get all field values into a temp variable by first
> searching by id in the Lucene index, and add these values to the
> incoming list of new field values.
>
> So do you find this design appropriate and feasible ?
>
> Anupam
>
> On Wed, Mar 28, 2012 at 11:46 PM, Karl Wright <da...@gmail.com> wrote:
>>
>> Thanks - now I understand what you are trying to do more clearly.
>>
>> The Documentum connector is going to pick up the XML document and the
>> PDF document as separate entities.  Thus, they'd also be indexed in
>> Solr separately.  So if we use that as a starting point, let's see
>> where it might lead.
>>
>> First, you'd want each PDF document to have metadata that refers back
>> to the XML parent document.  I'm not sure how easy it is to set up
>> such a metadata reference in Documentum, but I vaguely recall there
>> was indeed some such field.  So let's presume you can get that.  Then,
>> you'd want to make sure your Solr schema included an "XML document"
>> field, which had the URL of the parent XML document (or, for XML
>> documents, the document's own URL) as content.  That would be the ID
>> you'd use to present a result item to a user.
>>
>> Does this sound reasonable so far?
>>
>> The only other piece you might need is manipulation of either the
>> PDF's metadata, or the XML document's metadata, or both.  For that,
>> I'd use Solr Cell to perform whatever mappings and manipulations made
>> sense before the documents actually get indexed.
>>
>> Karl
>>
>> On Wed, Mar 28, 2012 at 2:03 PM, Anupam Bhattacharya
>> <an...@gmail.com> wrote:
>> > I would have been happy if I only had to index PDF and XML separately.
>> > But in my use case, the XML is the main document containing bibliographic
>> > information (which needs to be presented as the search result), and it
>> > contains a reference to a child/supporting document, which is the actual
>> > PDF file. I need to index the PDF text, and if a search matches the PDF
>> > content, then the parent XML bibliographic information needs to be
>> > presented.
>> >
>> > I am trying to call the Solr search engine with one single query to show
>> > the corresponding XML details for a search term present in the PDF. I saw
>> > that a Solr Join plugin is introduced from Solr 4.x
>> > (http://wiki.apache.org/solr/Join), but it works like an inner query.
>> >
>> > Again, the main requirement is that the PDF should be searchable, but only
>> > its master details from the XML should be presented, from which the actual
>> > PDF can be requested.
>> >
>> > -Anupam
>> >
>> > On Wed, Mar 28, 2012 at 11:06 PM, Karl Wright <da...@gmail.com> wrote:
>> >>
>> >> This doesn't sound like a problem a connector can solve.  The problem
>> >> sounds like severe misuse of Solr/Lucene to me.  You are using the
>> >> wrong document key and Lucene does not let you modify a document index
>> >> once it is created, and no matter what you do to ManifoldCF it can't
>> >> get around that restriction.  So it sounds like you need to
>> >> fundamentally rethink your design.
>> >>
>> >> If all you want to do is index XML and PDF as separate documents, just
>> >> change your Solr output connection specification to change the
>> >> selected "id" field appropriately.  Then, BOTH documents will be
>> >> indexed by Solr, each with different metadata as you originally
>> >> specified.  I'm frankly having a really hard time seeing why this is
>> >> so hard.
>> >>
>> >> Karl
>> >>
>> >>
>> >> On Wed, Mar 28, 2012 at 1:26 PM, Anupam Bhattacharya
>> >> <an...@gmail.com> wrote:
>> >> > Should I write a new Documentum connector for our specific use case
>> >> > to go forward?
>> >> > I guess your book will be helpful for understanding the connector
>> >> > framework in ManifoldCF.
>> >> >
>> >> > On Wed, Mar 28, 2012 at 7:02 PM, Karl Wright <da...@gmail.com> wrote:
>> >> >>
>> >> >> Right, LUCENE never did allow you to modify a document's indexes,
>> >> >> only replace them.  What I'm trying to tell you is that there is no
>> >> >> reason to have the same document ID for both documents.  ManifoldCF
>> >> >> will support treating the XML document and PDF document as different
>> >> >> documents in Solr.  But if you want them to in fact be the same
>> >> >> document, just combined in some way, neither ManifoldCF nor Lucene
>> >> >> will support that at this time.
>> >> >>
>> >> >> Karl
>> >> >>
>> >> >>
>> >> >> On Wed, Mar 28, 2012 at 9:09 AM, Anupam Bhattacharya
>> >> >> <an...@gmail.com> wrote:
>> >> >> > I saw the index getting created by the 1st (PDF) indexing job, which
>> >> >> > worked perfectly well for a particular id. Later, when I ran the 2nd
>> >> >> > (XML) indexing job for the same id, I lost all fields indexed by the
>> >> >> > 1st job and was left with only the fields created by the 2nd job.
>> >> >> >
>> >> >> > I thought that it would combine field values for a specified doc id.
>> >> >> >
>> >> >> > The Lucene developers mention that, by design, Lucene doesn't
>> >> >> > support this.
>> >> >> >
>> >> >> > Please see the following URL:
>> >> >> > https://issues.apache.org/jira/browse/LUCENE-3837
>> >> >> >
>> >> >> > -Anupam
>> >> >> >
>> >> >> >
>> >> >> > On Wed, Mar 28, 2012 at 6:15 PM, Karl Wright <da...@gmail.com> wrote:
>> >> >> >>
>> >> >> >> The Solr handler that you are using should not matter here.
>> >> >> >>
>> >> >> >> Can you look at the Simple History report, and do the following:
>> >> >> >>
>> >> >> >> - Look for a document that is being indexed in both PDF and XML.
>> >> >> >> - Find the "ingestion" activity for that document for both PDF and XML
>> >> >> >> - Compare the ID's (which for the ingestion activity are the URL's of
>> >> >> >>   the documents in Webtop)
>> >> >> >>
>> >> >> >> If the URLs are in fact different, then you should be able to make
>> >> >> >> this work.  You need to look at how you configured your Solr instance,
>> >> >> >> and which fields you are specifying in your Solr output connection.
>> >> >> >> You want those Webtop urls to be indexed as the unique document
>> >> >> >> identifier in Solr, not some other ID.
>> >> >> >>
>> >> >> >> Thanks,
>> >> >> >> Karl
>> >> >> >>
>> >> >> >>
>> >> >> >> On Wed, Mar 28, 2012 at 7:19 AM, Anupam Bhattacharya
>> >> >> >> <an...@gmail.com> wrote:
>> >> >> >> > Today I ran the 2 jobs one after the other, but it seems that, since
>> >> >> >> > we are using the /update/extract request handler, the field values
>> >> >> >> > for a common id get overwritten by the latest job. I want to update
>> >> >> >> > certain fields in the Lucene index for the doc, rather than replace
>> >> >> >> > the whole doc with new values and lose the other field values.
>> >> >> >> >
>> >> >> >> > On Tue, Mar 27, 2012 at 11:13 PM, Karl Wright <da...@gmail.com> wrote:
>> >> >> >> >> For Documentum, content length is in bytes, I believe.  It does not
>> >> >> >> >> set the length, it filters out all documents greater than the
>> >> >> >> >> specified length.  Leaving the field blank will perform no
>> >> >> >> >> filtering.
>> >> >> >> >>
>> >> >> >> >> Document types in Documentum are specified by mime type, so you'd
>> >> >> >> >> want to select all that apply.  The actual one used will depend on
>> >> >> >> >> how your particular instance of Documentum is configured, but if you
>> >> >> >> >> pick them all you should have no problem.
>> >> >> >> >>
>> >> >> >> >> Karl
>> >> >> >> >>
>> >> >> >> >>
>> >> >> >> >> On Tue, Mar 27, 2012 at 1:39 PM, Anupam Bhattacharya
>> >> >> >> >> <an...@gmail.com> wrote:
>> >> >> >> >>> Thanks!! It seems from your explanation that I can update other
>> >> >> >> >>> field values of the same document. I asked about this because I
>> >> >> >> >>> have two different documents with a parent-child relationship
>> >> >> >> >>> which need to be indexed as one document in the Lucene index.
>> >> >> >> >>>
>> >> >> >> >>> As you must have understood by now, I am trying to do this for
>> >> >> >> >>> Documentum CMS. I have seen the configuration screens for setting
>> >> >> >> >>> the content length and for filtering by document type. So my
>> >> >> >> >>> question is: in what unit does the content length accept values
>> >> >> >> >>> (bits, bytes, KB, MB, etc.), and does this setting set the length
>> >> >> >> >>> of the documents' full-text indexing?
>> >> >> >> >>>
>> >> >> >> >>> Additionally, to scan only one kind of document, e.g. PDF, what
>> >> >> >> >>> should be added to filter those documents? Is it application/pdf
>> >> >> >> >>> or PDF?
>> >> >> >> >>>
>> >> >> >> >>> Regards
>> >> >> >> >>> Anupam
>> >> >> >> >>>
>> >> >> >> >>>
>> >> >> >> >>> On Tue, Mar 27, 2012 at 10:55 PM, Karl Wright <da...@gmail.com> wrote:
>> >> >> >> >>>>
>> >> >> >> >>>> The document key in Solr is the url of the document, as constructed
>> >> >> >> >>>> by the connector you are using.  If you are using the same document
>> >> >> >> >>>> to construct two different Solr documents, ManifoldCF by definition
>> >> >> >> >>>> cannot be aware of this.  But if these are different files from the
>> >> >> >> >>>> point of view of ManifoldCF they will have different URLs and be
>> >> >> >> >>>> treated differently.  The jobs can overlap in this case with no
>> >> >> >> >>>> difficulty.
>> >> >> >> >>>>
>> >> >> >> >>>> Karl
>> >> >> >> >>>>
>> >> >> >> >>>> On Tue, Mar 27, 2012 at 1:08 PM, Anupam Bhattacharya
>> >> >> >> >>>> <an...@gmail.com> wrote:
>> >> >> >> >>>> > I want to configure two jobs to index into Solr using ManifoldCF
>> >> >> >> >>>> > with the /update/extract requestHandler.
>> >> >> >> >>>> > The 1st is to synchronize only the XML files, and the 2nd to
>> >> >> >> >>>> > synchronize the PDF files.
>> >> >> >> >>>> > If both of these documents share a unique id, can I combine the
>> >> >> >> >>>> > indexes for both in 1 Solr schema without overwriting the details
>> >> >> >> >>>> > added by the previous job?
>> >> >> >> >>>> >
>> >> >> >> >>>> > Suppose:
>> >> >> >> >>>> >       xmldoc indexes field0(id), field1, field2, field3
>> >> >> >> >>>> > &    pdfdoc indexes field0(id), field4, field5, field6.
>> >> >> >> >>>> >
>> >> >> >> >>>> > Output docindex ==> (xml+pdf doc), field0(id), field1, field2,
>> >> >> >> >>>> > field3, field4, field5, field6
>> >> >> >> >>>> >
>> >> >> >> >>>> > Regards
>> >> >> >> >>>> > Anupam

Re: Running 2 jobs to update same document Index but different fields

Posted by Anupam Bhattacharya <an...@gmail.com>.

The second option seems more useful, as it will allow me to add any
business logic.
So, similar to Solr Cell (/update/extract), my new RequestHandler will be
added in solrconfig.xml and will do all the manipulations.
Then I need to get all field values into a temp variable by first
searching by id in the Lucene index, and add these values to the
incoming list of new field values.
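
A minimal sketch of that read-merge-write idea, using a plain Python dict as a
stand-in for the index (this is not a Solr RequestHandler; the function names
and the doc-1 id are hypothetical). It assumes every existing field value can
be read back by id, which as far as I know only holds for fields that are
stored, and that no two jobs write the same id at the same moment.

```python
# Toy stand-in for an index where adding a document with an existing id
# replaces the whole document. Names below are hypothetical.
store = {}

def replace_document(doc_id, fields):
    """Mimic a plain add: the new document replaces the old one wholesale."""
    store[doc_id] = dict(fields)

def merge_document(doc_id, new_fields):
    """Read the current fields by id, overlay the incoming ones, write back."""
    merged = dict(store.get(doc_id, {}))
    merged.update(new_fields)
    replace_document(doc_id, merged)

xml_fields = {"field1": "x1", "field2": "x2", "field3": "x3"}
pdf_fields = {"field4": "p4", "field5": "p5", "field6": "p6"}

# Two plain adds with the same id: the second job wipes out the first.
replace_document("doc-1", xml_fields)
replace_document("doc-1", pdf_fields)
after_replace = sorted(store["doc-1"])

# Same two jobs, but the second one merges instead of replacing.
replace_document("doc-1", xml_fields)
merge_document("doc-1", pdf_fields)
after_merge = sorted(store["doc-1"])

print(after_replace)
print(after_merge)
```

The first printout shows only the PDF job's fields surviving, while the second
shows field1 through field6 together on the one id.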

So do you find this design appropriate and feasible ?

Anupam


-- 
Thanks & Regards
Anupam Bhattacharya

Re: Running 2 jobs to update same document Index but different fields

Posted by Karl Wright <da...@gmail.com>.

Thanks - now I understand what you are trying to do more clearly.

The Documentum connector is going to pick up the XML document and the
PDF document as separate entities.  Thus, they'd also be indexed in
Solr separately.  So if we use that as a starting point, let's see
where it might lead.

First, you'd want each PDF document to have metadata that refers back
to the XML parent document.  I'm not sure how easy it is to set up
such a metadata reference in Documentum, but I vaguely recall there
was indeed some such field.  So let's presume you can get that.  Then,
you'd want to make sure your Solr schema included an "XML document"
field, which had the URL of the parent XML document (or, for XML
documents, the document's own URL) as content.  That would be the ID
you'd use to present a result item to a user.
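
A minimal sketch of that layout, in Python for illustration only. The field
name `xml_document`, the helper names, and the URLs are assumptions made for
this example; they are not part of ManifoldCF or Solr.

```python
def to_solr_doc(url, parent_xml_url=None, **fields):
    """Build one index document; XML documents point at themselves."""
    doc = {"id": url, "xml_document": parent_xml_url or url}
    doc.update(fields)
    return doc


def parents_for_hits(hits):
    """Collapse search hits to distinct parent XML URLs, in first-seen order."""
    seen = []
    for hit in hits:
        parent = hit["xml_document"]
        if parent not in seen:
            seen.append(parent)
    return seen


# Hypothetical URLs for one XML record and its supporting PDF.
xml_doc = to_solr_doc("http://example.com/rec1.xml", title="Some title")
pdf_doc = to_solr_doc("http://example.com/rec1.pdf",
                      parent_xml_url="http://example.com/rec1.xml")
result_ids = parents_for_hits([pdf_doc, xml_doc])
```

Both documents stay separate in the index, and a hit on either one resolves to
the same parent URL for display.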

Does this sound reasonable so far?

The only other piece you might need is manipulation of either the
PDF's metadata, or the XML document's metadata, or both.  For that,
I'd use Solr Cell to perform whatever mappings and manipulations made
sense before the documents actually get indexed.
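
As an illustration, Solr Cell accepts `literal.<field>`, `fmap.<source>` and
`uprefix` request parameters on /update/extract. The sketch below only builds
such a parameter string; the target field names (`xml_document`, `text`,
`attr_`) and the URLs are assumptions, and in practice the Solr output
connection would send the request.

```python
from urllib.parse import urlencode


def extract_params(doc_url, parent_xml_url):
    """Build the query string for a Solr Cell /update/extract request."""
    params = {
        "literal.id": doc_url,                   # unique key for this document
        "literal.xml_document": parent_xml_url,  # assumed parent-reference field
        "fmap.content": "text",                  # map extracted body to 'text'
        "uprefix": "attr_",                      # prefix for unknown fields
    }
    return urlencode(params)


# Hypothetical URLs for a PDF and its parent XML record.
query = extract_params("http://example.com/rec1.pdf",
                       "http://example.com/rec1.xml")
```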

Karl


Re: Running 2 jobs to update same document Index but different fields

Posted by Anupam Bhattacharya <an...@gmail.com>.

I would have been happy if I only had to index the PDF and XML separately.
But in my use case, the XML is the main document: it contains the bibliographic
information (which needs to be presented as the search result) and a
reference to a child/supporting document, which is the actual PDF file. I need
to index the PDF text, and if a search matches the PDF content, then
the parent XML's bibliographic information needs to be presented.

I am trying to call the Solr search engine with a single query that shows
the corresponding XML details for a search term present in the PDF. I saw
that a join feature is introduced in Solr 4.x
(http://wiki.apache.org/solr/Join), but it works like an inner query.
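
For illustration, a join query of that kind could be built as in the sketch
below. It assumes each PDF document carries a field holding the id of its
parent XML document; the field names `xml_document` and `text` are
hypothetical, and the search term is not escaped here.

```python
def parent_join_query(term, from_field="xml_document", to_field="id",
                      text_field="text"):
    """Build a Solr join query: match on children, return their parents."""
    return "{!join from=%s to=%s}%s:%s" % (from_field, to_field,
                                           text_field, term)


q = parent_join_query("lucene")
```

The inner part (`text:lucene`) selects the matching documents, and the join
returns the documents whose `id` equals their `xml_document` value.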

Again, the main requirement is that the PDF should be searchable, but only its
master details from the XML should be presented, from which the user can
request the actual PDF.

-Anupam


-- 
Thanks & Regards
Anupam Bhattacharya

Re: Running 2 jobs to update same document Index but different fields

Posted by Karl Wright <da...@gmail.com>.

This doesn't sound like a problem a connector can solve.  The problem
sounds like severe misuse of Solr/Lucene to me.  You are using the
wrong document key and Lucene does not let you modify a document index
once it is created, and no matter what you do to ManifoldCF it can't
get around that restriction.  So it sounds like you need to
fundamentally rethink your design.

If all you want to do is index XML and PDF as separate documents, just
change your Solr output connection specification to change the
selected "id" field appropriately.  Then, BOTH documents will be
indexed by Solr, each with different metadata as you originally
specified.  I'm frankly having a really hard time seeing why this is
so hard.
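
A toy model of the difference, using a Python dict in place of the index.
This is only a sketch of replace-on-same-key semantics, not Solr code; the
field names and URLs are made up for the example.

```python
def index_all(docs, key_field):
    """Index documents by key; a repeated key replaces the whole document."""
    index = {}
    for doc in docs:
        index[doc[key_field]] = doc
    return index


xml_doc = {"shared_id": "42", "url": "http://example.com/doc42.xml",
           "field1": "title"}
pdf_doc = {"shared_id": "42", "url": "http://example.com/doc42.pdf",
           "field4": "body text"}

# Keyed on the shared id, the XML job wipes out what the PDF job indexed.
by_shared_id = index_all([pdf_doc, xml_doc], "shared_id")

# Keyed on the document URL, both documents survive with their own fields.
by_url = index_all([pdf_doc, xml_doc], "url")
```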

Karl


On Wed, Mar 28, 2012 at 1:26 PM, Anupam Bhattacharya
<an...@gmail.com> wrote:
> Should I write a new Documentum Connector with our specific use-case to go
> forward ?
> I guess your book will be helpful to understand connector framework in
> manifoldcf.
>
> On Wed, Mar 28, 2012 at 7:02 PM, Karl Wright <da...@gmail.com> wrote:
>>
>> Right, LUCENE never did allow you to modify a document's indexes, only
>> replace them.  What I'm trying to tell you is that there is no reason
>> to have the same document ID for both documents.  ManifoldCF will
>> support treating the XML document and PDF document as different
>> documents in Solr.  But if you want them to in fact be the same
>> document, just combined in some way, neither ManifoldCF nor Lucene
>> will support that at this time.
>>
>> Karl
>>
>>
>> On Wed, Mar 28, 2012 at 9:09 AM, Anupam Bhattacharya
>> <an...@gmail.com> wrote:
>> > I saw that the index getting created by 1st PDF indexing job which worked
>> > perfectly well for a particular id. Later when i ran the 2nd XML indexing
>> > Job for the same id. I lost all field indexed by the 1st job and i was left
>> > out with field indexes created my this 2nd job.
>> >
>> > I thought that it would combine field values for a specified doc id.
>> >
>> > As per Lucene developers they mention that by design Lucene doesn't support
>> > this.
>> >
>> > Pls. see following url ::
>> > https://issues.apache.org/jira/browse/LUCENE-3837
>> >
>> > -Anupam
>> >
>> >
>> > On Wed, Mar 28, 2012 at 6:15 PM, Karl Wright <da...@gmail.com> wrote:
>> >>
>> >> The Solr handler that you are using should not matter here.
>> >>
>> >> Can you look at the Simple History report, and do the following:
>> >>
>> >> - Look for a document that is being indexed in both PDF and XML.
>> >> - Find the "ingestion" activity for that document for both PDF and XML
>> >> - Compare the ID's (which for the ingestion activity are the URL's of
>> >> the documents in Webtop)
>> >>
>> >> If the URLs are in fact different, then you should be able to make
>> >> this work.  You need to look at how you configured your Solr instance,
>> >> and which fields you are specifying in your Solr output connection.
>> >> You want those Webtop urls to be indexed as the unique document
>> >> identifier in Solr, not some other ID.
>> >>
>> >> Thanks,
>> >> Karl
>> >>
>> >>
>> >> On Wed, Mar 28, 2012 at 7:19 AM, Anupam Bhattacharya
>> >> <an...@gmail.com> wrote:
>> >> > Today I ran 2 job one by one but it seems since we are using
>> >> > /update/extract Request Handler the field values for common id gets
>> >> > overridden by the latest job. I want to update certain field in the
>> >> > lucene indexes for the doc rather than completely update with new
>> >> > values and by loosing other field value entries.
>> >> >
>> >> > On Tue, Mar 27, 2012 at 11:13 PM, Karl Wright <da...@gmail.com>
>> >> > wrote:
>> >> >> For Documentum, content length is in bytes, I believe.  It does not
>> >> >> set the length, it filters out all documents greater than the
>> >> >> specified length.  Leaving the field blank will perform no
>> >> >> filtering.
>> >> >>
>> >> >> Document types in Documentum are specified by mime type, so you'd
>> >> >> want
>> >> >> to select all that apply.  The actual one used will depend on how
>> >> >> your
>> >> >> particular instance of Documentum is configured, but if you pick
>> >> >> them
>> >> >> all you should have no problem.
>> >> >>
>> >> >> Karl
>> >> >>
>> >> >>
>> >> >> On Tue, Mar 27, 2012 at 1:39 PM, Anupam Bhattacharya
>> >> >> <an...@gmail.com> wrote:
>> >> >>> Thanks!! Seems from your explanation that i can update same
>> >> >>> documents
>> >> >>> other
>> >> >>> field values. I inquired about this because I have two different
>> >> >>> document
>> >> >>> with a parent-child relationship which needs to be indexed as one
>> >> >>> document
>> >> >>> in lucene index.
>> >> >>>
>> >> >>> As you must have understood by now that i am trying to do this for
>> >> >>> Documentum CMS. I have seen the configuration screen for setting
>> >> >>> the
>> >> >>> Content
>> >> >>> length & second for filtering document type. So my question is what
>> >> >>> unit the
>> >> >>> Content length accepts values (bit,bytes,KB,MB etc) & whether this
>> >> >>> configuration set the lengths for documents full text indexing ?.
>> >> >>>
>> >> >>> Additionally to scan only one kind of document e.g PDF what should
>> >> >>> be
>> >> >>> added
>> >> >>> to filter those documents? is it application/pdf OR PDF ?
>> >> >>>
>> >> >>> Regards
>> >> >>> Anupam
>> >> >>>
>> >> >>>
>> >> >>> On Tue, Mar 27, 2012 at 10:55 PM, Karl Wright <da...@gmail.com>
>> >> >>> wrote:
>> >> >>>>
>> >> >>>> The document key in Solr is the url of the document, as
>> >> >>>> constructed
>> >> >>>> by
>> >> >>>> the connector you are using.  If you are using the same document
>> >> >>>> to
>> >> >>>> construct two different Solr documents, ManifoldCF by definition
>> >> >>>> cannot be aware of this.  But if these are different files from
>> >> >>>> the
>> >> >>>> point of view of ManifoldCF they will have different URLs and be
>> >> >>>> treated differently.  The jobs can overlap in this case with no
>> >> >>>> difficulty.
>> >> >>>>
>> >> >>>> Karl
>> >> >>>>
>> >> >>>> On Tue, Mar 27, 2012 at 1:08 PM, Anupam Bhattacharya
>> >> >>>> <an...@gmail.com> wrote:
>> >> >>>> > I want to configure two jobs to index in SOLR using ManifoldCF
>> >> >>>> > using
>> >> >>>> > /extract/update requestHandler.
>> >> >>>> > 1st to synchronize only XML files & 2nd to synchronize the PDF
>> >> >>>> > file.
>> >> >>>> > If both these document share a unique id. Can i combine the
>> >> >>>> > indexes
>> >> >>>> > for
>> >> >>>> > both
>> >> >>>> > in 1 SOLR schema without overriding the details added by
>> >> >>>> > previous
>> >> >>>> > job.
>> >> >>>> >
>> >> >>>> > suppose,
>> >> >>>> >       xmldoc indexes field0(id), field1, field2, field3
>> >> >>>> > &    pdfdoc indexes field0(id), field4, field5, field6.
>> >> >>>> >
>> >> >>>> > Output docindex ==> (xml+pdf doc), field0(id), field1, field2,
>> >> >>>> > field3,
>> >> >>>> > field4, field5, field6
>> >> >>>> >
>> >> >>>> > Regards
>> >> >>>> > Anupam
>> >> >>>> >
>> >> >>>> >
>> >> >>>
>> >> >>>
>> >> >>>
>> >> >>>
>> >> >
>> >> >
>> >> >
>> >> > --
>> >> > Thanks & Regards
>> >> > Anupam Bhattacharya
>> >
>> >
>> >
>> >
>> > --
>> > Thanks & Regards
>> > Anupam Bhattacharya
>> >
>> >
>
>
>
>
> --
> Thanks & Regards
> Anupam Bhattacharya
>
>

Re: Running 2 jobs to update same document Index but different fields

Posted by Anupam Bhattacharya <an...@gmail.com>.

Should I write a new Documentum connector for our specific use case in
order to move forward?
I guess your book will help me understand the connector framework in
ManifoldCF.




-- 
Thanks & Regards
Anupam Bhattacharya

Re: Running 2 jobs to update same document Index but different fields

Posted by Karl Wright <da...@gmail.com>.

Right, Lucene never did allow you to modify a document's indexed fields,
only replace them.  What I'm trying to tell you is that there is no reason
to have the same document ID for both documents.  ManifoldCF will
support treating the XML document and PDF document as different
documents in Solr.  But if you want them to in fact be the same
document, just combined in some way, neither ManifoldCF nor Lucene
will support that at this time.
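
To make the replace-only behaviour concrete, here is a minimal sketch in
Python. It is a toy model of an index keyed by a unique ID, not the Lucene
or Solr API; the class name, IDs, and field values are all made up for
illustration.

```python
# Toy model of an index keyed by a unique ID (an assumption for
# illustration; this is not the Lucene or Solr API).
class ToyIndex:
    def __init__(self):
        self.docs = {}

    def add(self, doc_id, fields):
        # Adding under an existing ID replaces the whole stored document.
        self.docs[doc_id] = dict(fields)


index = ToyIndex()

# Same ID for both: the second add wipes out the first document's fields.
index.add("doc-1", {"field1": "from xml", "field2": "from xml"})
index.add("doc-1", {"field4": "from pdf", "field5": "from pdf"})
shared = index.docs["doc-1"]

# Different IDs: both documents survive side by side.
index.add("doc-1/xml", {"field1": "from xml"})
index.add("doc-1/pdf", {"field4": "from pdf"})

print(sorted(shared))      # only the fields from the second add remain
print(sorted(index.docs))  # the shared ID plus the two distinct IDs
```

With distinct IDs nothing is lost, but the XML and PDF fields live in two
separate index entries rather than one combined entry.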

Karl



Re: Running 2 jobs to update same document Index but different fields

Posted by Anupam Bhattacharya <an...@gmail.com>.

I saw the index being created by the 1st (PDF) indexing job, which worked
perfectly well for a particular id. Later, when I ran the 2nd (XML) indexing
job for the same id, I lost all the fields indexed by the 1st job and was
left with only the fields indexed by the 2nd job.

I thought that it would combine field values for a specified doc id.

According to the Lucene developers, Lucene by design doesn't support
this.

Please see the following URL: https://issues.apache.org/jira/browse/LUCENE-3837


-Anupam




-- 
Thanks & Regards
Anupam Bhattacharya

Re: Running 2 jobs to update same document Index but different fields

Posted by Karl Wright <da...@gmail.com>.

The Solr handler that you are using should not matter here.

Can you look at the Simple History report, and do the following:

- Look for a document that is being indexed in both PDF and XML.
- Find the "ingestion" activity for that document for both PDF and XML.
- Compare the IDs (which for the ingestion activity are the URLs of
the documents in Webtop).

If the URLs are in fact different, then you should be able to make
this work.  You need to look at how you configured your Solr instance,
and which fields you are specifying in your Solr output connection.
You want those Webtop URLs to be indexed as the unique document
identifier in Solr, not some other ID.
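
The comparison described above could be sketched roughly as follows, in
Python. This assumes the Simple History rows have been copied out by hand
into dictionaries; the keys ("job", "activity", "identifier"), the job
names, and the example.com URLs are hypothetical, and the real report has
more columns.

```python
from collections import defaultdict


def find_shared_ids(history):
    """Return identifiers ingested by more than one job, with the jobs involved."""
    jobs_by_id = defaultdict(set)
    for row in history:
        # Ingest activities may carry the output connection name as a suffix,
        # so match on the prefix only (an assumption about the label format).
        if row["activity"].startswith("document ingest"):
            jobs_by_id[row["identifier"]].add(row["job"])
    return {ident: sorted(jobs) for ident, jobs in jobs_by_id.items() if len(jobs) > 1}


# Hypothetical rows for illustration only.
history = [
    {"job": "xml-job", "activity": "document ingest (Solr)",
     "identifier": "http://example.com/drl?objectId=A1"},
    {"job": "pdf-job", "activity": "document ingest (Solr)",
     "identifier": "http://example.com/drl?objectId=A1"},
    {"job": "pdf-job", "activity": "document ingest (Solr)",
     "identifier": "http://example.com/drl?objectId=B2"},
    {"job": "pdf-job", "activity": "fetch", "identifier": "B2"},
]

collisions = find_shared_ids(history)
print(collisions)
```

An empty result would mean every ingested URL belongs to a single job, so
the two jobs are not overwriting each other by identifier.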

Thanks,
Karl



Re: Running 2 jobs to update same document Index but different fields

Posted by Anupam Bhattacharya <an...@gmail.com>.

Today I ran the 2 jobs one after the other, but it seems that, since we are
using the /update/extract request handler, the field values for a common id
get overwritten by the latest job. I want to update only certain fields in
the Lucene index for the document, rather than replacing it completely with
new values and losing the other field values.
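
The gap between what happened and what was wanted can be written down as
two small functions. This is only a sketch in Python with plain
dictionaries standing in for indexed documents; the function names and
field values are made up, and the merge happens outside any index, which
is not something the request handler does.

```python
def overwrite(existing, incoming):
    """What was observed: the incoming document replaces the stored one."""
    return dict(incoming)


def merge(existing, incoming):
    """What was wanted: keep stored fields, add or update the incoming ones."""
    combined = dict(existing)
    combined.update(incoming)
    return combined


# Field names follow the example from the original question.
xml_doc = {"id": "doc-1", "field1": "a", "field2": "b", "field3": "c"}
pdf_doc = {"id": "doc-1", "field4": "d", "field5": "e", "field6": "f"}

observed = overwrite(xml_doc, pdf_doc)
wanted = merge(xml_doc, pdf_doc)

print(sorted(observed))  # id plus field4..field6 only
print(sorted(wanted))    # id plus field1..field6
```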




-- 
Thanks & Regards
Anupam Bhattacharya

Re: Running 2 jobs to update same document Index but different fields

Posted by Karl Wright <da...@gmail.com>.

For Documentum, content length is in bytes, I believe.  It does not
set the length; it filters out all documents greater than the
specified length.  Leaving the field blank will perform no filtering.

Document types in Documentum are specified by mime type, so you'd want
to select all that apply.  The actual one used will depend on how your
particular instance of Documentum is configured, but if you pick them
all you should have no problem.
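
As a rough illustration of the two filters, here is a sketch in Python. It
is not the connector's code: the function name, the document dictionaries,
and the mime type strings are assumptions, and the values actually offered
depend on how the Documentum instance is configured.

```python
def should_crawl(doc, max_length=None, mime_types=None):
    """Apply a byte-length ceiling and a mime-type selection to one document."""
    # A blank (None) maximum means no length filtering at all.
    if max_length is not None and doc["length"] > max_length:
        return False
    # Only documents whose mime type was selected are kept.
    if mime_types is not None and doc["mime_type"] not in mime_types:
        return False
    return True


# Hypothetical documents; lengths are in bytes.
docs = [
    {"name": "small.pdf", "length": 2_000, "mime_type": "application/pdf"},
    {"name": "huge.pdf", "length": 5_000_000, "mime_type": "application/pdf"},
    {"name": "notes.xml", "length": 1_500, "mime_type": "text/xml"},
]

kept = [d["name"] for d in docs
        if should_crawl(d, max_length=1_000_000, mime_types={"application/pdf"})]
unfiltered = [d["name"] for d in docs if should_crawl(d)]

print(kept)        # the PDF under the size ceiling
print(unfiltered)  # everything, since no filter was set
```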

Karl



Re: Running 2 jobs to update same document Index but different fields

Posted by Anupam Bhattacharya <an...@gmail.com>.

Thanks!! From your explanation it seems that I can update other field
values of the same document. I asked about this because I have two different
documents with a parent-child relationship which need to be indexed as one
document in the Lucene index.

As you must have understood by now, I am trying to do this for
Documentum CMS. I have seen the configuration screens for setting the
content length and for filtering by document type. So my question is: in
what unit does the content length accept values (bits, bytes, KB, MB, etc.),
and does this setting limit the length of the documents' full-text indexing?

Additionally, to scan only one kind of document, e.g. PDF, what should be
added to filter those documents? Is it application/pdf or PDF?

Regards
Anupam

On Tue, Mar 27, 2012 at 10:55 PM, Karl Wright <da...@gmail.com> wrote:

> The document key in Solr is the url of the document, as constructed by
> the connector you are using.  If you are using the same document to
> construct two different Solr documents, ManifoldCF by definition
> cannot be aware of this.  But if these are different files from the
> point of view of ManifoldCF they will have different URLs and be
> treated differently.  The jobs can overlap in this case with no
> difficulty.
>
> Karl
>
> On Tue, Mar 27, 2012 at 1:08 PM, Anupam Bhattacharya
> <an...@gmail.com> wrote:
> > I want to configure two jobs to index in SOLR using ManifoldCF using
> > /extract/update requestHandler.
> > 1st to synchronize only XML files & 2nd to synchronize the PDF file.
> > If both these document share a unique id. Can i combine the indexes for
> both
> > in 1 SOLR schema without overriding the details added by previous job.
> >
> > suppose,
> >       xmldoc indexes field0(id), field1, field2, field3
> > &    pdfdoc indexes field0(id), field4, field5, field6.
> >
> > Output docindex ==> (xml+pdf doc), field0(id), field1, field2, field3,
> > field4, field5, field6
> >
> > Regards
> > Anupam
> >
> >
>

Re: Running 2 jobs to update same document Index but different fields

Posted by Karl Wright <da...@gmail.com>.

The document key in Solr is the URL of the document, as constructed by
the connector you are using.  If you are using the same document to
construct two different Solr documents, ManifoldCF by definition
cannot be aware of this.  But if these are different files from the
point of view of ManifoldCF they will have different URLs and be
treated differently.  The jobs can overlap in this case with no
difficulty.

Karl
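
To illustrate the point about document keys, here is a minimal sketch in
Python. It is not Solr or ManifoldCF code: ToyIndex, its add method, the
example.invalid URLs, and the "shared-id" key are all hypothetical. It
assumes that an add under an existing key replaces the whole stored
document, which is how a default Solr add against a unique key behaves.

```python
class ToyIndex:
    """Hypothetical in-memory stand-in for an index keyed by a unique document key."""

    def __init__(self):
        self.docs = {}

    def add(self, key, fields):
        # Assumption: an add under an existing key replaces the whole stored
        # document; fields from the earlier add are not merged in.
        self.docs[key] = dict(fields)


index = ToyIndex()

# Case 1: the XML and PDF files are different documents to the crawler,
# so they get different URLs and therefore different keys.
index.add("http://example.invalid/doc1.xml",
          {"field1": "a", "field2": "b", "field3": "c"})
index.add("http://example.invalid/doc1.pdf",
          {"field4": "d", "field5": "e", "field6": "f"})
separate_count = len(index.docs)

# Case 2: both jobs send their fields under one shared key.
index.add("shared-id", {"field1": "a", "field2": "b", "field3": "c"})
index.add("shared-id", {"field4": "d", "field5": "e", "field6": "f"})
shared_fields = sorted(index.docs["shared-id"])

print(separate_count)   # 2
print(shared_fields)    # ['field4', 'field5', 'field6']
```

In the first case the two jobs can overlap without interfering, but the
result is two index entries rather than one merged entry. In the second
case only the fields from the later add survive, so under this assumption
the combined xml+pdf document would have to be assembled before it is
sent to the index.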

On Tue, Mar 27, 2012 at 1:08 PM, Anupam Bhattacharya
<an...@gmail.com> wrote:
> I want to configure two jobs to index in SOLR using ManifoldCF using
> /extract/update requestHandler.
> 1st to synchronize only XML files & 2nd to synchronize the PDF file.
> If both these document share a unique id. Can i combine the indexes for both
> in 1 SOLR schema without overriding the details added by previous job.
>
> suppose,
>       xmldoc indexes field0(id), field1, field2, field3
> &    pdfdoc indexes field0(id), field4, field5, field6.
>
> Output docindex ==> (xml+pdf doc), field0(id), field1, field2, field3,
> field4, field5, field6
>
> Regards
> Anupam
>
>