
Posted to dev@manifoldcf.apache.org by Matteo Grolla <m....@sourcesense.com> on 2014/06/13 17:48:32 UTC

questions emerged designing a connector to index solrxml documents

Hi,
	I'd like to develop a connector that indexes Solr XML documents into a Solr instance. By the way, I'm absolutely willing to contribute the code.
I have a few questions that I hope you can answer.

I'm starting from the filesystem connector, since it seems the most similar.
A big difference, though, is that now a single file can represent many documents.

How can I handle this efficiently?
Suppose I leave the seeding phase as it is in the filesystem connector (getDocumentIdentifiers() method).
In the document processing phase (processDocuments() method) I would:
1) obtain a file path
2) parse the XML file
3) seed the ids of the Solr documents and add a child relation from those ids to the file path.
	Ex. I seed the identifier "hd-samsung-500GB", which identifies one of the documents contained in the file "/toIndex/hd.xml"
		(let's pretend that hd.xml contains 50 Solr documents)
4) when ManifoldCF calls processDocuments() with the identifier "hd-samsung-500GB"
	I could follow the parent relation to "/toIndex/hd.xml"
	reparse the file
	create a RepositoryDocument using the information related to "hd-samsung-500GB"
	ingest this RepositoryDocument
…
but this would be a very wasteful approach, since the same file would be reparsed for every document it contains.

Ideally I'd like to parse the XML file only once.

I was thinking I could do the following in the seeding phase:
	parse the file
	create a RepositoryDocument for every Solr document
	serialize them into the document identifiers
…
but I think this would make for really ugly identifiers in the status reports.
What do you think? Is there a better way to do it?

Another thing that confuses me is how (ManifoldCF) documents change state.
Ex. 
	In the filesystem connector I crawl 1 directory with 1 file.
	Afterwards I look at the document status report and see that both the directory and the file have state "processed".
	The document has been ingested, so I think the ingest method caused its status change.
	What method caused the state change for the directory?

-- 
Matteo Grolla
Sourcesense - making sense of Open Source
http://www.sourcesense.com

Re: questions emerged designing a connector to index solrxml documents

Posted by Ahmet Arslan <io...@yahoo.com.INVALID>.

Hi

Karl has made the book publicly available.
You can access it here: https://manifoldcfinaction.googlecode.com/svn/trunk/pdfs/

Ahmet


On Friday, June 13, 2014 7:36 PM, Matteo Grolla <m....@sourcesense.com> wrote:
Really thanks again
I'm figuring out how it works.

By the way: 
    I bought ManifoldCF in Action
    great documentation!!!


-- 
Matteo Grolla
Sourcesense - making sense of Open Source
http://www.sourcesense.com





Re: questions emerged designing a connector to index solrxml documents

Posted by Matteo Grolla <m....@sourcesense.com>.

Thanks a lot again.
I'm figuring out how it works.

By the way: 
	I bought ManifoldCF in Action
	great documentation!!!


-- 
Matteo Grolla
Sourcesense - making sense of Open Source
http://www.sourcesense.com

On 13 Jun 2014, at 18:29, Karl Wright wrote:

> Hi Matteo,
> 
> The framework will take care of the state change.  You do not try to do
> that within the connector.  All you do is process the document(s) that are
> handed to you.
> 
> So, for example, if you have the following document identifiers:
> 
> /toIndex/hd.xml (identifiable as a file)
> /toIndex/hd.xml:0 (first document within hd.xml)
> /toIndex/hd.xml:1 (second document within hd.xml)
> 
> etc.
> 
> Then, if you see a processDocuments() request for "/toIndex/hd.xml", you
> pick up the XML and parse it, calling IProcessActivity.addReference() for
> each solr document within (and you construct the document identifier too
> during the same pass, and the carrydown content information you extract).
> If you see a processDocuments() request for /toIndex/hd.xml:0, then you
> simply pick up the content that is passed to you in the carrydown, and call
> activities.ingestDocument() with it.
> 
> States do not *ever* come into connector design; the framework always takes
> care of that.
> 
> Thanks,
> Karl
> 
> 
> 

Re: questions emerged designing a connector to index solrxml documents

Posted by Karl Wright <da...@gmail.com>.

Hi Matteo,

The framework will take care of the state change.  You do not try to do
that within the connector.  All you do is process the document(s) that are
handed to you.

So, for example, if you have the following document identifiers:

/toIndex/hd.xml (identifiable as a file)
/toIndex/hd.xml:0 (first document within hd.xml)
/toIndex/hd.xml:1 (second document within hd.xml)

etc.

Then, if you see a processDocuments() request for "/toIndex/hd.xml", you
pick up the XML and parse it, calling IProcessActivity.addReference() for
each solr document within. During the same pass you also construct each
document identifier and extract the content to pass along as carrydown
information. If you see a processDocuments() request for "/toIndex/hd.xml:0",
then you simply pick up the content that is passed to you in the carrydown,
and call activities.ingestDocument() with it.
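
The flow described above can be modelled in a few lines. This is only a toy sketch in Python, not ManifoldCF code: the real connector API is Java, and ToyActivities, process_document, crawl and the read_file callback are hypothetical names invented for illustration. The sketch also assumes the usual Solr XML layout of <add><doc><field name="..."> elements. It shows the XML file being parsed once, with each solr document's fields carried down to its child identifier.

```python
import xml.etree.ElementTree as ET
from typing import Callable, Dict, List, Tuple

Fields = List[Tuple[str, str]]


class ToyActivities:
    """Hypothetical stand-in for the framework's activity object."""

    def __init__(self) -> None:
        self.carrydown: Dict[str, Fields] = {}  # child id -> data from parent
        self.queue: List[str] = []              # identifiers still to process
        self.ingested: Dict[str, Fields] = {}   # what reached the index

    def add_reference(self, child_id: str, parent_id: str, data: Fields) -> None:
        # Register a child identifier and remember the data carried down to it.
        self.carrydown[child_id] = data
        self.queue.append(child_id)

    def retrieve_parent_data(self, child_id: str) -> Fields:
        return self.carrydown[child_id]

    def ingest_document(self, doc_id: str, fields: Fields) -> None:
        self.ingested[doc_id] = fields


def process_document(identifier: str, activities: ToyActivities,
                     read_file: Callable[[str], str]) -> None:
    path, sep, suffix = identifier.rpartition(":")
    if sep and path and suffix.isdecimal():
        # Child identifier: no parsing, just ingest the carried-down fields.
        fields = activities.retrieve_parent_data(identifier)
        activities.ingest_document(identifier, fields)
        return
    # File identifier: parse once and register one child per <doc> element.
    root = ET.fromstring(read_file(identifier))
    for index, doc in enumerate(root.iter("doc")):
        fields = [(f.get("name", ""), f.text or "") for f in doc.findall("field")]
        activities.add_reference(f"{identifier}:{index}", identifier, fields)


def crawl(seed: str, read_file: Callable[[str], str]) -> ToyActivities:
    """Process the seed and every identifier it discovers, in order."""
    activities = ToyActivities()
    activities.queue.append(seed)
    while activities.queue:
        process_document(activities.queue.pop(0), activities, read_file)
    return activities
```

Calling crawl("/toIndex/hd.xml", read_file) with a callback that returns the file's text reads the file a single time, however many solr documents it contains.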

States do not *ever* come into connector design; the framework always takes
care of that.
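
To make that division of responsibility concrete, here is a deliberately simplified toy loop in Python. It is not how ManifoldCF is implemented: run_queue, the process callback and the "needs-retry" state name are all assumptions made up for this sketch. The only point is that the state bookkeeping lives in the caller, while the connector code just processes what it is handed.

```python
from typing import Callable, Dict, Iterable


def run_queue(identifiers: Iterable[str],
              process: Callable[[str], None]) -> Dict[str, str]:
    """Toy framework loop: the caller, not the connector, records state."""
    states: Dict[str, str] = {}
    for identifier in identifiers:
        try:
            process(identifier)  # connector code: knows nothing about states
        except Exception:
            states[identifier] = "needs-retry"  # hypothetical state name
        else:
            states[identifier] = "processed"
    return states
```

In this model a directory and a file are treated alike: each is marked once the processing call for it returns.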

Thanks,
Karl



On Fri, Jun 13, 2014 at 12:22 PM, Matteo Grolla <m....@sourcesense.com>
wrote:

> thanks very much Karl
>
> Can you also respond to the part regarding the state change?
> In the filesystem connector I don't see a method call that could change
> the state of the directory to processed
> I was thinking that
>         if processDocuments() is called with the identifier
> "/toIndex/hd.xml"
>         and there are no exceptions
>         this could be enough to put "/toIndex/hd.xml" in state "processed"
>         am I right?
>
> --
> Matteo Grolla
> Sourcesense - making sense of Open Source
> http://www.sourcesense.com
>

Re: questions emerged designing a connector to index solrxml documents

Posted by Matteo Grolla <m....@sourcesense.com>.

Thanks very much, Karl.

Can you also respond to the part regarding the state change?
In the filesystem connector I don't see a method call that could change the state of the directory to "processed".
I was thinking that 
	if processDocuments() is called with the identifier "/toIndex/hd.xml" 
	and there are no exceptions,
	this could be enough to put "/toIndex/hd.xml" in state "processed".
	Am I right?

-- 
Matteo Grolla
Sourcesense - making sense of Open Source
http://www.sourcesense.com

On 13 Jun 2014, at 17:54, Karl Wright wrote:

> HI Matteo,
> 
> What I'd recommend is that you create a document identifier for each solr
> document, and a different kind of document identifier for each xml file.
> The xml file would then be like a "directory", and the solr document would
> be like the "file".  You then can use carry-down support to allow the xml
> file to be parsed only once.  A similar approach is used for the RSS
> connector.
> 
> Thanks,
> Karl
> 
> 
> 

Re: questions emerged designing a connector to index solrxml documents

Posted by Karl Wright <da...@gmail.com>.

Hi Matteo,

What I'd recommend is that you create a document identifier for each solr
document, and a different kind of document identifier for each xml file.
The xml file would then be like a "directory", and the solr document would
be like the "file".  You then can use carry-down support to allow the xml
file to be parsed only once.  A similar approach is used for the RSS
connector.
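
One possible way to tell the two kinds of identifier apart is sketched below in Python. It is purely illustrative: the ":" separator and the helper names child_identifier and parse_identifier are assumptions, not part of ManifoldCF, and the scheme assumes that no real file path ends in ":" followed by digits.

```python
from typing import Optional, Tuple

SEPARATOR = ":"  # assumed separator between the file path and the document index


def child_identifier(file_path: str, index: int) -> str:
    """Build the identifier of the index-th solr document inside file_path."""
    return f"{file_path}{SEPARATOR}{index}"


def parse_identifier(identifier: str) -> Tuple[str, Optional[int]]:
    """Split an identifier into (file_path, index); index is None for an xml file."""
    path, sep, suffix = identifier.rpartition(SEPARATOR)
    if sep and path and suffix.isdecimal():
        return path, int(suffix)
    return identifier, None
```

With this scheme the identifiers shown in the status report stay short and readable, and the kind of each identifier can be recovered without any lookup.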

Thanks,
Karl


