Posted to dev@manifoldcf.apache.org by Matteo Grolla <m....@sourcesense.com> on 2014/06/16 16:35:55 UTC

Solr Extracting request handler

Hi, during my first indexing I noticed that ManifoldCF uses the Solr extracting request handler to extract the content of an XML file.
For performance reasons it would be better if ManifoldCF handled the extraction itself and let Solr act purely as the search engine.
Is this because of the connector design, the framework design, or just something that remains to be done?

-- 
Matteo Grolla
Sourcesense - making sense of Open Source
http://www.sourcesense.com


Re: Solr Extracting request handler

Posted by Alessandro Benedetti <be...@gmail.com>.
Hi Karl,
I am proceeding with modifying the Solr connector, introducing a new flag that
will control the operating mode:
1) using the extracting update handler (as it is right now)
2) using a SolrInputDocument and a classic SolrJ add.

I will introduce a flag checkbox as we did for "keepAllMetadata".
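
Roughly, the two modes would look like this (a plain SolrJ sketch to illustrate the idea only; the flag name, handler path and field names are placeholders, not the actual connector code):

// Sketch only: the two operating modes behind the proposed flag.
// "useExtractHandler", the handler path and the field names are placeholders.
import java.io.File;

import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.request.ContentStreamUpdateRequest;
import org.apache.solr.common.SolrInputDocument;

public class SolrConnectorModeSketch {

  public static void index(SolrServer server, String id, File rawFile,
                           String extractedText, boolean useExtractHandler)
      throws Exception {
    if (useExtractHandler) {
      // Mode 1 (current behaviour): ship the raw binary to the extracting
      // update handler and let Solr Cell / Tika do the extraction.
      ContentStreamUpdateRequest req =
          new ContentStreamUpdateRequest("/update/extract");
      req.addFile(rawFile, "application/octet-stream");
      req.setParam("literal.id", id);
      req.process(server);
    } else {
      // Mode 2 (new): content was already extracted upstream (e.g. by a Tika
      // transformer in the pipeline), so do a classic SolrJ add.
      SolrInputDocument doc = new SolrInputDocument();
      doc.addField("id", id);
      doc.addField("content", extractedText);
      server.add(doc);
    }
  }
}

The checkbox would default to mode 1, so existing jobs keep working unchanged.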

Is there already an issue for that, Karl?
Let me know!

Cheers


2014-06-18 16:10 GMT+01:00 Karl Wright <da...@gmail.com>:

> Hi Alessandro,
>
> The reason for backwards compatibility is obvious: people upgrade
> ManifoldCF all the time, and when they do it should not stop working for
> them.
>
> Putting Tika all the time in the pipeline is also not appropriate for other
> output connections.  Even if you did it just for Solr, you'd then have to
> insure that the Tika transformer was exactly compatible with Solr Cell,
> which I would be very uncomfortable with agreeing to.
>
> So let's presume that you'd do one of two things.  Either:
>
> - Leave the existing Solr connector alone, and create a whole new Solr
> connector designed to work with a Tika transformer, or
> - Modify the existing Solr connector so that it operates in two possible
> modes, one of which supports the legacy model (the default), and one of
> which supports your new model
>
> If this sounds overly burdensome, I'm sorry but it's necessary until MCF
> 2.0.  For MCF 2.0, which I've begun to think about, we can dispense with
> backwards compatibility, including legacy tabs that have outlived their
> usefulness, etc.  But that's not a 1.7 solution.
>
> Karl
>
>
>
> On Wed, Jun 18, 2014 at 10:16 AM, Alessandro Benedetti <
> benedetti.alex85@gmail.com> wrote:
>
> > Hello Karl,
> > What i was thinking is:
> > assuming we have the Tika Connector, the responsibility to extract
> content
> > will pass from Solr to the Tika processor.
> >
> > So we can change the part in the Solr Connector that manages the building
> > of the request to send to the Extract update handler.
> > Particularly that part will change in the classic way: usually it's good
> to
> > build a SolrDocument in SolrJ and then add it to SolrServer.
> >
> > Why should we give retrocompatibility from Solr Connector point of view ?
> > From the user point of view, a Job will be selected with the Tika
> Conenctor
> > in the pipeline, so we are providing the same identical feature.
> > One way can be to make the Tika Processor Connector by default in the
> > pipeline, and someone will be able to deactivate it only if needed.
> >
> > Cheers
> >
> >
> >
> > 2014-06-18 14:32 GMT+01:00 Karl Wright <da...@gmail.com>:
> >
> > > Hi Alessandro,
> > > What is your concrete proposal to change the Solr connector?  Bear in
> > mind
> > > that we do need to maintain backwards compatibility.  If you list your
> > > specific changes, not in any huge detail, but with enough detail that
> we
> > > understand your proposal, that would help.  What happens to the UI?
>  What
> > > happens to the internals?
> > >
> > > Thanks,
> > > Karl
> > >
> > >
> > >
> > > On Wed, Jun 18, 2014 at 9:21 AM, Alessandro Benedetti <
> > > benedetti.alex85@gmail.com> wrote:
> > >
> > > > But guys, why not simply pass to a classic SolrJ SolrDocument
> creation
> > > and
> > > > ingestion in the Solr Server ? Easy and Straighforward !
> > > >
> > > > In the end at that point the RepositoryDocument will me only a Map of
> > > > metadata and values.
> > > > Content will be part of that, so I guess the conversion to a
> > SolrDocument
> > > > will be immediate.
> > > >
> > > > Cheers
> > > >
> > > >
> > > > 2014-06-18 3:26 GMT+01:00 Karl Wright <da...@gmail.com>:
> > > >
> > > > > Hi Abe-san,
> > > > >
> > > > > Near as I can tell, the major consumer of disk space is the Maven
> > > target
> > > > > directories.  This is generating many tens of megabytes of
> temporary
> > > disk
> > > > > usage for every connector.  Luckily if you use ant, this is not a
> > > > problem.
> > > > >
> > > > > Karl
> > > > >
> > > > >
> > > > > On Tue, Jun 17, 2014 at 9:55 PM, Karl Wright <da...@gmail.com>
> > > wrote:
> > > > >
> > > > > > Hi Abe-san,
> > > > > >
> > > > > > Tika jars are not very big:
> > > > > >
> > > > > > C:\wip\mcf\trunk\lib>dir tika*
> > > > > >  Volume in drive C has no label.
> > > > > >  Volume Serial Number is 002E-D1F0
> > > > > >
> > > > > >  Directory of C:\wip\mcf\trunk\lib
> > > > > >
> > > > > > 06/05/2014  08:21 AM           493,374 tika-core.jar
> > > > > > 06/05/2014  08:21 AM           523,677 tika-parsers.jar
> > > > > >                2 File(s)      1,017,051 bytes
> > > > > >                0 Dir(s)  140,792,315,904 bytes free
> > > > > >
> > > > > > The entire lib directory is 85M:
> > > > > >
> > > > > > 85,156,330 bytes
> > > > > >
> > > > > > The built binary image is still about 185Mb, I believe.  So I
> don't
> > > > know
> > > > > > why you think it is >1Gb?  Temporary class files?  I don't think
> we
> > > can
> > > > > > avoid those.
> > > > > >
> > > > > > I'd rather not make things more complicated than they need to be
> by
> > > > > adding
> > > > > > a new required service - even though it would fit naturally with
> > the
> > > > > > connector arrangement.
> > > > > >
> > > > > > Karl
> > > > > >
> > > > > >
> > > > > >
> > > > > >
> > > > > >
> > > > > > On Tue, Jun 17, 2014 at 9:42 PM, Shinichiro Abe <
> > > > > > shinichiro.abe.1@gmail.com> wrote:
> > > > > >
> > > > > >> Hi Karl,
> > > > > >>
> > > > > >> Okay, I assumed Tika connector outputs files.
> > > > > >> If we post character data metadata got from Tika,
> > "/update/extract"
> > > > > >> handler
> > > > > >> can handle this(provides params:
> > > > > >> literal.content=value&literal.metaField=foobar
> > > > > >> with using NullInputStream for binary data like CONNECTORS-936).
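> > > > > >>
> > > > > >> Roughly, in SolrJ terms it would be something like this (sketch only; the
> > > > > >> field names and values are just examples, and extractedText / solrServer
> > > > > >> are assumed to exist elsewhere):
> > > > > >>
> > > > > >> // uses org.apache.solr.client.solrj.request.ContentStreamUpdateRequest
> > > > > >> // and org.apache.solr.common.util.ContentStreamBase
> > > > > >> ContentStreamUpdateRequest req = new ContentStreamUpdateRequest("/update/extract");
> > > > > >> req.setParam("literal.id", "doc-1");            // example field/value
> > > > > >> req.setParam("literal.content", extractedText); // already-extracted text
> > > > > >> req.setParam("literal.metaField", "foobar");
> > > > > >> req.addContentStream(new ContentStreamBase.StringStream("")); // empty body instead of the binary
> > > > > >> req.process(solrServer);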
> > > > > >>
> > > > > >> BTW, now trunk built size is too big(1G+). Maybe because
> > CloudSearch
> > > > > >> connector uses Tika jars.
> > > > > >> Tika connector and CloudSearch connector should extract text via
> > > > > >> tika-server[1]
> > > > > >> and MCF should not have many Tika jars, do you think?
> > > > > >>
> > > > > >> [1]
> > > > > >> http://wiki.apache.org/tika/TikaJAXRS
> > > > > >>
> > > > > >> Thanks,
> > > > > >> Shinichiro Abe
> > > > > >>
> > > > > >> On 2014/06/18, at 9:45, Karl Wright <da...@gmail.com> wrote:
> > > > > >>
> > > > > >> > Hi Abe-san,
> > > > > >> >
> > > > > >> > It sounds like you might be thinking that transformation
> > > connectors
> > > > > are
> > > > > >> > like output connectors.  Just so we are clear, transformation
> > > > > >> connectors in
> > > > > >> > 1.7 receive a RepositoryDocument as input, and then pass a
> > > > > >> > RepositoryDocument on to the next connector in the chain.  So
> I
> > > > don't
> > > > > >> know
> > > > > >> > why .xml files would be involved.  I'd expect the Tika
> connector
> > > to
> > > > > >> read a
> > > > > >> > binary file from one RepositoryDocument object and convert its
> > > > > contents
> > > > > >> to
> > > > > >> > another RepositoryDocument object which would have character
> > data
> > > > and
> > > > > >> > metadata only.  Would this work for your case, do you think?
> > > > > >> >
> > > > > >> > Karl
> > > > > >> >
> > > > > >> >
> > > > > >> >
> > > > > >> > On Tue, Jun 17, 2014 at 8:38 PM, Shinichiro Abe <
> > > > > >> shinichiro.abe.1@gmail.com>
> > > > > >> > wrote:
> > > > > >> >
> > > > > >> >> Hi Karl,
> > > > > >> >>
> > > > > >> >> Yes. I thought the standard update handler met that
> > requirement.
> > > > > >> >> For instance, Tika extractor transformation connector creates
> > two
> > > > > >> files.
> > > > > >> >> 1. addtoSolr.xml for add and update
> > > > > >> >> 2. deletetoSolr.xml for delete
> > > > > >> >> File connector ingests these xml files, then Solr connector
> > posts
> > > > > these
> > > > > >> >> files by "/update" handler.
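> > > > > >> >>
> > > > > >> >> For example, the add/delete XML that the "/update" handler expects could be
> > > > > >> >> posted with SolrJ's DirectXmlRequest, roughly like this (sketch only; the ids
> > > > > >> >> and fields are just examples, and solrServer is assumed to exist):
> > > > > >> >>
> > > > > >> >> // uses org.apache.solr.client.solrj.request.DirectXmlRequest
> > > > > >> >> String addXml = "<add><doc>"
> > > > > >> >>     + "<field name=\"id\">doc-1</field>"
> > > > > >> >>     + "<field name=\"content\">text extracted by Tika</field>"
> > > > > >> >>     + "</doc></add>";
> > > > > >> >> String deleteXml = "<delete><id>doc-1</id></delete>";
> > > > > >> >> new DirectXmlRequest("/update", addXml).process(solrServer);
> > > > > >> >> new DirectXmlRequest("/update", deleteXml).process(solrServer);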
> > > > > >> >>
> > > > > >> >> In the the Solr Connector, other function as to update
> handler
> > > > > >> >> might not be necessary except for  "/update" handler.
> > > > > >> >>
> > > > > >> >> Thanks,
> > > > > >> >> Shinichiro Abe
> > > > > >> >>
> > > > > >> >> On 2014/06/18, at 8:02, Karl Wright <da...@gmail.com>
> > wrote:
> > > > > >> >>
> > > > > >> >>> Hi Abe-san,
> > > > > >> >>>
> > > > > >> >>> So just to be sure -- you believe that no changes at all are
> > > > > required
> > > > > >> to
> > > > > >> >>> the Solr Connector as it stands now, other than to use the
> > > update
> > > > > >> handler
> > > > > >> >>> rather than the /update/extract handler?
> > > > > >> >>>
> > > > > >> >>> Karl
> > > > > >> >>>
> > > > > >> >>>
> > > > > >> >>>
> > > > > >> >>>
> > > > > >> >>>
> > > > > >> >>> On Tue, Jun 17, 2014 at 5:14 PM, Shinichiro Abe <
> > > > > >> >> shinichiro.abe.1@gmail.com>
> > > > > >> >>> wrote:
> > > > > >> >>>
> > > > > >> >>>>> As for changing the Solr connector so that it doesn't go
> to
> > > the
> > > > > >> >> extracting
> > > > > >> >>>> update handler
> > > > > >> >>>>
> > > > > >> >>>> I don't think it needs to change Solr connector with new
> > > checkbox
> > > > > >> >> because
> > > > > >> >>>> currently we can change "/update/extract" into "/update" at
> > > > 'Update
> > > > > >> >>>> Handler' at Paths tab in Solr connector UI. I confirmed I
> > could
> > > > > post
> > > > > >> >> CSV,
> > > > > >> >>>> JSON and XML files to Solr by changing that and using File
> > > > > connector.
> > > > > >> >> So I
> > > > > >> >>>> wish we allow Tika extractor transformation connector to
> > create
> > > > XML
> > > > > >> >> files
> > > > > >> >>>> that Solr expects to see.
> > > > > >> >>>>
> > > > > >> >>>> Regards,
> > > > > >> >>>> Shinichiro Abe
> > > > > >> >>>>
> > > > > >> >>>>
> > > > > >> >>>> 2014-06-18 2:55 GMT+09:00 Karl Wright <daddywri@gmail.com
> >:
> > > > > >> >>>>
> > > > > >> >>>>> The pipeline code itself is now "complete" in trunk.
>  Zaizi
> > > said
> > > > > >> they'd
> > > > > >> >>>>> contribute a Tika extractor transformation connector - and
> > if
> > > > they
> > > > > >> >> don't
> > > > > >> >>>>> get around to that in a month or so, I may take a crack at
> > it
> > > > > >> myself.
> > > > > >> >>>>>
> > > > > >> >>>>> As for changing the Solr connector so that it doesn't go
> to
> > > the
> > > > > >> >>>> extracting
> > > > > >> >>>>> update handler, it would be great if:
> > > > > >> >>>>> (1) Someone created a ticket for this, and
> > > > > >> >>>>> (2) A patch was provided that maintains backwards
> > > compatibility
> > > > > with
> > > > > >> >>>>> previous versions of the connector (so a checkbox would
> > > probably
> > > > > >> need
> > > > > >> >> to
> > > > > >> >>>> go
> > > > > >> >>>>> into the UI somewhere).  Do either of you want to start
> this
> > > > > >> process?
> > > > > >> >>>>>
> > > > > >> >>>>> Thanks!
> > > > > >> >>>>> Karl
> > > > > >> >>>>>
> > > > > >> >>>>>
> > > > > >> >>>>>
> > > > > >> >>>>> On Mon, Jun 16, 2014 at 12:37 PM, Karl Wright <
> > > > daddywri@gmail.com
> > > > > >
> > > > > >> >>>> wrote:
> > > > > >> >>>>>
> > > > > >> >>>>>> Hi guys,
> > > > > >> >>>>>>
> > > > > >> >>>>>> You folks may not have looked at 1.7 yet, but it has a
> full
> > > > > >> pipeline,
> > > > > >> >>>> and
> > > > > >> >>>>>> is expected to have a Tika extractor as a transformation
> > > > > connector.
> > > > > >> >>>>>>
> > > > > >> >>>>>> Karl
> > > > > >> >>>>>>
> > > > > >> >>>>>>
> > > > > >> >>>>>>
> > > > > >> >>>>>> On Mon, Jun 16, 2014 at 11:14 AM, Matteo Grolla <
> > > > > >> >>>>> m.grolla@sourcesense.com>
> > > > > >> >>>>>> wrote:
> > > > > >> >>>>>>
> > > > > >> >>>>>>> Thanks Alessandro,
> > > > > >> >>>>>>>       that explains the situation clearly.
> > > > > >> >>>>>>> And I agree that sending all the metadata as get
> parameter
> > > can
> > > > > be
> > > > > >> >>>>>>> problematic
> > > > > >> >>>>>>>
> > > > > >> >>>>>>> Cheers
> > > > > >> >>>>>>>
> > > > > >> >>>>>>> --
> > > > > >> >>>>>>> Matteo Grolla
> > > > > >> >>>>>>> Sourcesense - making sense of Open Source
> > > > > >> >>>>>>> http://www.sourcesense.com
> > > > > >> >>>>>>>
> > > > > >> >>>>>>> Il giorno 16/giu/2014, alle ore 17:09, Alessandro
> > Benedetti
> > > ha
> > > > > >> >>>> scritto:
> > > > > >> >>>>>>>
> > > > > >> >>>>>>>> mmmm the point is that right now ManifoldCF has no
> > > > extractors.
> > > > > >> >>>>>>>> The Repository connectors extracts directly the binary
> > and
> > > > > there
> > > > > >> is
> > > > > >> >>>> no
> > > > > >> >>>>>>>> "Extractor Processor" yet.
> > > > > >> >>>>>>>> But recently a pipe-line processor architecture has
> been
> > > > > thought
> > > > > >> (
> > > > > >> >>>>>>>> https://issues.apache.org/jira/browse/CONNECTORS-959)
> > > > > >> >>>>>>>> So can fit there.
> > > > > >> >>>>>>>>
> > > > > >> >>>>>>>> Cheers
> > > > > >> >>>>>>>>
> > > > > >> >>>>>>>>
> > > > > >> >>>>>>>> 2014-06-16 15:59 GMT+01:00 Matteo Grolla <
> > > > > >> m.grolla@sourcesense.com
> > > > > >> >>>>> :
> > > > > >> >>>>>>>>
> > > > > >> >>>>>>>>> Since Solr extracting request handler takes the binary
> > and
> > > > > >> extracts
> > > > > >> >>>>>>> text
> > > > > >> >>>>>>>>> what is the point of not using Manifold extractor and
> > send
> > > > > text
> > > > > >> and
> > > > > >> >>>>>>>>> binaries to solr?
> > > > > >> >>>>>>>>> I mean the end result is the same solr indexes text
> and
> > > > stores
> > > > > >> text
> > > > > >> >>>>>>>>> So if manifold supports text extraction it seems me
> this
> > > is
> > > > > the
> > > > > >> >>>> place
> > > > > >> >>>>>>>>> where it should be done
> > > > > >> >>>>>>>>>
> > > > > >> >>>>>>>>> --
> > > > > >> >>>>>>>>> Matteo Grolla
> > > > > >> >>>>>>>>> Sourcesense - making sense of Open Source
> > > > > >> >>>>>>>>> http://www.sourcesense.com
> > > > > >> >>>>>>>>>
> > > > > >> >>>>>>>>> Il giorno 16/giu/2014, alle ore 16:51, Antonio David
> > Perez
> > > > > >> Morales
> > > > > >> >>>> ha
> > > > > >> >>>>>>>>> scritto:
> > > > > >> >>>>>>>>>
> > > > > >> >>>>>>>>>> Hi Matteo
> > > > > >> >>>>>>>>>>
> > > > > >> >>>>>>>>>> Manifold already handles the extraction, but the only
> > way
> > > > to
> > > > > >> send
> > > > > >> >>>>>>> binary
> > > > > >> >>>>>>>>>> content and document metadata to Solr is using the
> > > > > >> update/extract
> > > > > >> >>>>>>>>> handler,
> > > > > >> >>>>>>>>>> where the metadata is sent as query parameters and
> the
> > > > binary
> > > > > >> >>>>> content
> > > > > >> >>>>>>> is
> > > > > >> >>>>>>>>>> sent in the body of the requests, allowing Solr to
> use
> > > Tika
> > > > > to
> > > > > >> >>>>> obtain
> > > > > >> >>>>>>> the
> > > > > >> >>>>>>>>>> raw content to be stored in Solr.
> > > > > >> >>>>>>>>>>
> > > > > >> >>>>>>>>>> Regards
> > > > > >> >>>>>>>>>>
> > > > > >> >>>>>>>>>>
> > > > > >> >>>>>>>>>> On Mon, Jun 16, 2014 at 4:35 PM, Matteo Grolla <
> > > > > >> >>>>>>> m.grolla@sourcesense.com
> > > > > >> >>>>>>>>>>
> > > > > >> >>>>>>>>>> wrote:
> > > > > >> >>>>>>>>>>
> > > > > >> >>>>>>>>>>> Hi During my first indexing I noticed that manifold
> > uses
> > > > > Solr
> > > > > >> >>>>>>> extracting
> > > > > >> >>>>>>>>>>> request handler to extract the content of an xml
> file
> > > > > >> >>>>>>>>>>> For performance reasons it would be better if
> Manifold
> > > > > handled
> > > > > >> >>>> the
> > > > > >> >>>>>>>>>>> extraction letting Solr do the search engine
> > > > > >> >>>>>>>>>>> Is this because of the connector design, framework
> > > design
> > > > or
> > > > > >> just
> > > > > >> >>>>> to
> > > > > >> >>>>>>> be
> > > > > >> >>>>>>>>>>> done?
> > > > > >> >>>>>>>>>>>
> > > > > >> >>>>>>>>>>> --
> > > > > >> >>>>>>>>>>> Matteo Grolla
> > > > > >> >>>>>>>>>>> Sourcesense - making sense of Open Source
> > > > > >> >>>>>>>>>>> http://www.sourcesense.com
> > > > > >> >>>>>>>>>>>
> > > > > >> >>>>>>>>>>>
> > > > > >> >>>>>>>>>>
> > > > > >> >>>>>>>>>> --
> > > > > >> >>>>>>>>>>
> > > > > >> >>>>>>>>>> ------------------------------
> > > > > >> >>>>>>>>>> This message should be regarded as confidential. If
> you
> > > > have
> > > > > >> >>>>> received
> > > > > >> >>>>>>>>> this
> > > > > >> >>>>>>>>>> email in error please notify the sender and destroy
> it
> > > > > >> >>>> immediately.
> > > > > >> >>>>>>>>>> Statements of intent shall only become binding when
> > > > confirmed
> > > > > >> in
> > > > > >> >>>>> hard
> > > > > >> >>>>>>>>> copy
> > > > > >> >>>>>>>>>> by an authorised signatory.
> > > > > >> >>>>>>>>>>
> > > > > >> >>>>>>>>>> Zaizi Ltd is registered in England and Wales with the
> > > > > >> registration
> > > > > >> >>>>>>> number
> > > > > >> >>>>>>>>>> 6440931. The Registered Office is Brook House, 229
> > > > Shepherds
> > > > > >> Bush
> > > > > >> >>>>>>> Road,
> > > > > >> >>>>>>>>>> London W6 7AN.
> > > > > >> >>>>>>>>>
> > > > > >> >>>>>>>>>
> > > > > >> >>>>>>>>
> > > > > >> >>>>>>>>
> > > > > >> >>>>>>>> --
> > > > > >> >>>>>>>> --------------------------
> > > > > >> >>>>>>>>
> > > > > >> >>>>>>>> Benedetti Alessandro
> > > > > >> >>>>>>>> Visiting card : http://about.me/alessandro_benedetti
> > > > > >> >>>>>>>>
> > > > > >> >>>>>>>> "Tyger, tyger burning bright
> > > > > >> >>>>>>>> In the forests of the night,
> > > > > >> >>>>>>>> What immortal hand or eye
> > > > > >> >>>>>>>> Could frame thy fearful symmetry?"
> > > > > >> >>>>>>>>
> > > > > >> >>>>>>>> William Blake - Songs of Experience -1794 England
> > > > > >> >>>>>>>
> > > > > >> >>>>>>>
> > > > > >> >>>>>>
> > > > > >> >>>>>
> > > > > >> >>>>
> > > > > >> >>>>
> > > > > >> >>>>
> > > > > >> >>>> --
> > > > > >> >>>> - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
> > - -
> > > > > >> >>>> Shinichiro Abe
> > > > > >> >>>> 阿部 慎一朗
> > > > > >> >>>>
> > > > > >> >>
> > > > > >> >>
> > > > > >>
> > > > > >>
> > > > > >
> > > > >
> > > >
> > > >
> > > >
> > > > --
> > > > --------------------------
> > > >
> > > > Benedetti Alessandro
> > > > Visiting card : http://about.me/alessandro_benedetti
> > > >
> > > > "Tyger, tyger burning bright
> > > > In the forests of the night,
> > > > What immortal hand or eye
> > > > Could frame thy fearful symmetry?"
> > > >
> > > > William Blake - Songs of Experience -1794 England
> > > >
> > >
> >
> >
> >
> > --
> > --------------------------
> >
> > Benedetti Alessandro
> > Visiting card : http://about.me/alessandro_benedetti
> >
> > "Tyger, tyger burning bright
> > In the forests of the night,
> > What immortal hand or eye
> > Could frame thy fearful symmetry?"
> >
> > William Blake - Songs of Experience -1794 England
> >
>



-- 
--------------------------

Benedetti Alessandro
Visiting card : http://about.me/alessandro_benedetti

"Tyger, tyger burning bright
In the forests of the night,
What immortal hand or eye
Could frame thy fearful symmetry?"

William Blake - Songs of Experience -1794 England

Re: Solr Extracting request handler

Posted by Karl Wright <da...@gmail.com>.
A Tika extractor transformer has been coded up and committed.

The transformer takes the binary document and converts it to metadata and UTF-8 text
(replacing the standard binary stream).
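
Conceptually the extraction step is along these lines (an illustration of the idea only, not the committed connector code):

import java.io.InputStream;

import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.sax.BodyContentHandler;

// Illustration only: binary stream in, UTF-8 character data plus metadata out.
public class TikaExtractSketch {
  public static String extract(InputStream binary, Metadata metadata) throws Exception {
    AutoDetectParser parser = new AutoDetectParser();
    BodyContentHandler handler = new BodyContentHandler(-1); // -1 = no write limit
    parser.parse(binary, handler, metadata); // the parsers fill in the metadata
    return handler.toString();               // the extracted text
  }
}

In the connector the extracted text takes the place of the original binary stream on the document that is passed down the pipeline, and the Tika metadata becomes additional fields.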

Karl



On Wed, Jun 18, 2014 at 1:41 PM, Karl Wright <da...@gmail.com> wrote:

> Since a Tika transformer is critical to this plan, I'm going to code one
> up now.  Stay tuned!
> Karl
>
>
> On Wed, Jun 18, 2014 at 11:59 AM, Karl Wright <da...@gmail.com> wrote:
>
>> bq. I don't agree on this. Why is not appropriate for all the connectors ?
>>
>> Some output connectors want the document in binary form -- e.g. the HDFS
>> and  FileSystem connectors, which don't deal with metadata at all.  It's
>> not clear whether the Tika transformer would preserve the binary stream, or
>> would replace the binary stream with an extracted content stream.  I'd
>> kind-of expect the latter, but there are other ways to do it, of course.
>> But it would certainly impact performance, so it should not be a
>> requirement.  Not only that, but there's no *reason* to make it a
>> requirement, since you can very readily add it or remove it from the
>> pipeline in the UI.
>>
>> bq. So what is the problem of using Tika outside Solr?
>>
>> We've seen a number of cases where Tika inside Solr does things based on
>> (for instance) http headers that Solr receives.  Abe-san had some
>> difficulty with that a while back.  We had to repeatedly fix things when we
>> went to SolrJ to make sure various headers were compatible so that SolrCell
>> worked the same.  I'd rather not re-implement SolrCell precisely in
>> ManifoldCF if I can help it.
>>
>> bq. Solr Extract is using Tika under the hood, nothing more.
>>
>> It's more complicated than that.  Have a look at the code.
>>
>> bq. probably a simple flag can fit to operate in one way or another.
>>
>> I agree that that should be sufficient.
>>
>> Karl
>>
>>
>>
>> On Wed, Jun 18, 2014 at 11:35 AM, Alessandro Benedetti <
>> benedetti.alex85@gmail.com> wrote:
>>
>>> 2014-06-18 16:10 GMT+01:00 Karl Wright <da...@gmail.com>:
>>>
>>> > Hi Alessandro,
>>> >
>>> > The reason for backwards compatibility is obvious: people upgrade
>>> > ManifoldCF all the time, and when they do it should not stop working
>>> for
>>> > them.
>>> >
>>> Ok i agree !
>>>
>>> >
>>> > Putting Tika all the time in the pipeline is also not appropriate for
>>> other
>>> > output connections.
>>>
>>>
>>> I don't agree on this. Why is not appropriate for all the connectors ?
>>> The conceptual responsibility of an output Connector should be to post a
>>> RespositoryDocument to an output ( whatever we want) .
>>> A RepositoryDocument is a map Field-> value.
>>> The content is nothing than a one of these fields.
>>> So I can not see why after we have a RepositoryDocument ( with content
>>> extracted) , should not be possible to send it independently to any
>>> OutputConnector.
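>>>
>>> Roughly (sketch only; "fields" is a hypothetical java.util.Map<String, List<String>>
>>> standing in for the document's metadata plus content, and solrServer is assumed):
>>>
>>> SolrInputDocument doc = new SolrInputDocument();
>>> for (Map.Entry<String, List<String>> field : fields.entrySet()) {
>>>   for (String value : field.getValue()) {
>>>     doc.addField(field.getKey(), value);
>>>   }
>>> }
>>> solrServer.add(doc);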
>>>
>>>
>>> >  Even if you did it just for Solr, you'd then have to
>>> > insure that the Tika transformer was exactly compatible with Solr Cell,
>>> > which I would be very uncomfortable with agreeing to.
>>> >
>>>
>>> So what is the problem of using Tika outside Solr? We will add the most
>>> recent version of Tika, that will be gradually upgraded over time with
>>> the
>>> platform.
>>>
>>> Solr Extract is using Tika under the hood, nothing more.
>>>
>>>
>>>
>>> > So let's presume that you'd do one of two things.  Either:
>>> >
>>> > - Leave the existing Solr connector alone, and create a whole new Solr
>>> > connector designed to work with a Tika transformer, or
>>> > - Modify the existing Solr connector so that it operates in two
>>> possible
>>> > modes, one of which supports the legacy model (the default), and one of
>>> > which supports your new model
>>> >
>>>
>>> probably a simple flag can fit to operate in one way or another.
>>>
>>> >
>>> > If this sounds overly burdensome, I'm sorry but it's necessary until
>>> MCF
>>> > 2.0.  For MCF 2.0, which I've begun to think about, we can dispense
>>> with
>>> > backwards compatibility, including legacy tabs that have outlived their
>>> > usefulness, etc.  But that's not a 1.7 solution.
>>> >
>>> > Karl
>>> >
>>>
>>> Cheers
>>>
>>> >
>>> >
>>> >
>>> > On Wed, Jun 18, 2014 at 10:16 AM, Alessandro Benedetti <
>>> > benedetti.alex85@gmail.com> wrote:
>>> >
>>> > > Hello Karl,
>>> > > What i was thinking is:
>>> > > assuming we have the Tika Connector, the responsibility to extract
>>> > content
>>> > > will pass from Solr to the Tika processor.
>>> > >
>>> > > So we can change the part in the Solr Connector that manages the
>>> building
>>> > > of the request to send to the Extract update handler.
>>> > > Particularly that part will change in the classic way: usually it's
>>> good
>>> > to
>>> > > build a SolrDocument in SolrJ and then add it to SolrServer.
>>> > >
>>> > > Why should we give retrocompatibility from Solr Connector point of
>>> view ?
>>> > > From the user point of view, a Job will be selected with the Tika
>>> > Conenctor
>>> > > in the pipeline, so we are providing the same identical feature.
>>> > > One way can be to make the Tika Processor Connector by default in the
>>> > > pipeline, and someone will be able to deactivate it only if needed.
>>> > >
>>> > > Cheers
>>> > >
>>> > >
>>> > >
>>> > > 2014-06-18 14:32 GMT+01:00 Karl Wright <da...@gmail.com>:
>>> > >
>>> > > > Hi Alessandro,
>>> > > > What is your concrete proposal to change the Solr connector?  Bear
>>> in
>>> > > mind
>>> > > > that we do need to maintain backwards compatibility.  If you list
>>> your
>>> > > > specific changes, not in any huge detail, but with enough detail
>>> that
>>> > we
>>> > > > understand your proposal, that would help.  What happens to the UI?
>>> >  What
>>> > > > happens to the internals?
>>> > > >
>>> > > > Thanks,
>>> > > > Karl
>>> > > >
>>> > > >
>>> > > >
>>> > > > On Wed, Jun 18, 2014 at 9:21 AM, Alessandro Benedetti <
>>> > > > benedetti.alex85@gmail.com> wrote:
>>> > > >
>>> > > > > But guys, why not simply pass to a classic SolrJ SolrDocument
>>> > creation
>>> > > > and
>>> > > > > ingestion in the Solr Server ? Easy and Straighforward !
>>> > > > >
>>> > > > > In the end at that point the RepositoryDocument will me only a
>>> Map of
>>> > > > > metadata and values.
>>> > > > > Content will be part of that, so I guess the conversion to a
>>> > > SolrDocument
>>> > > > > will be immediate.
>>> > > > >
>>> > > > > Cheers
>>> > > > >
>>> > > > >
>>> > > > > 2014-06-18 3:26 GMT+01:00 Karl Wright <da...@gmail.com>:
>>> > > > >
>>> > > > > > Hi Abe-san,
>>> > > > > >
>>> > > > > > Near as I can tell, the major consumer of disk space is the
>>> Maven
>>> > > > target
>>> > > > > > directories.  This is generating many tens of megabytes of
>>> > temporary
>>> > > > disk
>>> > > > > > usage for every connector.  Luckily if you use ant, this is
>>> not a
>>> > > > > problem.
>>> > > > > >
>>> > > > > > Karl
>>> > > > > >
>>> > > > > >
>>> > > > > > On Tue, Jun 17, 2014 at 9:55 PM, Karl Wright <
>>> daddywri@gmail.com>
>>> > > > wrote:
>>> > > > > >
>>> > > > > > > Hi Abe-san,
>>> > > > > > >
>>> > > > > > > Tika jars are not very big:
>>> > > > > > >
>>> > > > > > > C:\wip\mcf\trunk\lib>dir tika*
>>> > > > > > >  Volume in drive C has no label.
>>> > > > > > >  Volume Serial Number is 002E-D1F0
>>> > > > > > >
>>> > > > > > >  Directory of C:\wip\mcf\trunk\lib
>>> > > > > > >
>>> > > > > > > 06/05/2014  08:21 AM           493,374 tika-core.jar
>>> > > > > > > 06/05/2014  08:21 AM           523,677 tika-parsers.jar
>>> > > > > > >                2 File(s)      1,017,051 bytes
>>> > > > > > >                0 Dir(s)  140,792,315,904 bytes free
>>> > > > > > >
>>> > > > > > > The entire lib directory is 85M:
>>> > > > > > >
>>> > > > > > > 85,156,330 bytes
>>> > > > > > >
>>> > > > > > > The built binary image is still about 185Mb, I believe.  So I
>>> > don't
>>> > > > > know
>>> > > > > > > why you think it is >1Gb?  Temporary class files?  I don't
>>> think
>>> > we
>>> > > > can
>>> > > > > > > avoid those.
>>> > > > > > >
>>> > > > > > > I'd rather not make things more complicated than they need
>>> to be
>>> > by
>>> > > > > > adding
>>> > > > > > > a new required service - even though it would fit naturally
>>> with
>>> > > the
>>> > > > > > > connector arrangement.
>>> > > > > > >
>>> > > > > > > Karl
>>> > > > > > >
>>> > > > > > >
>>> > > > > > >
>>> > > > > > >
>>> > > > > > >
>>> > > > > > > On Tue, Jun 17, 2014 at 9:42 PM, Shinichiro Abe <
>>> > > > > > > shinichiro.abe.1@gmail.com> wrote:
>>> > > > > > >
>>> > > > > > >> Hi Karl,
>>> > > > > > >>
>>> > > > > > >> Okay, I assumed Tika connector outputs files.
>>> > > > > > >> If we post character data metadata got from Tika,
>>> > > "/update/extract"
>>> > > > > > >> handler
>>> > > > > > >> can handle this(provides params:
>>> > > > > > >> literal.content=value&literal.metaField=foobar
>>> > > > > > >> with using NullInputStream for binary data like
>>> CONNECTORS-936).
>>> > > > > > >>
>>> > > > > > >> BTW, now trunk built size is too big(1G+). Maybe because
>>> > > CloudSearch
>>> > > > > > >> connector uses Tika jars.
>>> > > > > > >> Tika connector and CloudSearch connector should extract
>>> text via
>>> > > > > > >> tika-server[1]
>>> > > > > > >> and MCF should not have many Tika jars, do you think?
>>> > > > > > >>
>>> > > > > > >> [1]
>>> > > > > > >> http://wiki.apache.org/tika/TikaJAXRS
>>> > > > > > >>
>>> > > > > > >> Thanks,
>>> > > > > > >> Shinichiro Abe
>>> > > > > > >>
>>> > > > > > >> On 2014/06/18, at 9:45, Karl Wright <da...@gmail.com>
>>> wrote:
>>> > > > > > >>
>>> > > > > > >> > Hi Abe-san,
>>> > > > > > >> >
>>> > > > > > >> > It sounds like you might be thinking that transformation
>>> > > > connectors
>>> > > > > > are
>>> > > > > > >> > like output connectors.  Just so we are clear,
>>> transformation
>>> > > > > > >> connectors in
>>> > > > > > >> > 1.7 receive a RepositoryDocument as input, and then pass a
>>> > > > > > >> > RepositoryDocument on to the next connector in the chain.
>>>  So
>>> > I
>>> > > > > don't
>>> > > > > > >> know
>>> > > > > > >> > why .xml files would be involved.  I'd expect the Tika
>>> > connector
>>> > > > to
>>> > > > > > >> read a
>>> > > > > > >> > binary file from one RepositoryDocument object and
>>> convert its
>>> > > > > > contents
>>> > > > > > >> to
>>> > > > > > >> > another RepositoryDocument object which would have
>>> character
>>> > > data
>>> > > > > and
>>> > > > > > >> > metadata only.  Would this work for your case, do you
>>> think?
>>> > > > > > >> >
>>> > > > > > >> > Karl
>>> > > > > > >> >
>>> > > > > > >> >
>>> > > > > > >> >
>>> > > > > > >> > On Tue, Jun 17, 2014 at 8:38 PM, Shinichiro Abe <
>>> > > > > > >> shinichiro.abe.1@gmail.com>
>>> > > > > > >> > wrote:
>>> > > > > > >> >
>>> > > > > > >> >> Hi Karl,
>>> > > > > > >> >>
>>> > > > > > >> >> Yes. I thought the standard update handler met that
>>> > > requirement.
>>> > > > > > >> >> For instance, Tika extractor transformation connector
>>> creates
>>> > > two
>>> > > > > > >> files.
>>> > > > > > >> >> 1. addtoSolr.xml for add and update
>>> > > > > > >> >> 2. deletetoSolr.xml for delete
>>> > > > > > >> >> File connector ingests these xml files, then Solr
>>> connector
>>> > > posts
>>> > > > > > these
>>> > > > > > >> >> files by "/update" handler.
>>> > > > > > >> >>
>>> > > > > > >> >> In the the Solr Connector, other function as to update
>>> > handler
>>> > > > > > >> >> might not be necessary except for  "/update" handler.
>>> > > > > > >> >>
>>> > > > > > >> >> Thanks,
>>> > > > > > >> >> Shinichiro Abe
>>> > > > > > >> >>
>>> > > > > > >> >> On 2014/06/18, at 8:02, Karl Wright <da...@gmail.com>
>>> > > wrote:
>>> > > > > > >> >>
>>> > > > > > >> >>> Hi Abe-san,
>>> > > > > > >> >>>
>>> > > > > > >> >>> So just to be sure -- you believe that no changes at
>>> all are
>>> > > > > > required
>>> > > > > > >> to
>>> > > > > > >> >>> the Solr Connector as it stands now, other than to use
>>> the
>>> > > > update
>>> > > > > > >> handler
>>> > > > > > >> >>> rather than the /update/extract handler?
>>> > > > > > >> >>>
>>> > > > > > >> >>> Karl
>>> > > > > > >> >>>
>>> > > > > > >> >>>
>>> > > > > > >> >>>
>>> > > > > > >> >>>
>>> > > > > > >> >>>
>>> > > > > > >> >>> On Tue, Jun 17, 2014 at 5:14 PM, Shinichiro Abe <
>>> > > > > > >> >> shinichiro.abe.1@gmail.com>
>>> > > > > > >> >>> wrote:
>>> > > > > > >> >>>
>>> > > > > > >> >>>>> As for changing the Solr connector so that it doesn't
>>> go
>>> > to
>>> > > > the
>>> > > > > > >> >> extracting
>>> > > > > > >> >>>> update handler
>>> > > > > > >> >>>>
>>> > > > > > >> >>>> I don't think it needs to change Solr connector with
>>> new
>>> > > > checkbox
>>> > > > > > >> >> because
>>> > > > > > >> >>>> currently we can change "/update/extract" into
>>> "/update" at
>>> > > > > 'Update
>>> > > > > > >> >>>> Handler' at Paths tab in Solr connector UI. I
>>> confirmed I
>>> > > could
>>> > > > > > post
>>> > > > > > >> >> CSV,
>>> > > > > > >> >>>> JSON and XML files to Solr by changing that and using
>>> File
>>> > > > > > connector.
>>> > > > > > >> >> So I
>>> > > > > > >> >>>> wish we allow Tika extractor transformation connector
>>> to
>>> > > create
>>> > > > > XML
>>> > > > > > >> >> files
>>> > > > > > >> >>>> that Solr expects to see.
>>> > > > > > >> >>>>
>>> > > > > > >> >>>> Regards,
>>> > > > > > >> >>>> Shinichiro Abe
>>> > > > > > >> >>>>
>>> > > > > > >> >>>>
>>> > > > > > >> >>>> 2014-06-18 2:55 GMT+09:00 Karl Wright <
>>> daddywri@gmail.com
>>> > >:
>>> > > > > > >> >>>>
>>> > > > > > >> >>>>> The pipeline code itself is now "complete" in trunk.
>>> >  Zaizi
>>> > > > said
>>> > > > > > >> they'd
>>> > > > > > >> >>>>> contribute a Tika extractor transformation connector
>>> - and
>>> > > if
>>> > > > > they
>>> > > > > > >> >> don't
>>> > > > > > >> >>>>> get around to that in a month or so, I may take a
>>> crack at
>>> > > it
>>> > > > > > >> myself.
>>> > > > > > >> >>>>>
>>> > > > > > >> >>>>> As for changing the Solr connector so that it doesn't
>>> go
>>> > to
>>> > > > the
>>> > > > > > >> >>>> extracting
>>> > > > > > >> >>>>> update handler, it would be great if:
>>> > > > > > >> >>>>> (1) Someone created a ticket for this, and
>>> > > > > > >> >>>>> (2) A patch was provided that maintains backwards
>>> > > > compatibility
>>> > > > > > with
>>> > > > > > >> >>>>> previous versions of the connector (so a checkbox
>>> would
>>> > > > probably
>>> > > > > > >> need
>>> > > > > > >> >> to
>>> > > > > > >> >>>> go
>>> > > > > > >> >>>>> into the UI somewhere).  Do either of you want to
>>> start
>>> > this
>>> > > > > > >> process?
>>> > > > > > >> >>>>>
>>> > > > > > >> >>>>> Thanks!
>>> > > > > > >> >>>>> Karl
>>> > > > > > >> >>>>>
>>> > > > > > >> >>>>>
>>> > > > > > >> >>>>>
>>> > > > > > >> >>>>> On Mon, Jun 16, 2014 at 12:37 PM, Karl Wright <
>>> > > > > daddywri@gmail.com
>>> > > > > > >
>>> > > > > > >> >>>> wrote:
>>> > > > > > >> >>>>>
>>> > > > > > >> >>>>>> Hi guys,
>>> > > > > > >> >>>>>>
>>> > > > > > >> >>>>>> You folks may not have looked at 1.7 yet, but it has
>>> a
>>> > full
>>> > > > > > >> pipeline,
>>> > > > > > >> >>>> and
>>> > > > > > >> >>>>>> is expected to have a Tika extractor as a
>>> transformation
>>> > > > > > connector.
>>> > > > > > >> >>>>>>
>>> > > > > > >> >>>>>> Karl
>>> > > > > > >> >>>>>>
>>> > > > > > >> >>>>>>
>>> > > > > > >> >>>>>>
>>> > > > > > >> >>>>>> On Mon, Jun 16, 2014 at 11:14 AM, Matteo Grolla <
>>> > > > > > >> >>>>> m.grolla@sourcesense.com>
>>> > > > > > >> >>>>>> wrote:
>>> > > > > > >> >>>>>>
>>> > > > > > >> >>>>>>> Thanks Alessandro,
>>> > > > > > >> >>>>>>>       that explains the situation clearly.
>>> > > > > > >> >>>>>>> And I agree that sending all the metadata as get
>>> > parameter
>>> > > > can
>>> > > > > > be
>>> > > > > > >> >>>>>>> problematic
>>> > > > > > >> >>>>>>>
>>> > > > > > >> >>>>>>> Cheers
>>> > > > > > >> >>>>>>>
>>> > > > > > >> >>>>>>> --
>>> > > > > > >> >>>>>>> Matteo Grolla
>>> > > > > > >> >>>>>>> Sourcesense - making sense of Open Source
>>> > > > > > >> >>>>>>> http://www.sourcesense.com
>>> > > > > > >> >>>>>>>
>>> > > > > > >> >>>>>>> Il giorno 16/giu/2014, alle ore 17:09, Alessandro
>>> > > Benedetti
>>> > > > ha
>>> > > > > > >> >>>> scritto:
>>> > > > > > >> >>>>>>>
>>> > > > > > >> >>>>>>>> mmmm the point is that right now ManifoldCF has no
>>> > > > > extractors.
>>> > > > > > >> >>>>>>>> The Repository connectors extracts directly the
>>> binary
>>> > > and
>>> > > > > > there
>>> > > > > > >> is
>>> > > > > > >> >>>> no
>>> > > > > > >> >>>>>>>> "Extractor Processor" yet.
>>> > > > > > >> >>>>>>>> But recently a pipe-line processor architecture has
>>> > been
>>> > > > > > thought
>>> > > > > > >> (
>>> > > > > > >> >>>>>>>>
>>> https://issues.apache.org/jira/browse/CONNECTORS-959)
>>> > > > > > >> >>>>>>>> So can fit there.
>>> > > > > > >> >>>>>>>>
>>> > > > > > >> >>>>>>>> Cheers
>>> > > > > > >> >>>>>>>>
>>> > > > > > >> >>>>>>>>
>>> > > > > > >> >>>>>>>> 2014-06-16 15:59 GMT+01:00 Matteo Grolla <
>>> > > > > > >> m.grolla@sourcesense.com
>>> > > > > > >> >>>>> :
>>> > > > > > >> >>>>>>>>
>>> > > > > > >> >>>>>>>>> Since Solr extracting request handler takes the
>>> binary
>>> > > and
>>> > > > > > >> extracts
>>> > > > > > >> >>>>>>> text
>>> > > > > > >> >>>>>>>>> what is the point of not using Manifold extractor
>>> and
>>> > > send
>>> > > > > > text
>>> > > > > > >> and
>>> > > > > > >> >>>>>>>>> binaries to solr?
>>> > > > > > >> >>>>>>>>> I mean the end result is the same solr indexes
>>> text
>>> > and
>>> > > > > stores
>>> > > > > > >> text
>>> > > > > > >> >>>>>>>>> So if manifold supports text extraction it seems
>>> me
>>> > this
>>> > > > is
>>> > > > > > the
>>> > > > > > >> >>>> place
>>> > > > > > >> >>>>>>>>> where it should be done
>>> > > > > > >> >>>>>>>>>
>>> > > > > > >> >>>>>>>>> --
>>> > > > > > >> >>>>>>>>> Matteo Grolla
>>> > > > > > >> >>>>>>>>> Sourcesense - making sense of Open Source
>>> > > > > > >> >>>>>>>>> http://www.sourcesense.com
>>> > > > > > >> >>>>>>>>>
>>> > > > > > >> >>>>>>>>> Il giorno 16/giu/2014, alle ore 16:51, Antonio
>>> David
>>> > > Perez
>>> > > > > > >> Morales
>>> > > > > > >> >>>> ha
>>> > > > > > >> >>>>>>>>> scritto:
>>> > > > > > >> >>>>>>>>>
>>> > > > > > >> >>>>>>>>>> Hi Matteo
>>> > > > > > >> >>>>>>>>>>
>>> > > > > > >> >>>>>>>>>> Manifold already handles the extraction, but the
>>> only
>>> > > way
>>> > > > > to
>>> > > > > > >> send
>>> > > > > > >> >>>>>>> binary
>>> > > > > > >> >>>>>>>>>> content and document metadata to Solr is using
>>> the
>>> > > > > > >> update/extract
>>> > > > > > >> >>>>>>>>> handler,
>>> > > > > > >> >>>>>>>>>> where the metadata is sent as query parameters
>>> and
>>> > the
>>> > > > > binary
>>> > > > > > >> >>>>> content
>>> > > > > > >> >>>>>>> is
>>> > > > > > >> >>>>>>>>>> sent in the body of the requests, allowing Solr
>>> to
>>> > use
>>> > > > Tika
>>> > > > > > to
>>> > > > > > >> >>>>> obtain
>>> > > > > > >> >>>>>>> the
>>> > > > > > >> >>>>>>>>>> raw content to be stored in Solr.
>>> > > > > > >> >>>>>>>>>>
>>> > > > > > >> >>>>>>>>>> Regards
>>> > > > > > >> >>>>>>>>>>
>>> > > > > > >> >>>>>>>>>>
>>> > > > > > >> >>>>>>>>>> On Mon, Jun 16, 2014 at 4:35 PM, Matteo Grolla <
>>> > > > > > >> >>>>>>> m.grolla@sourcesense.com
>>> > > > > > >> >>>>>>>>>>
>>> > > > > > >> >>>>>>>>>> wrote:
>>> > > > > > >> >>>>>>>>>>
>>> > > > > > >> >>>>>>>>>>> Hi During my first indexing I noticed that
>>> manifold
>>> > > uses
>>> > > > > > Solr
>>> > > > > > >> >>>>>>> extracting
>>> > > > > > >> >>>>>>>>>>> request handler to extract the content of an xml
>>> > file
>>> > > > > > >> >>>>>>>>>>> For performance reasons it would be better if
>>> > Manifold
>>> > > > > > handled
>>> > > > > > >> >>>> the
>>> > > > > > >> >>>>>>>>>>> extraction letting Solr do the search engine
>>> > > > > > >> >>>>>>>>>>> Is this because of the connector design,
>>> framework
>>> > > > design
>>> > > > > or
>>> > > > > > >> just
>>> > > > > > >> >>>>> to
>>> > > > > > >> >>>>>>> be
>>> > > > > > >> >>>>>>>>>>> done?
>>> > > > > > >> >>>>>>>>>>>
>>> > > > > > >> >>>>>>>>>>> --
>>> > > > > > >> >>>>>>>>>>> Matteo Grolla
>>> > > > > > >> >>>>>>>>>>> Sourcesense - making sense of Open Source
>>> > > > > > >> >>>>>>>>>>> http://www.sourcesense.com
>>> > > > > > >> >>>>>>>>>>>
>>> > > > > > >> >>>>>>>>>>>
>>> > > > > > >> >>>>>>>>>>
>>> > > > > > >> >>>>>>>>>> --
>>> > > > > > >> >>>>>>>>>>
>>> > > > > > >> >>>>>>>>>> ------------------------------
>>> > > > > > >> >>>>>>>>>> This message should be regarded as confidential.
>>> If
>>> > you
>>> > > > > have
>>> > > > > > >> >>>>> received
>>> > > > > > >> >>>>>>>>> this
>>> > > > > > >> >>>>>>>>>> email in error please notify the sender and
>>> destroy
>>> > it
>>> > > > > > >> >>>> immediately.
>>> > > > > > >> >>>>>>>>>> Statements of intent shall only become binding
>>> when
>>> > > > > confirmed
>>> > > > > > >> in
>>> > > > > > >> >>>>> hard
>>> > > > > > >> >>>>>>>>> copy
>>> > > > > > >> >>>>>>>>>> by an authorised signatory.
>>> > > > > > >> >>>>>>>>>>
>>> > > > > > >> >>>>>>>>>> Zaizi Ltd is registered in England and Wales
>>> with the
>>> > > > > > >> registration
>>> > > > > > >> >>>>>>> number
>>> > > > > > >> >>>>>>>>>> 6440931. The Registered Office is Brook House,
>>> 229
>>> > > > > Shepherds
>>> > > > > > >> Bush
>>> > > > > > >> >>>>>>> Road,
>>> > > > > > >> >>>>>>>>>> London W6 7AN.
>>> > > > > > >> >>>>>>>>>
>>> > > > > > >> >>>>>>>>>
>>> > > > > > >> >>>>>>>>
>>> > > > > > >> >>>>>>>>
>>> > > > > > >> >>>>>>>> --
>>> > > > > > >> >>>>>>>> --------------------------
>>> > > > > > >> >>>>>>>>
>>> > > > > > >> >>>>>>>> Benedetti Alessandro
>>> > > > > > >> >>>>>>>> Visiting card :
>>> http://about.me/alessandro_benedetti
>>> > > > > > >> >>>>>>>>
>>> > > > > > >> >>>>>>>> "Tyger, tyger burning bright
>>> > > > > > >> >>>>>>>> In the forests of the night,
>>> > > > > > >> >>>>>>>> What immortal hand or eye
>>> > > > > > >> >>>>>>>> Could frame thy fearful symmetry?"
>>> > > > > > >> >>>>>>>>
>>> > > > > > >> >>>>>>>> William Blake - Songs of Experience -1794 England
>>> > > > > > >> >>>>>>>
>>> > > > > > >> >>>>>>>
>>> > > > > > >> >>>>>>
>>> > > > > > >> >>>>>
>>> > > > > > >> >>>>
>>> > > > > > >> >>>>
>>> > > > > > >> >>>>
>>> > > > > > >> >>>> --
>>> > > > > > >> >>>> - - - - - - - - - - - - - - - - - - - - - - - - - - -
>>> - - -
>>> > > - -
>>> > > > > > >> >>>> Shinichiro Abe
>>> > > > > > >> >>>> 阿部 慎一朗
>>> > > > > > >> >>>>
>>> > > > > > >> >>
>>> > > > > > >> >>
>>> > > > > > >>
>>> > > > > > >>
>>> > > > > > >
>>> > > > > >
>>> > > > >
>>> > > > >
>>> > > > >
>>> > > > > --
>>> > > > > --------------------------
>>> > > > >
>>> > > > > Benedetti Alessandro
>>> > > > > Visiting card : http://about.me/alessandro_benedetti
>>> > > > >
>>> > > > > "Tyger, tyger burning bright
>>> > > > > In the forests of the night,
>>> > > > > What immortal hand or eye
>>> > > > > Could frame thy fearful symmetry?"
>>> > > > >
>>> > > > > William Blake - Songs of Experience -1794 England
>>> > > > >
>>> > > >
>>> > >
>>> > >
>>> > >
>>> > > --
>>> > > --------------------------
>>> > >
>>> > > Benedetti Alessandro
>>> > > Visiting card : http://about.me/alessandro_benedetti
>>> > >
>>> > > "Tyger, tyger burning bright
>>> > > In the forests of the night,
>>> > > What immortal hand or eye
>>> > > Could frame thy fearful symmetry?"
>>> > >
>>> > > William Blake - Songs of Experience -1794 England
>>> > >
>>> >
>>>
>>>
>>>
>>> --
>>> --------------------------
>>>
>>> Benedetti Alessandro
>>> Visiting card : http://about.me/alessandro_benedetti
>>>
>>> "Tyger, tyger burning bright
>>> In the forests of the night,
>>> What immortal hand or eye
>>> Could frame thy fearful symmetry?"
>>>
>>> William Blake - Songs of Experience -1794 England
>>>
>>
>>
>

Re: Solr Extracting request handler

Posted by Karl Wright <da...@gmail.com>.
Since a Tika transformer is critical to this plan, I'm going to code one up
now.  Stay tuned!
Karl


On Wed, Jun 18, 2014 at 11:59 AM, Karl Wright <da...@gmail.com> wrote:

> bq. I don't agree on this. Why is not appropriate for all the connectors ?
>
> Some output connectors want the document in binary form -- e.g. the HDFS
> and  FileSystem connectors, which don't deal with metadata at all.  It's
> not clear whether the Tika transformer would preserve the binary stream, or
> would replace the binary stream with an extracted content stream.  I'd
> kind-of expect the latter, but there are other ways to do it, of course.
> But it would certainly impact performance, so it should not be a
> requirement.  Not only that, but there's no *reason* to make it a
> requirement, since you can very readily add it or remove it from the
> pipeline in the UI.
>
> bq. So what is the problem of using Tika outside Solr?
>
> We've seen a number of cases where Tika inside Solr does things based on
> (for instance) http headers that Solr receives.  Abe-san had some
> difficulty with that a while back.  We had to repeatedly fix things when we
> went to SolrJ to make sure various headers were compatible so that SolrCell
> worked the same.  I'd rather not re-implement SolrCell precisely in
> ManifoldCF if I can help it.
>
> bq. Solr Extract is using Tika under the hood, nothing more.
>
> It's more complicated than that.  Have a look at the code.
>
> bq. probably a simple flag can fit to operate in one way or another.
>
> I agree that that should be sufficient.
>
> Karl
>
>
>
> On Wed, Jun 18, 2014 at 11:35 AM, Alessandro Benedetti <
> benedetti.alex85@gmail.com> wrote:
>
>> 2014-06-18 16:10 GMT+01:00 Karl Wright <da...@gmail.com>:
>>
>> > Hi Alessandro,
>> >
>> > The reason for backwards compatibility is obvious: people upgrade
>> > ManifoldCF all the time, and when they do it should not stop working for
>> > them.
>> >
>> Ok i agree !
>>
>> >
>> > Putting Tika all the time in the pipeline is also not appropriate for
>> other
>> > output connections.
>>
>>
>> I don't agree on this. Why is not appropriate for all the connectors ?
>> The conceptual responsibility of an output Connector should be to post a
>> RespositoryDocument to an output ( whatever we want) .
>> A RepositoryDocument is a map Field-> value.
>> The content is nothing than a one of these fields.
>> So I can not see why after we have a RepositoryDocument ( with content
>> extracted) , should not be possible to send it independently to any
>> OutputConnector.
>>
>>
>> >  Even if you did it just for Solr, you'd then have to
>> > insure that the Tika transformer was exactly compatible with Solr Cell,
>> > which I would be very uncomfortable with agreeing to.
>> >
>>
>> So what is the problem of using Tika outside Solr? We will add the most
>> recent version of Tika, that will be gradually upgraded over time with the
>> platform.
>>
>> Solr Extract is using Tika under the hood, nothing more.
>>
>>
>>
>> > So let's presume that you'd do one of two things.  Either:
>> >
>> > - Leave the existing Solr connector alone, and create a whole new Solr
>> > connector designed to work with a Tika transformer, or
>> > - Modify the existing Solr connector so that it operates in two possible
>> > modes, one of which supports the legacy model (the default), and one of
>> > which supports your new model
>> >
>>
>> probably a simple flag can fit to operate in one way or another.
>>
>> >
>> > If this sounds overly burdensome, I'm sorry but it's necessary until MCF
>> > 2.0.  For MCF 2.0, which I've begun to think about, we can dispense with
>> > backwards compatibility, including legacy tabs that have outlived their
>> > usefulness, etc.  But that's not a 1.7 solution.
>> >
>> > Karl
>> >
>>
>> Cheers
>>
>> >
>> >
>> >
>> > On Wed, Jun 18, 2014 at 10:16 AM, Alessandro Benedetti <
>> > benedetti.alex85@gmail.com> wrote:
>> >
>> > > Hello Karl,
>> > > What i was thinking is:
>> > > assuming we have the Tika Connector, the responsibility to extract
>> > content
>> > > will pass from Solr to the Tika processor.
>> > >
>> > > So we can change the part in the Solr Connector that manages the
>> building
>> > > of the request to send to the Extract update handler.
>> > > Particularly that part will change in the classic way: usually it's
>> good
>> > to
>> > > build a SolrDocument in SolrJ and then add it to SolrServer.
>> > >
>> > > Why should we give retrocompatibility from Solr Connector point of
>> view ?
>> > > From the user point of view, a Job will be selected with the Tika
>> > Conenctor
>> > > in the pipeline, so we are providing the same identical feature.
>> > > One way can be to make the Tika Processor Connector by default in the
>> > > pipeline, and someone will be able to deactivate it only if needed.
>> > >
>> > > Cheers
>> > >
>> > >
>> > >
>> > > 2014-06-18 14:32 GMT+01:00 Karl Wright <da...@gmail.com>:
>> > >
>> > > > Hi Alessandro,
>> > > > What is your concrete proposal to change the Solr connector?  Bear
>> in
>> > > mind
>> > > > that we do need to maintain backwards compatibility.  If you list
>> your
>> > > > specific changes, not in any huge detail, but with enough detail
>> that
>> > we
>> > > > understand your proposal, that would help.  What happens to the UI?
>> >  What
>> > > > happens to the internals?
>> > > >
>> > > > Thanks,
>> > > > Karl
>> > > >
>> > > >
>> > > >
>> > > > On Wed, Jun 18, 2014 at 9:21 AM, Alessandro Benedetti <
>> > > > benedetti.alex85@gmail.com> wrote:
>> > > >
>> > > > > But guys, why not simply pass to a classic SolrJ SolrDocument
>> > creation
>> > > > and
>> > > > > ingestion in the Solr Server ? Easy and Straighforward !
>> > > > >
>> > > > > In the end at that point the RepositoryDocument will me only a
>> Map of
>> > > > > metadata and values.
>> > > > > Content will be part of that, so I guess the conversion to a
>> > > SolrDocument
>> > > > > will be immediate.
>> > > > >
>> > > > > Cheers
>> > > > >
>> > > > >
>> > > > > 2014-06-18 3:26 GMT+01:00 Karl Wright <da...@gmail.com>:
>> > > > >
>> > > > > > Hi Abe-san,
>> > > > > >
>> > > > > > Near as I can tell, the major consumer of disk space is the
>> Maven
>> > > > target
>> > > > > > directories.  This is generating many tens of megabytes of
>> > temporary
>> > > > disk
>> > > > > > usage for every connector.  Luckily if you use ant, this is not
>> a
>> > > > > problem.
>> > > > > >
>> > > > > > Karl
>> > > > > >
>> > > > > >
>> > > > > > On Tue, Jun 17, 2014 at 9:55 PM, Karl Wright <
>> daddywri@gmail.com>
>> > > > wrote:
>> > > > > >
>> > > > > > > Hi Abe-san,
>> > > > > > >
>> > > > > > > Tika jars are not very big:
>> > > > > > >
>> > > > > > > C:\wip\mcf\trunk\lib>dir tika*
>> > > > > > >  Volume in drive C has no label.
>> > > > > > >  Volume Serial Number is 002E-D1F0
>> > > > > > >
>> > > > > > >  Directory of C:\wip\mcf\trunk\lib
>> > > > > > >
>> > > > > > > 06/05/2014  08:21 AM           493,374 tika-core.jar
>> > > > > > > 06/05/2014  08:21 AM           523,677 tika-parsers.jar
>> > > > > > >                2 File(s)      1,017,051 bytes
>> > > > > > >                0 Dir(s)  140,792,315,904 bytes free
>> > > > > > >
>> > > > > > > The entire lib directory is 85M:
>> > > > > > >
>> > > > > > > 85,156,330 bytes
>> > > > > > >
>> > > > > > > The built binary image is still about 185Mb, I believe.  So I
>> > don't
>> > > > > know
>> > > > > > > why you think it is >1Gb?  Temporary class files?  I don't
>> think
>> > we
>> > > > can
>> > > > > > > avoid those.
>> > > > > > >
>> > > > > > > I'd rather not make things more complicated than they need to
>> be
>> > by
>> > > > > > adding
>> > > > > > > a new required service - even though it would fit naturally
>> with
>> > > the
>> > > > > > > connector arrangement.
>> > > > > > >
>> > > > > > > Karl
>> > > > > > >
>> > > > > > >
>> > > > > > >
>> > > > > > >
>> > > > > > >
>> > > > > > > On Tue, Jun 17, 2014 at 9:42 PM, Shinichiro Abe <
>> > > > > > > shinichiro.abe.1@gmail.com> wrote:
>> > > > > > >
>> > > > > > >> Hi Karl,
>> > > > > > >>
>> > > > > > >> Okay, I assumed Tika connector outputs files.
>> > > > > > >> If we post character data metadata got from Tika,
>> > > "/update/extract"
>> > > > > > >> handler
>> > > > > > >> can handle this(provides params:
>> > > > > > >> literal.content=value&literal.metaField=foobar
>> > > > > > >> with using NullInputStream for binary data like
>> CONNECTORS-936).
>> > > > > > >>
>> > > > > > >> BTW, now trunk built size is too big(1G+). Maybe because
>> > > CloudSearch
>> > > > > > >> connector uses Tika jars.
>> > > > > > >> Tika connector and CloudSearch connector should extract text
>> via
>> > > > > > >> tika-server[1]
>> > > > > > >> and MCF should not have many Tika jars, do you think?
>> > > > > > >>
>> > > > > > >> [1]
>> > > > > > >> http://wiki.apache.org/tika/TikaJAXRS
>> > > > > > >>
>> > > > > > >> Thanks,
>> > > > > > >> Shinichiro Abe
>> > > > > > >>
>> > > > > > >> On 2014/06/18, at 9:45, Karl Wright <da...@gmail.com>
>> wrote:
>> > > > > > >>
>> > > > > > >> > Hi Abe-san,
>> > > > > > >> >
>> > > > > > >> > It sounds like you might be thinking that transformation
>> > > > connectors
>> > > > > > are
>> > > > > > >> > like output connectors.  Just so we are clear,
>> transformation
>> > > > > > >> connectors in
>> > > > > > >> > 1.7 receive a RepositoryDocument as input, and then pass a
>> > > > > > >> > RepositoryDocument on to the next connector in the chain.
>>  So
>> > I
>> > > > > don't
>> > > > > > >> know
>> > > > > > >> > why .xml files would be involved.  I'd expect the Tika
>> > connector
>> > > > to
>> > > > > > >> read a
>> > > > > > >> > binary file from one RepositoryDocument object and convert
>> its
>> > > > > > contents
>> > > > > > >> to
>> > > > > > >> > another RepositoryDocument object which would have
>> character
>> > > data
>> > > > > and
>> > > > > > >> > metadata only.  Would this work for your case, do you
>> think?
>> > > > > > >> >
>> > > > > > >> > Karl
>> > > > > > >> >
>> > > > > > >> >
>> > > > > > >> >
>> > > > > > >> > On Tue, Jun 17, 2014 at 8:38 PM, Shinichiro Abe <
>> > > > > > >> shinichiro.abe.1@gmail.com>
>> > > > > > >> > wrote:
>> > > > > > >> >
>> > > > > > >> >> Hi Karl,
>> > > > > > >> >>
>> > > > > > >> >> Yes. I thought the standard update handler met that
>> > > requirement.
>> > > > > > >> >> For instance, Tika extractor transformation connector
>> creates
>> > > two
>> > > > > > >> files.
>> > > > > > >> >> 1. addtoSolr.xml for add and update
>> > > > > > >> >> 2. deletetoSolr.xml for delete
>> > > > > > >> >> File connector ingests these xml files, then Solr
>> connector
>> > > posts
>> > > > > > these
>> > > > > > >> >> files by "/update" handler.
>> > > > > > >> >>
>> > > > > > >> >> In the the Solr Connector, other function as to update
>> > handler
>> > > > > > >> >> might not be necessary except for  "/update" handler.
>> > > > > > >> >>
>> > > > > > >> >> Thanks,
>> > > > > > >> >> Shinichiro Abe
>> > > > > > >> >>
>> > > > > > >> >> On 2014/06/18, at 8:02, Karl Wright <da...@gmail.com>
>> > > wrote:
>> > > > > > >> >>
>> > > > > > >> >>> Hi Abe-san,
>> > > > > > >> >>>
>> > > > > > >> >>> So just to be sure -- you believe that no changes at all
>> are
>> > > > > > required
>> > > > > > >> to
>> > > > > > >> >>> the Solr Connector as it stands now, other than to use
>> the
>> > > > update
>> > > > > > >> handler
>> > > > > > >> >>> rather than the /update/extract handler?
>> > > > > > >> >>>
>> > > > > > >> >>> Karl
>> > > > > > >> >>>
>> > > > > > >> >>>
>> > > > > > >> >>>
>> > > > > > >> >>>
>> > > > > > >> >>>
>> > > > > > >> >>> On Tue, Jun 17, 2014 at 5:14 PM, Shinichiro Abe <
>> > > > > > >> >> shinichiro.abe.1@gmail.com>
>> > > > > > >> >>> wrote:
>> > > > > > >> >>>
>> > > > > > >> >>>>> As for changing the Solr connector so that it doesn't
>> go
>> > to
>> > > > the
>> > > > > > >> >> extracting
>> > > > > > >> >>>> update handler
>> > > > > > >> >>>>
>> > > > > > >> >>>> I don't think it needs to change Solr connector with new
>> > > > checkbox
>> > > > > > >> >> because
>> > > > > > >> >>>> currently we can change "/update/extract" into
>> "/update" at
>> > > > > 'Update
>> > > > > > >> >>>> Handler' at Paths tab in Solr connector UI. I confirmed
>> I
>> > > could
>> > > > > > post
>> > > > > > >> >> CSV,
>> > > > > > >> >>>> JSON and XML files to Solr by changing that and using
>> File
>> > > > > > connector.
>> > > > > > >> >> So I
>> > > > > > >> >>>> wish we allow Tika extractor transformation connector to
>> > > create
>> > > > > XML
>> > > > > > >> >> files
>> > > > > > >> >>>> that Solr expects to see.
>> > > > > > >> >>>>
>> > > > > > >> >>>> Regards,
>> > > > > > >> >>>> Shinichiro Abe
>> > > > > > >> >>>>
>> > > > > > >> >>>>
>> > > > > > >> >>>> 2014-06-18 2:55 GMT+09:00 Karl Wright <
>> daddywri@gmail.com
>> > >:
>> > > > > > >> >>>>
>> > > > > > >> >>>>> The pipeline code itself is now "complete" in trunk.
>> >  Zaizi
>> > > > said
>> > > > > > >> they'd
>> > > > > > >> >>>>> contribute a Tika extractor transformation connector -
>> and
>> > > if
>> > > > > they
>> > > > > > >> >> don't
>> > > > > > >> >>>>> get around to that in a month or so, I may take a
>> crack at
>> > > it
>> > > > > > >> myself.
>> > > > > > >> >>>>>
>> > > > > > >> >>>>> As for changing the Solr connector so that it doesn't
>> go
>> > to
>> > > > the
>> > > > > > >> >>>> extracting
>> > > > > > >> >>>>> update handler, it would be great if:
>> > > > > > >> >>>>> (1) Someone created a ticket for this, and
>> > > > > > >> >>>>> (2) A patch was provided that maintains backwards
>> > > > compatibility
>> > > > > > with
>> > > > > > >> >>>>> previous versions of the connector (so a checkbox would
>> > > > probably
>> > > > > > >> need
>> > > > > > >> >> to
>> > > > > > >> >>>> go
>> > > > > > >> >>>>> into the UI somewhere).  Do either of you want to start
>> > this
>> > > > > > >> process?
>> > > > > > >> >>>>>
>> > > > > > >> >>>>> Thanks!
>> > > > > > >> >>>>> Karl
>> > > > > > >> >>>>>
>> > > > > > >> >>>>>
>> > > > > > >> >>>>>
>> > > > > > >> >>>>> On Mon, Jun 16, 2014 at 12:37 PM, Karl Wright <
>> > > > > daddywri@gmail.com
>> > > > > > >
>> > > > > > >> >>>> wrote:
>> > > > > > >> >>>>>
>> > > > > > >> >>>>>> Hi guys,
>> > > > > > >> >>>>>>
>> > > > > > >> >>>>>> You folks may not have looked at 1.7 yet, but it has a
>> > full
>> > > > > > >> pipeline,
>> > > > > > >> >>>> and
>> > > > > > >> >>>>>> is expected to have a Tika extractor as a
>> transformation
>> > > > > > connector.
>> > > > > > >> >>>>>>
>> > > > > > >> >>>>>> Karl
>> > > > > > >> >>>>>>
>> > > > > > >> >>>>>>
>> > > > > > >> >>>>>>
>> > > > > > >> >>>>>> On Mon, Jun 16, 2014 at 11:14 AM, Matteo Grolla <
>> > > > > > >> >>>>> m.grolla@sourcesense.com>
>> > > > > > >> >>>>>> wrote:
>> > > > > > >> >>>>>>
>> > > > > > >> >>>>>>> Thanks Alessandro,
>> > > > > > >> >>>>>>>       that explains the situation clearly.
>> > > > > > >> >>>>>>> And I agree that sending all the metadata as get
>> > parameter
>> > > > can
>> > > > > > be
>> > > > > > >> >>>>>>> problematic
>> > > > > > >> >>>>>>>
>> > > > > > >> >>>>>>> Cheers
>> > > > > > >> >>>>>>>
>> > > > > > >> >>>>>>> --
>> > > > > > >> >>>>>>> Matteo Grolla
>> > > > > > >> >>>>>>> Sourcesense - making sense of Open Source
>> > > > > > >> >>>>>>> http://www.sourcesense.com
>> > > > > > >> >>>>>>>
>> > > > > > >> >>>>>>> Il giorno 16/giu/2014, alle ore 17:09, Alessandro
>> > > Benedetti
>> > > > ha
>> > > > > > >> >>>> scritto:
>> > > > > > >> >>>>>>>
>> > > > > > >> >>>>>>>> mmmm the point is that right now ManifoldCF has no
>> > > > > extractors.
>> > > > > > >> >>>>>>>> The Repository connectors extracts directly the
>> binary
>> > > and
>> > > > > > there
>> > > > > > >> is
>> > > > > > >> >>>> no
>> > > > > > >> >>>>>>>> "Extractor Processor" yet.
>> > > > > > >> >>>>>>>> But recently a pipe-line processor architecture has
>> > been
>> > > > > > thought
>> > > > > > >> (
>> > > > > > >> >>>>>>>>
>> https://issues.apache.org/jira/browse/CONNECTORS-959)
>> > > > > > >> >>>>>>>> So can fit there.
>> > > > > > >> >>>>>>>>
>> > > > > > >> >>>>>>>> Cheers
>> > > > > > >> >>>>>>>>
>> > > > > > >> >>>>>>>>
>> > > > > > >> >>>>>>>> 2014-06-16 15:59 GMT+01:00 Matteo Grolla <
>> > > > > > >> m.grolla@sourcesense.com
>> > > > > > >> >>>>> :
>> > > > > > >> >>>>>>>>
>> > > > > > >> >>>>>>>>> Since Solr extracting request handler takes the
>> binary
>> > > and
>> > > > > > >> extracts
>> > > > > > >> >>>>>>> text
>> > > > > > >> >>>>>>>>> what is the point of not using Manifold extractor
>> and
>> > > send
>> > > > > > text
>> > > > > > >> and
>> > > > > > >> >>>>>>>>> binaries to solr?
>> > > > > > >> >>>>>>>>> I mean the end result is the same solr indexes text
>> > and
>> > > > > stores
>> > > > > > >> text
>> > > > > > >> >>>>>>>>> So if manifold supports text extraction it seems me
>> > this
>> > > > is
>> > > > > > the
>> > > > > > >> >>>> place
>> > > > > > >> >>>>>>>>> where it should be done
>> > > > > > >> >>>>>>>>>
>> > > > > > >> >>>>>>>>> --
>> > > > > > >> >>>>>>>>> Matteo Grolla
>> > > > > > >> >>>>>>>>> Sourcesense - making sense of Open Source
>> > > > > > >> >>>>>>>>> http://www.sourcesense.com
>> > > > > > >> >>>>>>>>>
>> > > > > > >> >>>>>>>>> Il giorno 16/giu/2014, alle ore 16:51, Antonio
>> David
>> > > Perez
>> > > > > > >> Morales
>> > > > > > >> >>>> ha
>> > > > > > >> >>>>>>>>> scritto:
>> > > > > > >> >>>>>>>>>
>> > > > > > >> >>>>>>>>>> Hi Matteo
>> > > > > > >> >>>>>>>>>>
>> > > > > > >> >>>>>>>>>> Manifold already handles the extraction, but the
>> only
>> > > way
>> > > > > to
>> > > > > > >> send
>> > > > > > >> >>>>>>> binary
>> > > > > > >> >>>>>>>>>> content and document metadata to Solr is using the
>> > > > > > >> update/extract
>> > > > > > >> >>>>>>>>> handler,
>> > > > > > >> >>>>>>>>>> where the metadata is sent as query parameters and
>> > the
>> > > > > binary
>> > > > > > >> >>>>> content
>> > > > > > >> >>>>>>> is
>> > > > > > >> >>>>>>>>>> sent in the body of the requests, allowing Solr to
>> > use
>> > > > Tika
>> > > > > > to
>> > > > > > >> >>>>> obtain
>> > > > > > >> >>>>>>> the
>> > > > > > >> >>>>>>>>>> raw content to be stored in Solr.
>> > > > > > >> >>>>>>>>>>
>> > > > > > >> >>>>>>>>>> Regards
>> > > > > > >> >>>>>>>>>>
>> > > > > > >> >>>>>>>>>>
>> > > > > > >> >>>>>>>>>> On Mon, Jun 16, 2014 at 4:35 PM, Matteo Grolla <
>> > > > > > >> >>>>>>> m.grolla@sourcesense.com
>> > > > > > >> >>>>>>>>>>
>> > > > > > >> >>>>>>>>>> wrote:
>> > > > > > >> >>>>>>>>>>
>> > > > > > >> >>>>>>>>>>> Hi During my first indexing I noticed that
>> manifold
>> > > uses
>> > > > > > Solr
>> > > > > > >> >>>>>>> extracting
>> > > > > > >> >>>>>>>>>>> request handler to extract the content of an xml
>> > file
>> > > > > > >> >>>>>>>>>>> For performance reasons it would be better if
>> > Manifold
>> > > > > > handled
>> > > > > > >> >>>> the
>> > > > > > >> >>>>>>>>>>> extraction letting Solr do the search engine
>> > > > > > >> >>>>>>>>>>> Is this because of the connector design,
>> framework
>> > > > design
>> > > > > or
>> > > > > > >> just
>> > > > > > >> >>>>> to
>> > > > > > >> >>>>>>> be
>> > > > > > >> >>>>>>>>>>> done?
>> > > > > > >> >>>>>>>>>>>
>> > > > > > >> >>>>>>>>>>> --
>> > > > > > >> >>>>>>>>>>> Matteo Grolla
>> > > > > > >> >>>>>>>>>>> Sourcesense - making sense of Open Source
>> > > > > > >> >>>>>>>>>>> http://www.sourcesense.com
>> > > > > > >> >>>>>>>>>>>
>> > > > > > >> >>>>>>>>>>>
>> > > > > > >> >>>>>>>>>>
>> > > > > > >> >>>>>>>>>> --
>> > > > > > >> >>>>>>>>>>
>> > > > > > >> >>>>>>>>>> ------------------------------
>> > > > > > >> >>>>>>>>>> This message should be regarded as confidential.
>> If
>> > you
>> > > > > have
>> > > > > > >> >>>>> received
>> > > > > > >> >>>>>>>>> this
>> > > > > > >> >>>>>>>>>> email in error please notify the sender and
>> destroy
>> > it
>> > > > > > >> >>>> immediately.
>> > > > > > >> >>>>>>>>>> Statements of intent shall only become binding
>> when
>> > > > > confirmed
>> > > > > > >> in
>> > > > > > >> >>>>> hard
>> > > > > > >> >>>>>>>>> copy
>> > > > > > >> >>>>>>>>>> by an authorised signatory.
>> > > > > > >> >>>>>>>>>>
>> > > > > > >> >>>>>>>>>> Zaizi Ltd is registered in England and Wales with
>> the
>> > > > > > >> registration
>> > > > > > >> >>>>>>> number
>> > > > > > >> >>>>>>>>>> 6440931. The Registered Office is Brook House, 229
>> > > > > Shepherds
>> > > > > > >> Bush
>> > > > > > >> >>>>>>> Road,
>> > > > > > >> >>>>>>>>>> London W6 7AN.
>> > > > > > >> >>>>>>>>>
>> > > > > > >> >>>>>>>>>
>> > > > > > >> >>>>>>>>
>> > > > > > >> >>>>>>>>
>> > > > > > >> >>>>>>>> --
>> > > > > > >> >>>>>>>> --------------------------
>> > > > > > >> >>>>>>>>
>> > > > > > >> >>>>>>>> Benedetti Alessandro
>> > > > > > >> >>>>>>>> Visiting card :
>> http://about.me/alessandro_benedetti
>> > > > > > >> >>>>>>>>
>> > > > > > >> >>>>>>>> "Tyger, tyger burning bright
>> > > > > > >> >>>>>>>> In the forests of the night,
>> > > > > > >> >>>>>>>> What immortal hand or eye
>> > > > > > >> >>>>>>>> Could frame thy fearful symmetry?"
>> > > > > > >> >>>>>>>>
>> > > > > > >> >>>>>>>> William Blake - Songs of Experience -1794 England
>> > > > > > >> >>>>>>>
>> > > > > > >> >>>>>>>
>> > > > > > >> >>>>>>
>> > > > > > >> >>>>>
>> > > > > > >> >>>>
>> > > > > > >> >>>>
>> > > > > > >> >>>>
>> > > > > > >> >>>> --
>> > > > > > >> >>>> - - - - - - - - - - - - - - - - - - - - - - - - - - - -
>> - -
>> > > - -
>> > > > > > >> >>>> Shinichiro Abe
>> > > > > > >> >>>> 阿部 慎一朗
>> > > > > > >> >>>>
>> > > > > > >> >>
>> > > > > > >> >>
>> > > > > > >>
>> > > > > > >>
>> > > > > > >
>> > > > > >
>> > > > >
>> > > > >
>> > > > >
>> > > > > --
>> > > > > --------------------------
>> > > > >
>> > > > > Benedetti Alessandro
>> > > > > Visiting card : http://about.me/alessandro_benedetti
>> > > > >
>> > > > > "Tyger, tyger burning bright
>> > > > > In the forests of the night,
>> > > > > What immortal hand or eye
>> > > > > Could frame thy fearful symmetry?"
>> > > > >
>> > > > > William Blake - Songs of Experience -1794 England
>> > > > >
>> > > >
>> > >
>> > >
>> > >
>> > > --
>> > > --------------------------
>> > >
>> > > Benedetti Alessandro
>> > > Visiting card : http://about.me/alessandro_benedetti
>> > >
>> > > "Tyger, tyger burning bright
>> > > In the forests of the night,
>> > > What immortal hand or eye
>> > > Could frame thy fearful symmetry?"
>> > >
>> > > William Blake - Songs of Experience -1794 England
>> > >
>> >
>>
>>
>>
>> --
>> --------------------------
>>
>> Benedetti Alessandro
>> Visiting card : http://about.me/alessandro_benedetti
>>
>> "Tyger, tyger burning bright
>> In the forests of the night,
>> What immortal hand or eye
>> Could frame thy fearful symmetry?"
>>
>> William Blake - Songs of Experience -1794 England
>>
>
>

Re: Solr Extracting request handler

Posted by Karl Wright <da...@gmail.com>.
bq. I don't agree on this. Why is not appropriate for all the connectors ?

Some output connectors want the document in binary form -- e.g. the HDFS
and  FileSystem connectors, which don't deal with metadata at all.  It's
not clear whether the Tika transformer would preserve the binary stream, or
would replace the binary stream with an extracted content stream.  I'd
kind-of expect the latter, but there are other ways to do it, of course.
But it would certainly impact performance, so it should not be a
requirement.  Not only that, but there's no *reason* to make it a
requirement, since you can very readily add it or remove it from the
pipeline in the UI.
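As a rough sketch of the "replace the binary stream with an extracted content
stream" option -- using the plain Tika facade rather than any actual ManifoldCF
connector API, and with all names here purely illustrative:

import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

import org.apache.tika.Tika;
import org.apache.tika.metadata.Metadata;

// Illustrative only: the binary stream is consumed by Tika and replaced by a
// character-data stream, with extracted metadata collected on the side.
public class TikaReplaceSketch {
  public static InputStream toCharacterStream(InputStream binary, Metadata metadata)
      throws Exception {
    Tika tika = new Tika();
    // parseToString reads the binary stream and fills 'metadata' with what Tika
    // detects (content type, title, etc.); the original binary is not preserved.
    String text = tika.parseToString(binary, metadata);
    return new ByteArrayInputStream(text.getBytes(StandardCharsets.UTF_8));
  }
}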

bq. So what is the problem of using Tika outside Solr?

We've seen a number of cases where Tika inside Solr does things based on
(for instance) http headers that Solr receives.  Abe-san had some
difficulty with that a while back.  We had to repeatedly fix things when we
went to SolrJ to make sure various headers were compatible so that SolrCell
worked the same.  I'd rather not re-implement SolrCell precisely in
ManifoldCF if I can help it.

bq. Solr Extract is using Tika under the hood, nothing more.

It's more complicated than that.  Have a look at the code.

bq. probably a simple flag can fit to operate in one way or another.

I agree that that should be sufficient.

Karl



On Wed, Jun 18, 2014 at 11:35 AM, Alessandro Benedetti <
benedetti.alex85@gmail.com> wrote:

> 2014-06-18 16:10 GMT+01:00 Karl Wright <da...@gmail.com>:
>
> > Hi Alessandro,
> >
> > The reason for backwards compatibility is obvious: people upgrade
> > ManifoldCF all the time, and when they do it should not stop working for
> > them.
> >
> Ok i agree !
>
> >
> > Putting Tika all the time in the pipeline is also not appropriate for
> other
> > output connections.
>
>
> I don't agree on this. Why is not appropriate for all the connectors ?
> The conceptual responsibility of an output Connector should be to post a
> RespositoryDocument to an output ( whatever we want) .
> A RepositoryDocument is a map Field-> value.
> The content is nothing than a one of these fields.
> So I can not see why after we have a RepositoryDocument ( with content
> extracted) , should not be possible to send it independently to any
> OutputConnector.
>
>
> >  Even if you did it just for Solr, you'd then have to
> > insure that the Tika transformer was exactly compatible with Solr Cell,
> > which I would be very uncomfortable with agreeing to.
> >
>
> So what is the problem of using Tika outside Solr? We will add the most
> recent version of Tika, that will be gradually upgraded over time with the
> platform.
>
> Solr Extract is using Tika under the hood, nothing more.
>
>
>
> > So let's presume that you'd do one of two things.  Either:
> >
> > - Leave the existing Solr connector alone, and create a whole new Solr
> > connector designed to work with a Tika transformer, or
> > - Modify the existing Solr connector so that it operates in two possible
> > modes, one of which supports the legacy model (the default), and one of
> > which supports your new model
> >
>
> probably a simple flag can fit to operate in one way or another.
>
> >
> > If this sounds overly burdensome, I'm sorry but it's necessary until MCF
> > 2.0.  For MCF 2.0, which I've begun to think about, we can dispense with
> > backwards compatibility, including legacy tabs that have outlived their
> > usefulness, etc.  But that's not a 1.7 solution.
> >
> > Karl
> >
>
> Cheers
>
> >
> >
> >
> > On Wed, Jun 18, 2014 at 10:16 AM, Alessandro Benedetti <
> > benedetti.alex85@gmail.com> wrote:
> >
> > > Hello Karl,
> > > What i was thinking is:
> > > assuming we have the Tika Connector, the responsibility to extract
> > content
> > > will pass from Solr to the Tika processor.
> > >
> > > So we can change the part in the Solr Connector that manages the
> building
> > > of the request to send to the Extract update handler.
> > > Particularly that part will change in the classic way: usually it's
> good
> > to
> > > build a SolrDocument in SolrJ and then add it to SolrServer.
> > >
> > > Why should we give retrocompatibility from Solr Connector point of
> view ?
> > > From the user point of view, a Job will be selected with the Tika
> > Conenctor
> > > in the pipeline, so we are providing the same identical feature.
> > > One way can be to make the Tika Processor Connector by default in the
> > > pipeline, and someone will be able to deactivate it only if needed.
> > >
> > > Cheers
> > >
> > >
> > >
> > > 2014-06-18 14:32 GMT+01:00 Karl Wright <da...@gmail.com>:
> > >
> > > > Hi Alessandro,
> > > > What is your concrete proposal to change the Solr connector?  Bear in
> > > mind
> > > > that we do need to maintain backwards compatibility.  If you list
> your
> > > > specific changes, not in any huge detail, but with enough detail that
> > we
> > > > understand your proposal, that would help.  What happens to the UI?
> >  What
> > > > happens to the internals?
> > > >
> > > > Thanks,
> > > > Karl
> > > >
> > > >
> > > >
> > > > On Wed, Jun 18, 2014 at 9:21 AM, Alessandro Benedetti <
> > > > benedetti.alex85@gmail.com> wrote:
> > > >
> > > > > But guys, why not simply pass to a classic SolrJ SolrDocument
> > creation
> > > > and
> > > > > ingestion in the Solr Server ? Easy and Straighforward !
> > > > >
> > > > > In the end at that point the RepositoryDocument will me only a Map
> of
> > > > > metadata and values.
> > > > > Content will be part of that, so I guess the conversion to a
> > > SolrDocument
> > > > > will be immediate.
> > > > >
> > > > > Cheers
> > > > >
> > > > >
> > > > > 2014-06-18 3:26 GMT+01:00 Karl Wright <da...@gmail.com>:
> > > > >
> > > > > > Hi Abe-san,
> > > > > >
> > > > > > Near as I can tell, the major consumer of disk space is the Maven
> > > > target
> > > > > > directories.  This is generating many tens of megabytes of
> > temporary
> > > > disk
> > > > > > usage for every connector.  Luckily if you use ant, this is not a
> > > > > problem.
> > > > > >
> > > > > > Karl
> > > > > >
> > > > > >
> > > > > > On Tue, Jun 17, 2014 at 9:55 PM, Karl Wright <daddywri@gmail.com
> >
> > > > wrote:
> > > > > >
> > > > > > > Hi Abe-san,
> > > > > > >
> > > > > > > Tika jars are not very big:
> > > > > > >
> > > > > > > C:\wip\mcf\trunk\lib>dir tika*
> > > > > > >  Volume in drive C has no label.
> > > > > > >  Volume Serial Number is 002E-D1F0
> > > > > > >
> > > > > > >  Directory of C:\wip\mcf\trunk\lib
> > > > > > >
> > > > > > > 06/05/2014  08:21 AM           493,374 tika-core.jar
> > > > > > > 06/05/2014  08:21 AM           523,677 tika-parsers.jar
> > > > > > >                2 File(s)      1,017,051 bytes
> > > > > > >                0 Dir(s)  140,792,315,904 bytes free
> > > > > > >
> > > > > > > The entire lib directory is 85M:
> > > > > > >
> > > > > > > 85,156,330 bytes
> > > > > > >
> > > > > > > The built binary image is still about 185Mb, I believe.  So I
> > don't
> > > > > know
> > > > > > > why you think it is >1Gb?  Temporary class files?  I don't
> think
> > we
> > > > can
> > > > > > > avoid those.
> > > > > > >
> > > > > > > I'd rather not make things more complicated than they need to
> be
> > by
> > > > > > adding
> > > > > > > a new required service - even though it would fit naturally
> with
> > > the
> > > > > > > connector arrangement.
> > > > > > >
> > > > > > > Karl
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > On Tue, Jun 17, 2014 at 9:42 PM, Shinichiro Abe <
> > > > > > > shinichiro.abe.1@gmail.com> wrote:
> > > > > > >
> > > > > > >> Hi Karl,
> > > > > > >>
> > > > > > >> Okay, I assumed Tika connector outputs files.
> > > > > > >> If we post character data metadata got from Tika,
> > > "/update/extract"
> > > > > > >> handler
> > > > > > >> can handle this(provides params:
> > > > > > >> literal.content=value&literal.metaField=foobar
> > > > > > >> with using NullInputStream for binary data like
> CONNECTORS-936).
> > > > > > >>
> > > > > > >> BTW, now trunk built size is too big(1G+). Maybe because
> > > CloudSearch
> > > > > > >> connector uses Tika jars.
> > > > > > >> Tika connector and CloudSearch connector should extract text
> via
> > > > > > >> tika-server[1]
> > > > > > >> and MCF should not have many Tika jars, do you think?
> > > > > > >>
> > > > > > >> [1]
> > > > > > >> http://wiki.apache.org/tika/TikaJAXRS
> > > > > > >>
> > > > > > >> Thanks,
> > > > > > >> Shinichiro Abe
> > > > > > >>
> > > > > > >> On 2014/06/18, at 9:45, Karl Wright <da...@gmail.com>
> wrote:
> > > > > > >>
> > > > > > >> > Hi Abe-san,
> > > > > > >> >
> > > > > > >> > It sounds like you might be thinking that transformation
> > > > connectors
> > > > > > are
> > > > > > >> > like output connectors.  Just so we are clear,
> transformation
> > > > > > >> connectors in
> > > > > > >> > 1.7 receive a RepositoryDocument as input, and then pass a
> > > > > > >> > RepositoryDocument on to the next connector in the chain.
>  So
> > I
> > > > > don't
> > > > > > >> know
> > > > > > >> > why .xml files would be involved.  I'd expect the Tika
> > connector
> > > > to
> > > > > > >> read a
> > > > > > >> > binary file from one RepositoryDocument object and convert
> its
> > > > > > contents
> > > > > > >> to
> > > > > > >> > another RepositoryDocument object which would have character
> > > data
> > > > > and
> > > > > > >> > metadata only.  Would this work for your case, do you think?
> > > > > > >> >
> > > > > > >> > Karl
> > > > > > >> >
> > > > > > >> >
> > > > > > >> >
> > > > > > >> > On Tue, Jun 17, 2014 at 8:38 PM, Shinichiro Abe <
> > > > > > >> shinichiro.abe.1@gmail.com>
> > > > > > >> > wrote:
> > > > > > >> >
> > > > > > >> >> Hi Karl,
> > > > > > >> >>
> > > > > > >> >> Yes. I thought the standard update handler met that
> > > requirement.
> > > > > > >> >> For instance, Tika extractor transformation connector
> creates
> > > two
> > > > > > >> files.
> > > > > > >> >> 1. addtoSolr.xml for add and update
> > > > > > >> >> 2. deletetoSolr.xml for delete
> > > > > > >> >> File connector ingests these xml files, then Solr connector
> > > posts
> > > > > > these
> > > > > > >> >> files by "/update" handler.
> > > > > > >> >>
> > > > > > >> >> In the the Solr Connector, other function as to update
> > handler
> > > > > > >> >> might not be necessary except for  "/update" handler.
> > > > > > >> >>
> > > > > > >> >> Thanks,
> > > > > > >> >> Shinichiro Abe
> > > > > > >> >>
> > > > > > >> >> On 2014/06/18, at 8:02, Karl Wright <da...@gmail.com>
> > > wrote:
> > > > > > >> >>
> > > > > > >> >>> Hi Abe-san,
> > > > > > >> >>>
> > > > > > >> >>> So just to be sure -- you believe that no changes at all
> are
> > > > > > required
> > > > > > >> to
> > > > > > >> >>> the Solr Connector as it stands now, other than to use the
> > > > update
> > > > > > >> handler
> > > > > > >> >>> rather than the /update/extract handler?
> > > > > > >> >>>
> > > > > > >> >>> Karl
> > > > > > >> >>>
> > > > > > >> >>>
> > > > > > >> >>>
> > > > > > >> >>>
> > > > > > >> >>>
> > > > > > >> >>> On Tue, Jun 17, 2014 at 5:14 PM, Shinichiro Abe <
> > > > > > >> >> shinichiro.abe.1@gmail.com>
> > > > > > >> >>> wrote:
> > > > > > >> >>>
> > > > > > >> >>>>> As for changing the Solr connector so that it doesn't go
> > to
> > > > the
> > > > > > >> >> extracting
> > > > > > >> >>>> update handler
> > > > > > >> >>>>
> > > > > > >> >>>> I don't think it needs to change Solr connector with new
> > > > checkbox
> > > > > > >> >> because
> > > > > > >> >>>> currently we can change "/update/extract" into "/update"
> at
> > > > > 'Update
> > > > > > >> >>>> Handler' at Paths tab in Solr connector UI. I confirmed I
> > > could
> > > > > > post
> > > > > > >> >> CSV,
> > > > > > >> >>>> JSON and XML files to Solr by changing that and using
> File
> > > > > > connector.
> > > > > > >> >> So I
> > > > > > >> >>>> wish we allow Tika extractor transformation connector to
> > > create
> > > > > XML
> > > > > > >> >> files
> > > > > > >> >>>> that Solr expects to see.
> > > > > > >> >>>>
> > > > > > >> >>>> Regards,
> > > > > > >> >>>> Shinichiro Abe
> > > > > > >> >>>>
> > > > > > >> >>>>
> > > > > > >> >>>> 2014-06-18 2:55 GMT+09:00 Karl Wright <
> daddywri@gmail.com
> > >:
> > > > > > >> >>>>
> > > > > > >> >>>>> The pipeline code itself is now "complete" in trunk.
> >  Zaizi
> > > > said
> > > > > > >> they'd
> > > > > > >> >>>>> contribute a Tika extractor transformation connector -
> and
> > > if
> > > > > they
> > > > > > >> >> don't
> > > > > > >> >>>>> get around to that in a month or so, I may take a crack
> at
> > > it
> > > > > > >> myself.
> > > > > > >> >>>>>
> > > > > > >> >>>>> As for changing the Solr connector so that it doesn't go
> > to
> > > > the
> > > > > > >> >>>> extracting
> > > > > > >> >>>>> update handler, it would be great if:
> > > > > > >> >>>>> (1) Someone created a ticket for this, and
> > > > > > >> >>>>> (2) A patch was provided that maintains backwards
> > > > compatibility
> > > > > > with
> > > > > > >> >>>>> previous versions of the connector (so a checkbox would
> > > > probably
> > > > > > >> need
> > > > > > >> >> to
> > > > > > >> >>>> go
> > > > > > >> >>>>> into the UI somewhere).  Do either of you want to start
> > this
> > > > > > >> process?
> > > > > > >> >>>>>
> > > > > > >> >>>>> Thanks!
> > > > > > >> >>>>> Karl
> > > > > > >> >>>>>
> > > > > > >> >>>>>
> > > > > > >> >>>>>
> > > > > > >> >>>>> On Mon, Jun 16, 2014 at 12:37 PM, Karl Wright <
> > > > > daddywri@gmail.com
> > > > > > >
> > > > > > >> >>>> wrote:
> > > > > > >> >>>>>
> > > > > > >> >>>>>> Hi guys,
> > > > > > >> >>>>>>
> > > > > > >> >>>>>> You folks may not have looked at 1.7 yet, but it has a
> > full
> > > > > > >> pipeline,
> > > > > > >> >>>> and
> > > > > > >> >>>>>> is expected to have a Tika extractor as a
> transformation
> > > > > > connector.
> > > > > > >> >>>>>>
> > > > > > >> >>>>>> Karl
> > > > > > >> >>>>>>
> > > > > > >> >>>>>>
> > > > > > >> >>>>>>
> > > > > > >> >>>>>> On Mon, Jun 16, 2014 at 11:14 AM, Matteo Grolla <
> > > > > > >> >>>>> m.grolla@sourcesense.com>
> > > > > > >> >>>>>> wrote:
> > > > > > >> >>>>>>
> > > > > > >> >>>>>>> Thanks Alessandro,
> > > > > > >> >>>>>>>       that explains the situation clearly.
> > > > > > >> >>>>>>> And I agree that sending all the metadata as get
> > parameter
> > > > can
> > > > > > be
> > > > > > >> >>>>>>> problematic
> > > > > > >> >>>>>>>
> > > > > > >> >>>>>>> Cheers
> > > > > > >> >>>>>>>
> > > > > > >> >>>>>>> --
> > > > > > >> >>>>>>> Matteo Grolla
> > > > > > >> >>>>>>> Sourcesense - making sense of Open Source
> > > > > > >> >>>>>>> http://www.sourcesense.com
> > > > > > >> >>>>>>>
> > > > > > >> >>>>>>> Il giorno 16/giu/2014, alle ore 17:09, Alessandro
> > > Benedetti
> > > > ha
> > > > > > >> >>>> scritto:
> > > > > > >> >>>>>>>
> > > > > > >> >>>>>>>> mmmm the point is that right now ManifoldCF has no
> > > > > extractors.
> > > > > > >> >>>>>>>> The Repository connectors extracts directly the
> binary
> > > and
> > > > > > there
> > > > > > >> is
> > > > > > >> >>>> no
> > > > > > >> >>>>>>>> "Extractor Processor" yet.
> > > > > > >> >>>>>>>> But recently a pipe-line processor architecture has
> > been
> > > > > > thought
> > > > > > >> (
> > > > > > >> >>>>>>>> https://issues.apache.org/jira/browse/CONNECTORS-959
> )
> > > > > > >> >>>>>>>> So can fit there.
> > > > > > >> >>>>>>>>
> > > > > > >> >>>>>>>> Cheers
> > > > > > >> >>>>>>>>
> > > > > > >> >>>>>>>>
> > > > > > >> >>>>>>>> 2014-06-16 15:59 GMT+01:00 Matteo Grolla <
> > > > > > >> m.grolla@sourcesense.com
> > > > > > >> >>>>> :
> > > > > > >> >>>>>>>>
> > > > > > >> >>>>>>>>> Since Solr extracting request handler takes the
> binary
> > > and
> > > > > > >> extracts
> > > > > > >> >>>>>>> text
> > > > > > >> >>>>>>>>> what is the point of not using Manifold extractor
> and
> > > send
> > > > > > text
> > > > > > >> and
> > > > > > >> >>>>>>>>> binaries to solr?
> > > > > > >> >>>>>>>>> I mean the end result is the same solr indexes text
> > and
> > > > > stores
> > > > > > >> text
> > > > > > >> >>>>>>>>> So if manifold supports text extraction it seems me
> > this
> > > > is
> > > > > > the
> > > > > > >> >>>> place
> > > > > > >> >>>>>>>>> where it should be done
> > > > > > >> >>>>>>>>>
> > > > > > >> >>>>>>>>> --
> > > > > > >> >>>>>>>>> Matteo Grolla
> > > > > > >> >>>>>>>>> Sourcesense - making sense of Open Source
> > > > > > >> >>>>>>>>> http://www.sourcesense.com
> > > > > > >> >>>>>>>>>
> > > > > > >> >>>>>>>>> Il giorno 16/giu/2014, alle ore 16:51, Antonio David
> > > Perez
> > > > > > >> Morales
> > > > > > >> >>>> ha
> > > > > > >> >>>>>>>>> scritto:
> > > > > > >> >>>>>>>>>
> > > > > > >> >>>>>>>>>> Hi Matteo
> > > > > > >> >>>>>>>>>>
> > > > > > >> >>>>>>>>>> Manifold already handles the extraction, but the
> only
> > > way
> > > > > to
> > > > > > >> send
> > > > > > >> >>>>>>> binary
> > > > > > >> >>>>>>>>>> content and document metadata to Solr is using the
> > > > > > >> update/extract
> > > > > > >> >>>>>>>>> handler,
> > > > > > >> >>>>>>>>>> where the metadata is sent as query parameters and
> > the
> > > > > binary
> > > > > > >> >>>>> content
> > > > > > >> >>>>>>> is
> > > > > > >> >>>>>>>>>> sent in the body of the requests, allowing Solr to
> > use
> > > > Tika
> > > > > > to
> > > > > > >> >>>>> obtain
> > > > > > >> >>>>>>> the
> > > > > > >> >>>>>>>>>> raw content to be stored in Solr.
> > > > > > >> >>>>>>>>>>
> > > > > > >> >>>>>>>>>> Regards
> > > > > > >> >>>>>>>>>>
> > > > > > >> >>>>>>>>>>
> > > > > > >> >>>>>>>>>> On Mon, Jun 16, 2014 at 4:35 PM, Matteo Grolla <
> > > > > > >> >>>>>>> m.grolla@sourcesense.com
> > > > > > >> >>>>>>>>>>
> > > > > > >> >>>>>>>>>> wrote:
> > > > > > >> >>>>>>>>>>
> > > > > > >> >>>>>>>>>>> Hi During my first indexing I noticed that
> manifold
> > > uses
> > > > > > Solr
> > > > > > >> >>>>>>> extracting
> > > > > > >> >>>>>>>>>>> request handler to extract the content of an xml
> > file
> > > > > > >> >>>>>>>>>>> For performance reasons it would be better if
> > Manifold
> > > > > > handled
> > > > > > >> >>>> the
> > > > > > >> >>>>>>>>>>> extraction letting Solr do the search engine
> > > > > > >> >>>>>>>>>>> Is this because of the connector design, framework
> > > > design
> > > > > or
> > > > > > >> just
> > > > > > >> >>>>> to
> > > > > > >> >>>>>>> be
> > > > > > >> >>>>>>>>>>> done?
> > > > > > >> >>>>>>>>>>>
> > > > > > >> >>>>>>>>>>> --
> > > > > > >> >>>>>>>>>>> Matteo Grolla
> > > > > > >> >>>>>>>>>>> Sourcesense - making sense of Open Source
> > > > > > >> >>>>>>>>>>> http://www.sourcesense.com
> > > > > > >> >>>>>>>>>>>
> > > > > > >> >>>>>>>>>>>
> > > > > > >> >>>>>>>>>>
> > > > > > >> >>>>>>>>>> --
> > > > > > >> >>>>>>>>>>
> > > > > > >> >>>>>>>>>> ------------------------------
> > > > > > >> >>>>>>>>>> This message should be regarded as confidential. If
> > you
> > > > > have
> > > > > > >> >>>>> received
> > > > > > >> >>>>>>>>> this
> > > > > > >> >>>>>>>>>> email in error please notify the sender and destroy
> > it
> > > > > > >> >>>> immediately.
> > > > > > >> >>>>>>>>>> Statements of intent shall only become binding when
> > > > > confirmed
> > > > > > >> in
> > > > > > >> >>>>> hard
> > > > > > >> >>>>>>>>> copy
> > > > > > >> >>>>>>>>>> by an authorised signatory.
> > > > > > >> >>>>>>>>>>
> > > > > > >> >>>>>>>>>> Zaizi Ltd is registered in England and Wales with
> the
> > > > > > >> registration
> > > > > > >> >>>>>>> number
> > > > > > >> >>>>>>>>>> 6440931. The Registered Office is Brook House, 229
> > > > > Shepherds
> > > > > > >> Bush
> > > > > > >> >>>>>>> Road,
> > > > > > >> >>>>>>>>>> London W6 7AN.
> > > > > > >> >>>>>>>>>
> > > > > > >> >>>>>>>>>
> > > > > > >> >>>>>>>>
> > > > > > >> >>>>>>>>
> > > > > > >> >>>>>>>> --
> > > > > > >> >>>>>>>> --------------------------
> > > > > > >> >>>>>>>>
> > > > > > >> >>>>>>>> Benedetti Alessandro
> > > > > > >> >>>>>>>> Visiting card : http://about.me/alessandro_benedetti
> > > > > > >> >>>>>>>>
> > > > > > >> >>>>>>>> "Tyger, tyger burning bright
> > > > > > >> >>>>>>>> In the forests of the night,
> > > > > > >> >>>>>>>> What immortal hand or eye
> > > > > > >> >>>>>>>> Could frame thy fearful symmetry?"
> > > > > > >> >>>>>>>>
> > > > > > >> >>>>>>>> William Blake - Songs of Experience -1794 England
> > > > > > >> >>>>>>>
> > > > > > >> >>>>>>>
> > > > > > >> >>>>>>
> > > > > > >> >>>>>
> > > > > > >> >>>>
> > > > > > >> >>>>
> > > > > > >> >>>>
> > > > > > >> >>>> --
> > > > > > >> >>>> - - - - - - - - - - - - - - - - - - - - - - - - - - - -
> - -
> > > - -
> > > > > > >> >>>> Shinichiro Abe
> > > > > > >> >>>> 阿部 慎一朗
> > > > > > >> >>>>
> > > > > > >> >>
> > > > > > >> >>
> > > > > > >>
> > > > > > >>
> > > > > > >
> > > > > >
> > > > >
> > > > >
> > > > >
> > > > > --
> > > > > --------------------------
> > > > >
> > > > > Benedetti Alessandro
> > > > > Visiting card : http://about.me/alessandro_benedetti
> > > > >
> > > > > "Tyger, tyger burning bright
> > > > > In the forests of the night,
> > > > > What immortal hand or eye
> > > > > Could frame thy fearful symmetry?"
> > > > >
> > > > > William Blake - Songs of Experience -1794 England
> > > > >
> > > >
> > >
> > >
> > >
> > > --
> > > --------------------------
> > >
> > > Benedetti Alessandro
> > > Visiting card : http://about.me/alessandro_benedetti
> > >
> > > "Tyger, tyger burning bright
> > > In the forests of the night,
> > > What immortal hand or eye
> > > Could frame thy fearful symmetry?"
> > >
> > > William Blake - Songs of Experience -1794 England
> > >
> >
>
>
>
> --
> --------------------------
>
> Benedetti Alessandro
> Visiting card : http://about.me/alessandro_benedetti
>
> "Tyger, tyger burning bright
> In the forests of the night,
> What immortal hand or eye
> Could frame thy fearful symmetry?"
>
> William Blake - Songs of Experience -1794 England
>

Re: Solr Extracting request handler

Posted by Alessandro Benedetti <be...@gmail.com>.
2014-06-18 16:10 GMT+01:00 Karl Wright <da...@gmail.com>:

> Hi Alessandro,
>
> The reason for backwards compatibility is obvious: people upgrade
> ManifoldCF all the time, and when they do it should not stop working for
> them.
>
Ok, I agree!

>
> Putting Tika all the time in the pipeline is also not appropriate for other
> output connections.


I don't agree on this. Why is it not appropriate for all the connectors?
The conceptual responsibility of an output connector should be to post a
RepositoryDocument to an output (whatever we want).
A RepositoryDocument is a map of field -> value.
The content is nothing more than one of these fields.
So I cannot see why, once we have a RepositoryDocument (with content
extracted), it should not be possible to send it independently to any
OutputConnector.
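A minimal sketch of that idea, assuming the pipeline has already reduced the
document to a field -> values map (the Solr URL and field handling here are
illustrative only, not the actual connector code):

import java.util.List;
import java.util.Map;

import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

// Illustrative only: a plain SolrJ add, where the extracted content is just
// another field of the document map.
public class SolrJAddSketch {
  public static void index(Map<String, List<String>> fields) throws Exception {
    SolrServer solr = new HttpSolrServer("http://localhost:8983/solr/collection1");
    SolrInputDocument doc = new SolrInputDocument();
    for (Map.Entry<String, List<String>> field : fields.entrySet()) {
      for (String value : field.getValue()) {
        doc.addField(field.getKey(), value);
      }
    }
    solr.add(doc);
    solr.commit();
  }
}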


>  Even if you did it just for Solr, you'd then have to
> insure that the Tika transformer was exactly compatible with Solr Cell,
> which I would be very uncomfortable with agreeing to.
>

So what is the problem with using Tika outside Solr? We will add the most
recent version of Tika, which will be gradually upgraded over time with the
platform.

Solr Extract is using Tika under the hood, nothing more.



> So let's presume that you'd do one of two things.  Either:
>
> - Leave the existing Solr connector alone, and create a whole new Solr
> connector designed to work with a Tika transformer, or
> - Modify the existing Solr connector so that it operates in two possible
> modes, one of which supports the legacy model (the default), and one of
> which supports your new model
>

Probably a simple flag would be enough to operate in one mode or the other.
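A minimal sketch of how such a flag could work -- the flag name and field names
are invented here, and the real patch would wire this to a UI checkbox and the
connector's existing request plumbing:

import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.request.ContentStreamUpdateRequest;
import org.apache.solr.common.SolrInputDocument;
import org.apache.solr.common.util.ContentStreamBase;

// Illustrative only: one connector, two operating modes controlled by a flag.
public class TwoModeSolrSketch {
  public static void send(SolrServer solr, boolean useExtractHandler,
      String documentId, String rawContent, String extractedText) throws Exception {
    if (useExtractHandler) {
      // Legacy mode: post raw content to Solr Cell, metadata as literal.* params.
      ContentStreamUpdateRequest req = new ContentStreamUpdateRequest("/update/extract");
      req.addContentStream(new ContentStreamBase.StringStream(rawContent));
      req.setParam("literal.id", documentId);
      solr.request(req);
    } else {
      // New mode: extraction already done upstream (e.g. by a Tika transformer),
      // so a classic SolrJ add is enough.
      SolrInputDocument doc = new SolrInputDocument();
      doc.addField("id", documentId);
      doc.addField("content", extractedText);
      solr.add(doc);
    }
  }
}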

>
> If this sounds overly burdensome, I'm sorry but it's necessary until MCF
> 2.0.  For MCF 2.0, which I've begun to think about, we can dispense with
> backwards compatibility, including legacy tabs that have outlived their
> usefulness, etc.  But that's not a 1.7 solution.
>
> Karl
>

Cheers

>
>
>
> On Wed, Jun 18, 2014 at 10:16 AM, Alessandro Benedetti <
> benedetti.alex85@gmail.com> wrote:
>
> > Hello Karl,
> > What i was thinking is:
> > assuming we have the Tika Connector, the responsibility to extract
> content
> > will pass from Solr to the Tika processor.
> >
> > So we can change the part in the Solr Connector that manages the building
> > of the request to send to the Extract update handler.
> > Particularly that part will change in the classic way: usually it's good
> to
> > build a SolrDocument in SolrJ and then add it to SolrServer.
> >
> > Why should we give retrocompatibility from Solr Connector point of view ?
> > From the user point of view, a Job will be selected with the Tika
> Conenctor
> > in the pipeline, so we are providing the same identical feature.
> > One way can be to make the Tika Processor Connector by default in the
> > pipeline, and someone will be able to deactivate it only if needed.
> >
> > Cheers
> >
> >
> >
> > 2014-06-18 14:32 GMT+01:00 Karl Wright <da...@gmail.com>:
> >
> > > Hi Alessandro,
> > > What is your concrete proposal to change the Solr connector?  Bear in
> > mind
> > > that we do need to maintain backwards compatibility.  If you list your
> > > specific changes, not in any huge detail, but with enough detail that
> we
> > > understand your proposal, that would help.  What happens to the UI?
>  What
> > > happens to the internals?
> > >
> > > Thanks,
> > > Karl
> > >
> > >
> > >
> > > On Wed, Jun 18, 2014 at 9:21 AM, Alessandro Benedetti <
> > > benedetti.alex85@gmail.com> wrote:
> > >
> > > > But guys, why not simply pass to a classic SolrJ SolrDocument
> creation
> > > and
> > > > ingestion in the Solr Server ? Easy and Straighforward !
> > > >
> > > > In the end at that point the RepositoryDocument will me only a Map of
> > > > metadata and values.
> > > > Content will be part of that, so I guess the conversion to a
> > SolrDocument
> > > > will be immediate.
> > > >
> > > > Cheers
> > > >
> > > >
> > > > 2014-06-18 3:26 GMT+01:00 Karl Wright <da...@gmail.com>:
> > > >
> > > > > Hi Abe-san,
> > > > >
> > > > > Near as I can tell, the major consumer of disk space is the Maven
> > > target
> > > > > directories.  This is generating many tens of megabytes of
> temporary
> > > disk
> > > > > usage for every connector.  Luckily if you use ant, this is not a
> > > > problem.
> > > > >
> > > > > Karl
> > > > >
> > > > >
> > > > > On Tue, Jun 17, 2014 at 9:55 PM, Karl Wright <da...@gmail.com>
> > > wrote:
> > > > >
> > > > > > Hi Abe-san,
> > > > > >
> > > > > > Tika jars are not very big:
> > > > > >
> > > > > > C:\wip\mcf\trunk\lib>dir tika*
> > > > > >  Volume in drive C has no label.
> > > > > >  Volume Serial Number is 002E-D1F0
> > > > > >
> > > > > >  Directory of C:\wip\mcf\trunk\lib
> > > > > >
> > > > > > 06/05/2014  08:21 AM           493,374 tika-core.jar
> > > > > > 06/05/2014  08:21 AM           523,677 tika-parsers.jar
> > > > > >                2 File(s)      1,017,051 bytes
> > > > > >                0 Dir(s)  140,792,315,904 bytes free
> > > > > >
> > > > > > The entire lib directory is 85M:
> > > > > >
> > > > > > 85,156,330 bytes
> > > > > >
> > > > > > The built binary image is still about 185Mb, I believe.  So I
> don't
> > > > know
> > > > > > why you think it is >1Gb?  Temporary class files?  I don't think
> we
> > > can
> > > > > > avoid those.
> > > > > >
> > > > > > I'd rather not make things more complicated than they need to be
> by
> > > > > adding
> > > > > > a new required service - even though it would fit naturally with
> > the
> > > > > > connector arrangement.
> > > > > >
> > > > > > Karl
> > > > > >
> > > > > >
> > > > > >
> > > > > >
> > > > > >
> > > > > > On Tue, Jun 17, 2014 at 9:42 PM, Shinichiro Abe <
> > > > > > shinichiro.abe.1@gmail.com> wrote:
> > > > > >
> > > > > >> Hi Karl,
> > > > > >>
> > > > > >> Okay, I assumed Tika connector outputs files.
> > > > > >> If we post character data metadata got from Tika,
> > "/update/extract"
> > > > > >> handler
> > > > > >> can handle this(provides params:
> > > > > >> literal.content=value&literal.metaField=foobar
> > > > > >> with using NullInputStream for binary data like CONNECTORS-936).
> > > > > >>
> > > > > >> BTW, now trunk built size is too big(1G+). Maybe because
> > CloudSearch
> > > > > >> connector uses Tika jars.
> > > > > >> Tika connector and CloudSearch connector should extract text via
> > > > > >> tika-server[1]
> > > > > >> and MCF should not have many Tika jars, do you think?
> > > > > >>
> > > > > >> [1]
> > > > > >> http://wiki.apache.org/tika/TikaJAXRS
> > > > > >>
> > > > > >> Thanks,
> > > > > >> Shinichiro Abe
> > > > > >>
> > > > > >> On 2014/06/18, at 9:45, Karl Wright <da...@gmail.com> wrote:
> > > > > >>
> > > > > >> > Hi Abe-san,
> > > > > >> >
> > > > > >> > It sounds like you might be thinking that transformation
> > > connectors
> > > > > are
> > > > > >> > like output connectors.  Just so we are clear, transformation
> > > > > >> connectors in
> > > > > >> > 1.7 receive a RepositoryDocument as input, and then pass a
> > > > > >> > RepositoryDocument on to the next connector in the chain.  So
> I
> > > > don't
> > > > > >> know
> > > > > >> > why .xml files would be involved.  I'd expect the Tika
> connector
> > > to
> > > > > >> read a
> > > > > >> > binary file from one RepositoryDocument object and convert its
> > > > > contents
> > > > > >> to
> > > > > >> > another RepositoryDocument object which would have character
> > data
> > > > and
> > > > > >> > metadata only.  Would this work for your case, do you think?
> > > > > >> >
> > > > > >> > Karl
> > > > > >> >
> > > > > >> >
> > > > > >> >
> > > > > >> > On Tue, Jun 17, 2014 at 8:38 PM, Shinichiro Abe <
> > > > > >> shinichiro.abe.1@gmail.com>
> > > > > >> > wrote:
> > > > > >> >
> > > > > >> >> Hi Karl,
> > > > > >> >>
> > > > > >> >> Yes. I thought the standard update handler met that
> > requirement.
> > > > > >> >> For instance, Tika extractor transformation connector creates
> > two
> > > > > >> files.
> > > > > >> >> 1. addtoSolr.xml for add and update
> > > > > >> >> 2. deletetoSolr.xml for delete
> > > > > >> >> File connector ingests these xml files, then Solr connector
> > posts
> > > > > these
> > > > > >> >> files by "/update" handler.
> > > > > >> >>
> > > > > >> >> In the the Solr Connector, other function as to update
> handler
> > > > > >> >> might not be necessary except for  "/update" handler.
> > > > > >> >>
> > > > > >> >> Thanks,
> > > > > >> >> Shinichiro Abe
> > > > > >> >>
> > > > > >> >> On 2014/06/18, at 8:02, Karl Wright <da...@gmail.com>
> > wrote:
> > > > > >> >>
> > > > > >> >>> Hi Abe-san,
> > > > > >> >>>
> > > > > >> >>> So just to be sure -- you believe that no changes at all are
> > > > > required
> > > > > >> to
> > > > > >> >>> the Solr Connector as it stands now, other than to use the
> > > update
> > > > > >> handler
> > > > > >> >>> rather than the /update/extract handler?
> > > > > >> >>>
> > > > > >> >>> Karl
> > > > > >> >>>
> > > > > >> >>>
> > > > > >> >>>
> > > > > >> >>>
> > > > > >> >>>
> > > > > >> >>> On Tue, Jun 17, 2014 at 5:14 PM, Shinichiro Abe <
> > > > > >> >> shinichiro.abe.1@gmail.com>
> > > > > >> >>> wrote:
> > > > > >> >>>
> > > > > >> >>>>> As for changing the Solr connector so that it doesn't go
> to
> > > the
> > > > > >> >> extracting
> > > > > >> >>>> update handler
> > > > > >> >>>>
> > > > > >> >>>> I don't think it needs to change Solr connector with new
> > > checkbox
> > > > > >> >> because
> > > > > >> >>>> currently we can change "/update/extract" into "/update" at
> > > > 'Update
> > > > > >> >>>> Handler' at Paths tab in Solr connector UI. I confirmed I
> > could
> > > > > post
> > > > > >> >> CSV,
> > > > > >> >>>> JSON and XML files to Solr by changing that and using File
> > > > > connector.
> > > > > >> >> So I
> > > > > >> >>>> wish we allow Tika extractor transformation connector to
> > create
> > > > XML
> > > > > >> >> files
> > > > > >> >>>> that Solr expects to see.
> > > > > >> >>>>
> > > > > >> >>>> Regards,
> > > > > >> >>>> Shinichiro Abe
> > > > > >> >>>>
> > > > > >> >>>>
> > > > > >> >>>> 2014-06-18 2:55 GMT+09:00 Karl Wright <daddywri@gmail.com
> >:
> > > > > >> >>>>
> > > > > >> >>>>> The pipeline code itself is now "complete" in trunk.
>  Zaizi
> > > said
> > > > > >> they'd
> > > > > >> >>>>> contribute a Tika extractor transformation connector - and
> > if
> > > > they
> > > > > >> >> don't
> > > > > >> >>>>> get around to that in a month or so, I may take a crack at
> > it
> > > > > >> myself.
> > > > > >> >>>>>
> > > > > >> >>>>> As for changing the Solr connector so that it doesn't go
> to
> > > the
> > > > > >> >>>> extracting
> > > > > >> >>>>> update handler, it would be great if:
> > > > > >> >>>>> (1) Someone created a ticket for this, and
> > > > > >> >>>>> (2) A patch was provided that maintains backwards
> > > compatibility
> > > > > with
> > > > > >> >>>>> previous versions of the connector (so a checkbox would
> > > probably
> > > > > >> need
> > > > > >> >> to
> > > > > >> >>>> go
> > > > > >> >>>>> into the UI somewhere).  Do either of you want to start
> this
> > > > > >> process?
> > > > > >> >>>>>
> > > > > >> >>>>> Thanks!
> > > > > >> >>>>> Karl
> > > > > >> >>>>>
> > > > > >> >>>>>
> > > > > >> >>>>>
> > > > > >> >>>>> On Mon, Jun 16, 2014 at 12:37 PM, Karl Wright <
> > > > daddywri@gmail.com
> > > > > >
> > > > > >> >>>> wrote:
> > > > > >> >>>>>
> > > > > >> >>>>>> Hi guys,
> > > > > >> >>>>>>
> > > > > >> >>>>>> You folks may not have looked at 1.7 yet, but it has a
> full
> > > > > >> pipeline,
> > > > > >> >>>> and
> > > > > >> >>>>>> is expected to have a Tika extractor as a transformation
> > > > > connector.
> > > > > >> >>>>>>
> > > > > >> >>>>>> Karl
> > > > > >> >>>>>>
> > > > > >> >>>>>>
> > > > > >> >>>>>>
> > > > > >> >>>>>> On Mon, Jun 16, 2014 at 11:14 AM, Matteo Grolla <
> > > > > >> >>>>> m.grolla@sourcesense.com>
> > > > > >> >>>>>> wrote:
> > > > > >> >>>>>>
> > > > > >> >>>>>>> Thanks Alessandro,
> > > > > >> >>>>>>>       that explains the situation clearly.
> > > > > >> >>>>>>> And I agree that sending all the metadata as get
> parameter
> > > can
> > > > > be
> > > > > >> >>>>>>> problematic
> > > > > >> >>>>>>>
> > > > > >> >>>>>>> Cheers
> > > > > >> >>>>>>>
> > > > > >> >>>>>>> --
> > > > > >> >>>>>>> Matteo Grolla
> > > > > >> >>>>>>> Sourcesense - making sense of Open Source
> > > > > >> >>>>>>> http://www.sourcesense.com
> > > > > >> >>>>>>>
> > > > > >> >>>>>>> Il giorno 16/giu/2014, alle ore 17:09, Alessandro
> > Benedetti
> > > ha
> > > > > >> >>>> scritto:
> > > > > >> >>>>>>>
> > > > > >> >>>>>>>> mmmm the point is that right now ManifoldCF has no
> > > > extractors.
> > > > > >> >>>>>>>> The Repository connectors extracts directly the binary
> > and
> > > > > there
> > > > > >> is
> > > > > >> >>>> no
> > > > > >> >>>>>>>> "Extractor Processor" yet.
> > > > > >> >>>>>>>> But recently a pipe-line processor architecture has
> been
> > > > > thought
> > > > > >> (
> > > > > >> >>>>>>>> https://issues.apache.org/jira/browse/CONNECTORS-959)
> > > > > >> >>>>>>>> So can fit there.
> > > > > >> >>>>>>>>
> > > > > >> >>>>>>>> Cheers
> > > > > >> >>>>>>>>
> > > > > >> >>>>>>>>
> > > > > >> >>>>>>>> 2014-06-16 15:59 GMT+01:00 Matteo Grolla <
> > > > > >> m.grolla@sourcesense.com
> > > > > >> >>>>> :
> > > > > >> >>>>>>>>
> > > > > >> >>>>>>>>> Since Solr extracting request handler takes the binary
> > and
> > > > > >> extracts
> > > > > >> >>>>>>> text
> > > > > >> >>>>>>>>> what is the point of not using Manifold extractor and
> > send
> > > > > text
> > > > > >> and
> > > > > >> >>>>>>>>> binaries to solr?
> > > > > >> >>>>>>>>> I mean the end result is the same solr indexes text
> and
> > > > stores
> > > > > >> text
> > > > > >> >>>>>>>>> So if manifold supports text extraction it seems me
> this
> > > is
> > > > > the
> > > > > >> >>>> place
> > > > > >> >>>>>>>>> where it should be done
> > > > > >> >>>>>>>>>
> > > > > >> >>>>>>>>> --
> > > > > >> >>>>>>>>> Matteo Grolla
> > > > > >> >>>>>>>>> Sourcesense - making sense of Open Source
> > > > > >> >>>>>>>>> http://www.sourcesense.com
> > > > > >> >>>>>>>>>
> > > > > >> >>>>>>>>> Il giorno 16/giu/2014, alle ore 16:51, Antonio David
> > Perez
> > > > > >> Morales
> > > > > >> >>>> ha
> > > > > >> >>>>>>>>> scritto:
> > > > > >> >>>>>>>>>
> > > > > >> >>>>>>>>>> Hi Matteo
> > > > > >> >>>>>>>>>>
> > > > > >> >>>>>>>>>> Manifold already handles the extraction, but the only
> > way
> > > > to
> > > > > >> send
> > > > > >> >>>>>>> binary
> > > > > >> >>>>>>>>>> content and document metadata to Solr is using the
> > > > > >> update/extract
> > > > > >> >>>>>>>>> handler,
> > > > > >> >>>>>>>>>> where the metadata is sent as query parameters and
> the
> > > > binary
> > > > > >> >>>>> content
> > > > > >> >>>>>>> is
> > > > > >> >>>>>>>>>> sent in the body of the requests, allowing Solr to
> use
> > > Tika
> > > > > to
> > > > > >> >>>>> obtain
> > > > > >> >>>>>>> the
> > > > > >> >>>>>>>>>> raw content to be stored in Solr.
> > > > > >> >>>>>>>>>>
> > > > > >> >>>>>>>>>> Regards
> > > > > >> >>>>>>>>>>
> > > > > >> >>>>>>>>>>
> > > > > >> >>>>>>>>>> On Mon, Jun 16, 2014 at 4:35 PM, Matteo Grolla <
> > > > > >> >>>>>>> m.grolla@sourcesense.com
> > > > > >> >>>>>>>>>>
> > > > > >> >>>>>>>>>> wrote:
> > > > > >> >>>>>>>>>>
> > > > > >> >>>>>>>>>>> Hi During my first indexing I noticed that manifold
> > uses
> > > > > Solr
> > > > > >> >>>>>>> extracting
> > > > > >> >>>>>>>>>>> request handler to extract the content of an xml
> file
> > > > > >> >>>>>>>>>>> For performance reasons it would be better if
> Manifold
> > > > > handled
> > > > > >> >>>> the
> > > > > >> >>>>>>>>>>> extraction letting Solr do the search engine
> > > > > >> >>>>>>>>>>> Is this because of the connector design, framework
> > > design
> > > > or
> > > > > >> just
> > > > > >> >>>>> to
> > > > > >> >>>>>>> be
> > > > > >> >>>>>>>>>>> done?
> > > > > >> >>>>>>>>>>>
> > > > > >> >>>>>>>>>>> --
> > > > > >> >>>>>>>>>>> Matteo Grolla
> > > > > >> >>>>>>>>>>> Sourcesense - making sense of Open Source
> > > > > >> >>>>>>>>>>> http://www.sourcesense.com
> > > > > >> >>>>>>>>>>>
> > > > > >> >>>>>>>>>>>
> > > > > >> >>>>>>>>>>
> > > > > >> >>>>>>>>>> --
> > > > > >> >>>>>>>>>>
> > > > > >> >>>>>>>>>> ------------------------------
> > > > > >> >>>>>>>>>> This message should be regarded as confidential. If
> you
> > > > have
> > > > > >> >>>>> received
> > > > > >> >>>>>>>>> this
> > > > > >> >>>>>>>>>> email in error please notify the sender and destroy
> it
> > > > > >> >>>> immediately.
> > > > > >> >>>>>>>>>> Statements of intent shall only become binding when
> > > > confirmed
> > > > > >> in
> > > > > >> >>>>> hard
> > > > > >> >>>>>>>>> copy
> > > > > >> >>>>>>>>>> by an authorised signatory.
> > > > > >> >>>>>>>>>>
> > > > > >> >>>>>>>>>> Zaizi Ltd is registered in England and Wales with the
> > > > > >> registration
> > > > > >> >>>>>>> number
> > > > > >> >>>>>>>>>> 6440931. The Registered Office is Brook House, 229
> > > > Shepherds
> > > > > >> Bush
> > > > > >> >>>>>>> Road,
> > > > > >> >>>>>>>>>> London W6 7AN.
> > > > > >> >>>>>>>>>
> > > > > >> >>>>>>>>>
> > > > > >> >>>>>>>>
> > > > > >> >>>>>>>>
> > > > > >> >>>>>>>> --
> > > > > >> >>>>>>>> --------------------------
> > > > > >> >>>>>>>>
> > > > > >> >>>>>>>> Benedetti Alessandro
> > > > > >> >>>>>>>> Visiting card : http://about.me/alessandro_benedetti
> > > > > >> >>>>>>>>
> > > > > >> >>>>>>>> "Tyger, tyger burning bright
> > > > > >> >>>>>>>> In the forests of the night,
> > > > > >> >>>>>>>> What immortal hand or eye
> > > > > >> >>>>>>>> Could frame thy fearful symmetry?"
> > > > > >> >>>>>>>>
> > > > > >> >>>>>>>> William Blake - Songs of Experience -1794 England
> > > > > >> >>>>>>>
> > > > > >> >>>>>>>
> > > > > >> >>>>>>
> > > > > >> >>>>>
> > > > > >> >>>>
> > > > > >> >>>>
> > > > > >> >>>>
> > > > > >> >>>> --
> > > > > >> >>>> - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
> > - -
> > > > > >> >>>> Shinichiro Abe
> > > > > >> >>>> 阿部 慎一朗
> > > > > >> >>>>
> > > > > >> >>
> > > > > >> >>
> > > > > >>
> > > > > >>
> > > > > >
> > > > >
> > > >
> > > >
> > > >
> > > > --
> > > > --------------------------
> > > >
> > > > Benedetti Alessandro
> > > > Visiting card : http://about.me/alessandro_benedetti
> > > >
> > > > "Tyger, tyger burning bright
> > > > In the forests of the night,
> > > > What immortal hand or eye
> > > > Could frame thy fearful symmetry?"
> > > >
> > > > William Blake - Songs of Experience -1794 England
> > > >
> > >
> >
> >
> >
> > --
> > --------------------------
> >
> > Benedetti Alessandro
> > Visiting card : http://about.me/alessandro_benedetti
> >
> > "Tyger, tyger burning bright
> > In the forests of the night,
> > What immortal hand or eye
> > Could frame thy fearful symmetry?"
> >
> > William Blake - Songs of Experience -1794 England
> >
>



-- 
--------------------------

Benedetti Alessandro
Visiting card : http://about.me/alessandro_benedetti

"Tyger, tyger burning bright
In the forests of the night,
What immortal hand or eye
Could frame thy fearful symmetry?"

William Blake - Songs of Experience -1794 England

Re: Solr Extracting request handler

Posted by Karl Wright <da...@gmail.com>.
Hi Alessandro,

The reason for backwards compatibility is obvious: people upgrade
ManifoldCF all the time, and when they do it should not stop working for
them.

Putting Tika all the time in the pipeline is also not appropriate for other
output connections.  Even if you did it just for Solr, you'd then have to
insure that the Tika transformer was exactly compatible with Solr Cell,
which I would be very uncomfortable with agreeing to.

So let's presume that you'd do one of two things.  Either:

- Leave the existing Solr connector alone, and create a whole new Solr
connector designed to work with a Tika transformer, or
- Modify the existing Solr connector so that it operates in two possible
modes, one of which supports the legacy model (the default), and one of
which supports your new model

If this sounds overly burdensome, I'm sorry but it's necessary until MCF
2.0.  For MCF 2.0, which I've begun to think about, we can dispense with
backwards compatibility, including legacy tabs that have outlived their
usefulness, etc.  But that's not a 1.7 solution.

Karl



On Wed, Jun 18, 2014 at 10:16 AM, Alessandro Benedetti <
benedetti.alex85@gmail.com> wrote:

> Hello Karl,
> What i was thinking is:
> assuming we have the Tika Connector, the responsibility to extract content
> will pass from Solr to the Tika processor.
>
> So we can change the part in the Solr Connector that manages the building
> of the request to send to the Extract update handler.
> Particularly that part will change in the classic way: usually it's good to
> build a SolrDocument in SolrJ and then add it to SolrServer.
>
> Why should we give retrocompatibility from Solr Connector point of view ?
> From the user point of view, a Job will be selected with the Tika Conenctor
> in the pipeline, so we are providing the same identical feature.
> One way can be to make the Tika Processor Connector by default in the
> pipeline, and someone will be able to deactivate it only if needed.
>
> Cheers
>
>
>
> 2014-06-18 14:32 GMT+01:00 Karl Wright <da...@gmail.com>:
>
> > Hi Alessandro,
> > What is your concrete proposal to change the Solr connector?  Bear in
> mind
> > that we do need to maintain backwards compatibility.  If you list your
> > specific changes, not in any huge detail, but with enough detail that we
> > understand your proposal, that would help.  What happens to the UI?  What
> > happens to the internals?
> >
> > Thanks,
> > Karl
> >
> >
> >
> > On Wed, Jun 18, 2014 at 9:21 AM, Alessandro Benedetti <
> > benedetti.alex85@gmail.com> wrote:
> >
> > > But guys, why not simply pass to a classic SolrJ SolrDocument creation
> > and
> > > ingestion in the Solr Server ? Easy and Straighforward !
> > >
> > > In the end at that point the RepositoryDocument will me only a Map of
> > > metadata and values.
> > > Content will be part of that, so I guess the conversion to a
> SolrDocument
> > > will be immediate.
> > >
> > > Cheers
> > >
> > >
> > > 2014-06-18 3:26 GMT+01:00 Karl Wright <da...@gmail.com>:
> > >
> > > > Hi Abe-san,
> > > >
> > > > Near as I can tell, the major consumer of disk space is the Maven
> > target
> > > > directories.  This is generating many tens of megabytes of temporary
> > disk
> > > > usage for every connector.  Luckily if you use ant, this is not a
> > > problem.
> > > >
> > > > Karl
> > > >
> > > >
> > > > On Tue, Jun 17, 2014 at 9:55 PM, Karl Wright <da...@gmail.com>
> > wrote:
> > > >
> > > > > Hi Abe-san,
> > > > >
> > > > > Tika jars are not very big:
> > > > >
> > > > > C:\wip\mcf\trunk\lib>dir tika*
> > > > >  Volume in drive C has no label.
> > > > >  Volume Serial Number is 002E-D1F0
> > > > >
> > > > >  Directory of C:\wip\mcf\trunk\lib
> > > > >
> > > > > 06/05/2014  08:21 AM           493,374 tika-core.jar
> > > > > 06/05/2014  08:21 AM           523,677 tika-parsers.jar
> > > > >                2 File(s)      1,017,051 bytes
> > > > >                0 Dir(s)  140,792,315,904 bytes free
> > > > >
> > > > > The entire lib directory is 85M:
> > > > >
> > > > > 85,156,330 bytes
> > > > >
> > > > > The built binary image is still about 185Mb, I believe.  So I don't
> > > know
> > > > > why you think it is >1Gb?  Temporary class files?  I don't think we
> > can
> > > > > avoid those.
> > > > >
> > > > > I'd rather not make things more complicated than they need to be by
> > > > adding
> > > > > a new required service - even though it would fit naturally with
> the
> > > > > connector arrangement.
> > > > >
> > > > > Karl
> > > > >
> > > > >
> > > > >
> > > > >
> > > > >
> > > > > On Tue, Jun 17, 2014 at 9:42 PM, Shinichiro Abe <
> > > > > shinichiro.abe.1@gmail.com> wrote:
> > > > >
> > > > >> Hi Karl,
> > > > >>
> > > > >> Okay, I assumed Tika connector outputs files.
> > > > >> If we post character data metadata got from Tika,
> "/update/extract"
> > > > >> handler
> > > > >> can handle this(provides params:
> > > > >> literal.content=value&literal.metaField=foobar
> > > > >> with using NullInputStream for binary data like CONNECTORS-936).
> > > > >>
> > > > >> BTW, now trunk built size is too big(1G+). Maybe because
> CloudSearch
> > > > >> connector uses Tika jars.
> > > > >> Tika connector and CloudSearch connector should extract text via
> > > > >> tika-server[1]
> > > > >> and MCF should not have many Tika jars, do you think?
> > > > >>
> > > > >> [1]
> > > > >> http://wiki.apache.org/tika/TikaJAXRS
> > > > >>
> > > > >> Thanks,
> > > > >> Shinichiro Abe
> > > > >>
> > > > >> On 2014/06/18, at 9:45, Karl Wright <da...@gmail.com> wrote:
> > > > >>
> > > > >> > Hi Abe-san,
> > > > >> >
> > > > >> > It sounds like you might be thinking that transformation
> > connectors
> > > > are
> > > > >> > like output connectors.  Just so we are clear, transformation
> > > > >> connectors in
> > > > >> > 1.7 receive a RepositoryDocument as input, and then pass a
> > > > >> > RepositoryDocument on to the next connector in the chain.  So I
> > > don't
> > > > >> know
> > > > >> > why .xml files would be involved.  I'd expect the Tika connector
> > to
> > > > >> read a
> > > > >> > binary file from one RepositoryDocument object and convert its
> > > > contents
> > > > >> to
> > > > >> > another RepositoryDocument object which would have character
> data
> > > and
> > > > >> > metadata only.  Would this work for your case, do you think?
> > > > >> >
> > > > >> > Karl
> > > > >> >
> > > > >> >
> > > > >> >
> > > > >> > On Tue, Jun 17, 2014 at 8:38 PM, Shinichiro Abe <
> > > > >> shinichiro.abe.1@gmail.com>
> > > > >> > wrote:
> > > > >> >
> > > > >> >> Hi Karl,
> > > > >> >>
> > > > >> >> Yes. I thought the standard update handler met that
> requirement.
> > > > >> >> For instance, Tika extractor transformation connector creates
> two
> > > > >> files.
> > > > >> >> 1. addtoSolr.xml for add and update
> > > > >> >> 2. deletetoSolr.xml for delete
> > > > >> >> File connector ingests these xml files, then Solr connector
> posts
> > > > these
> > > > >> >> files by "/update" handler.
> > > > >> >>
> > > > >> >> In the the Solr Connector, other function as to update handler
> > > > >> >> might not be necessary except for  "/update" handler.
> > > > >> >>
> > > > >> >> Thanks,
> > > > >> >> Shinichiro Abe
> > > > >> >>
> > > > >> >> On 2014/06/18, at 8:02, Karl Wright <da...@gmail.com>
> wrote:
> > > > >> >>
> > > > >> >>> Hi Abe-san,
> > > > >> >>>
> > > > >> >>> So just to be sure -- you believe that no changes at all are
> > > > required
> > > > >> to
> > > > >> >>> the Solr Connector as it stands now, other than to use the
> > update
> > > > >> handler
> > > > >> >>> rather than the /update/extract handler?
> > > > >> >>>
> > > > >> >>> Karl
> > > > >> >>>
> > > > >> >>>
> > > > >> >>>
> > > > >> >>>
> > > > >> >>>
> > > > >> >>> On Tue, Jun 17, 2014 at 5:14 PM, Shinichiro Abe <
> > > > >> >> shinichiro.abe.1@gmail.com>
> > > > >> >>> wrote:
> > > > >> >>>
> > > > >> >>>>> As for changing the Solr connector so that it doesn't go to
> > the
> > > > >> >> extracting
> > > > >> >>>> update handler
> > > > >> >>>>
> > > > >> >>>> I don't think it needs to change Solr connector with new
> > checkbox
> > > > >> >> because
> > > > >> >>>> currently we can change "/update/extract" into "/update" at
> > > 'Update
> > > > >> >>>> Handler' at Paths tab in Solr connector UI. I confirmed I
> could
> > > > post
> > > > >> >> CSV,
> > > > >> >>>> JSON and XML files to Solr by changing that and using File
> > > > connector.
> > > > >> >> So I
> > > > >> >>>> wish we allow Tika extractor transformation connector to
> create
> > > XML
> > > > >> >> files
> > > > >> >>>> that Solr expects to see.
> > > > >> >>>>
> > > > >> >>>> Regards,
> > > > >> >>>> Shinichiro Abe
> > > > >> >>>>
> > > > >> >>>>
> > > > >> >>>> 2014-06-18 2:55 GMT+09:00 Karl Wright <da...@gmail.com>:
> > > > >> >>>>
> > > > >> >>>>> The pipeline code itself is now "complete" in trunk.  Zaizi
> > said
> > > > >> they'd
> > > > >> >>>>> contribute a Tika extractor transformation connector - and
> if
> > > they
> > > > >> >> don't
> > > > >> >>>>> get around to that in a month or so, I may take a crack at
> it
> > > > >> myself.
> > > > >> >>>>>
> > > > >> >>>>> As for changing the Solr connector so that it doesn't go to
> > the
> > > > >> >>>> extracting
> > > > >> >>>>> update handler, it would be great if:
> > > > >> >>>>> (1) Someone created a ticket for this, and
> > > > >> >>>>> (2) A patch was provided that maintains backwards
> > compatibility
> > > > with
> > > > >> >>>>> previous versions of the connector (so a checkbox would
> > probably
> > > > >> need
> > > > >> >> to
> > > > >> >>>> go
> > > > >> >>>>> into the UI somewhere).  Do either of you want to start this
> > > > >> process?
> > > > >> >>>>>
> > > > >> >>>>> Thanks!
> > > > >> >>>>> Karl
> > > > >> >>>>>
> > > > >> >>>>>
> > > > >> >>>>>
> > > > >> >>>>> On Mon, Jun 16, 2014 at 12:37 PM, Karl Wright <
> > > daddywri@gmail.com
> > > > >
> > > > >> >>>> wrote:
> > > > >> >>>>>
> > > > >> >>>>>> Hi guys,
> > > > >> >>>>>>
> > > > >> >>>>>> You folks may not have looked at 1.7 yet, but it has a full
> > > > >> pipeline,
> > > > >> >>>> and
> > > > >> >>>>>> is expected to have a Tika extractor as a transformation
> > > > connector.
> > > > >> >>>>>>
> > > > >> >>>>>> Karl
> > > > >> >>>>>>
> > > > >> >>>>>>
> > > > >> >>>>>>
> > > > >> >>>>>> On Mon, Jun 16, 2014 at 11:14 AM, Matteo Grolla <
> > > > >> >>>>> m.grolla@sourcesense.com>
> > > > >> >>>>>> wrote:
> > > > >> >>>>>>
> > > > >> >>>>>>> Thanks Alessandro,
> > > > >> >>>>>>>       that explains the situation clearly.
> > > > >> >>>>>>> And I agree that sending all the metadata as get parameter
> > can
> > > > be
> > > > >> >>>>>>> problematic
> > > > >> >>>>>>>
> > > > >> >>>>>>> Cheers
> > > > >> >>>>>>>
> > > > >> >>>>>>> --
> > > > >> >>>>>>> Matteo Grolla
> > > > >> >>>>>>> Sourcesense - making sense of Open Source
> > > > >> >>>>>>> http://www.sourcesense.com
> > > > >> >>>>>>>
> > > > >> >>>>>>> Il giorno 16/giu/2014, alle ore 17:09, Alessandro
> Benedetti
> > ha
> > > > >> >>>> scritto:
> > > > >> >>>>>>>
> > > > >> >>>>>>>> mmmm the point is that right now ManifoldCF has no
> > > extractors.
> > > > >> >>>>>>>> The Repository connectors extracts directly the binary
> and
> > > > there
> > > > >> is
> > > > >> >>>> no
> > > > >> >>>>>>>> "Extractor Processor" yet.
> > > > >> >>>>>>>> But recently a pipe-line processor architecture has been
> > > > thought
> > > > >> (
> > > > >> >>>>>>>> https://issues.apache.org/jira/browse/CONNECTORS-959)
> > > > >> >>>>>>>> So can fit there.
> > > > >> >>>>>>>>
> > > > >> >>>>>>>> Cheers
> > > > >> >>>>>>>>
> > > > >> >>>>>>>>
> > > > >> >>>>>>>> 2014-06-16 15:59 GMT+01:00 Matteo Grolla <
> > > > >> m.grolla@sourcesense.com
> > > > >> >>>>> :
> > > > >> >>>>>>>>
> > > > >> >>>>>>>>> Since Solr extracting request handler takes the binary
> and
> > > > >> extracts
> > > > >> >>>>>>> text
> > > > >> >>>>>>>>> what is the point of not using Manifold extractor and
> send
> > > > text
> > > > >> and
> > > > >> >>>>>>>>> binaries to solr?
> > > > >> >>>>>>>>> I mean the end result is the same solr indexes text and
> > > stores
> > > > >> text
> > > > >> >>>>>>>>> So if manifold supports text extraction it seems me this
> > is
> > > > the
> > > > >> >>>> place
> > > > >> >>>>>>>>> where it should be done
> > > > >> >>>>>>>>>
> > > > >> >>>>>>>>> --
> > > > >> >>>>>>>>> Matteo Grolla
> > > > >> >>>>>>>>> Sourcesense - making sense of Open Source
> > > > >> >>>>>>>>> http://www.sourcesense.com
> > > > >> >>>>>>>>>
> > > > >> >>>>>>>>> Il giorno 16/giu/2014, alle ore 16:51, Antonio David
> Perez
> > > > >> Morales
> > > > >> >>>> ha
> > > > >> >>>>>>>>> scritto:
> > > > >> >>>>>>>>>
> > > > >> >>>>>>>>>> Hi Matteo
> > > > >> >>>>>>>>>>
> > > > >> >>>>>>>>>> Manifold already handles the extraction, but the only
> way
> > > to
> > > > >> send
> > > > >> >>>>>>> binary
> > > > >> >>>>>>>>>> content and document metadata to Solr is using the
> > > > >> update/extract
> > > > >> >>>>>>>>> handler,
> > > > >> >>>>>>>>>> where the metadata is sent as query parameters and the
> > > binary
> > > > >> >>>>> content
> > > > >> >>>>>>> is
> > > > >> >>>>>>>>>> sent in the body of the requests, allowing Solr to use
> > Tika
> > > > to
> > > > >> >>>>> obtain
> > > > >> >>>>>>> the
> > > > >> >>>>>>>>>> raw content to be stored in Solr.
> > > > >> >>>>>>>>>>
> > > > >> >>>>>>>>>> Regards
> > > > >> >>>>>>>>>>
> > > > >> >>>>>>>>>>
> > > > >> >>>>>>>>>> On Mon, Jun 16, 2014 at 4:35 PM, Matteo Grolla <
> > > > >> >>>>>>> m.grolla@sourcesense.com
> > > > >> >>>>>>>>>>
> > > > >> >>>>>>>>>> wrote:
> > > > >> >>>>>>>>>>
> > > > >> >>>>>>>>>>> Hi During my first indexing I noticed that manifold
> uses
> > > > Solr
> > > > >> >>>>>>> extracting
> > > > >> >>>>>>>>>>> request handler to extract the content of an xml file
> > > > >> >>>>>>>>>>> For performance reasons it would be better if Manifold
> > > > handled
> > > > >> >>>> the
> > > > >> >>>>>>>>>>> extraction letting Solr do the search engine
> > > > >> >>>>>>>>>>> Is this because of the connector design, framework
> > design
> > > or
> > > > >> just
> > > > >> >>>>> to
> > > > >> >>>>>>> be
> > > > >> >>>>>>>>>>> done?
> > > > >> >>>>>>>>>>>
> > > > >> >>>>>>>>>>> --
> > > > >> >>>>>>>>>>> Matteo Grolla
> > > > >> >>>>>>>>>>> Sourcesense - making sense of Open Source
> > > > >> >>>>>>>>>>> http://www.sourcesense.com
> > > > >> >>>>>>>>>>>
> > > > >> >>>>>>>>>>>
> > > > >> >>>>>>>>>>
> > > > >> >>>>>>>>>> --
> > > > >> >>>>>>>>>>
> > > > >> >>>>>>>>>> ------------------------------
> > > > >> >>>>>>>>>> This message should be regarded as confidential. If you
> > > have
> > > > >> >>>>> received
> > > > >> >>>>>>>>> this
> > > > >> >>>>>>>>>> email in error please notify the sender and destroy it
> > > > >> >>>> immediately.
> > > > >> >>>>>>>>>> Statements of intent shall only become binding when
> > > confirmed
> > > > >> in
> > > > >> >>>>> hard
> > > > >> >>>>>>>>> copy
> > > > >> >>>>>>>>>> by an authorised signatory.
> > > > >> >>>>>>>>>>
> > > > >> >>>>>>>>>> Zaizi Ltd is registered in England and Wales with the
> > > > >> registration
> > > > >> >>>>>>> number
> > > > >> >>>>>>>>>> 6440931. The Registered Office is Brook House, 229
> > > Shepherds
> > > > >> Bush
> > > > >> >>>>>>> Road,
> > > > >> >>>>>>>>>> London W6 7AN.
> > > > >> >>>>>>>>>
> > > > >> >>>>>>>>>
> > > > >> >>>>>>>>
> > > > >> >>>>>>>>
> > > > >> >>>>>>>> --
> > > > >> >>>>>>>> --------------------------
> > > > >> >>>>>>>>
> > > > >> >>>>>>>> Benedetti Alessandro
> > > > >> >>>>>>>> Visiting card : http://about.me/alessandro_benedetti
> > > > >> >>>>>>>>
> > > > >> >>>>>>>> "Tyger, tyger burning bright
> > > > >> >>>>>>>> In the forests of the night,
> > > > >> >>>>>>>> What immortal hand or eye
> > > > >> >>>>>>>> Could frame thy fearful symmetry?"
> > > > >> >>>>>>>>
> > > > >> >>>>>>>> William Blake - Songs of Experience -1794 England
> > > > >> >>>>>>>
> > > > >> >>>>>>>
> > > > >> >>>>>>
> > > > >> >>>>>
> > > > >> >>>>
> > > > >> >>>>
> > > > >> >>>>
> > > > >> >>>> --
> > > > >> >>>> - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
> - -
> > > > >> >>>> Shinichiro Abe
> > > > >> >>>> 阿部 慎一朗
> > > > >> >>>>
> > > > >> >>
> > > > >> >>
> > > > >>
> > > > >>
> > > > >
> > > >
> > >
> > >
> > >
> > > --
> > > --------------------------
> > >
> > > Benedetti Alessandro
> > > Visiting card : http://about.me/alessandro_benedetti
> > >
> > > "Tyger, tyger burning bright
> > > In the forests of the night,
> > > What immortal hand or eye
> > > Could frame thy fearful symmetry?"
> > >
> > > William Blake - Songs of Experience -1794 England
> > >
> >
>
>
>
> --
> --------------------------
>
> Benedetti Alessandro
> Visiting card : http://about.me/alessandro_benedetti
>
> "Tyger, tyger burning bright
> In the forests of the night,
> What immortal hand or eye
> Could frame thy fearful symmetry?"
>
> William Blake - Songs of Experience -1794 England
>

Re: Solr Extracting request handler

Posted by Matteo Grolla <m....@sourcesense.com>.
Hi Alessandro,
	ideally I think that text extraction from rich documents should be Manifold's responsibility, not Solr's.
So the ideal place to implement it would be in the new document processing pipeline (using Tika).
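
Just to make that concrete, the extraction inside such a pipeline step could be a thin wrapper around the plain Tika API, roughly like this (a minimal sketch using only standard Tika classes, not real ManifoldCF connector code; the method name and signature are just for illustration):

import java.io.InputStream;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.sax.BodyContentHandler;

// Binary stream in, plain text out; Tika fills the Metadata object as a side effect.
// The -1 write limit lifts BodyContentHandler's default 100k character cap.
public static String extractText(InputStream binary, Metadata metadata) throws Exception {
    AutoDetectParser parser = new AutoDetectParser();
    BodyContentHandler handler = new BodyContentHandler(-1);
    parser.parse(binary, handler, metadata, new ParseContext());
    return handler.toString();
}

The returned character data and the populated Metadata would then be copied onto the outgoing RepositoryDocument, so the output connector never has to ship binaries.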

-- 
Matteo Grolla
Sourcesense - making sense of Open Source
http://www.sourcesense.com

On 18 Jun 2014, at 16:16, Alessandro Benedetti wrote:

> Hello Karl,
> What i was thinking is:
> assuming we have the Tika Connector, the responsibility to extract content
> will pass from Solr to the Tika processor.
> 
> So we can change the part in the Solr Connector that manages the building
> of the request to send to the Extract update handler.
> Particularly that part will change in the classic way: usually it's good to
> build a SolrDocument in SolrJ and then add it to SolrServer.
> 
> Why should we give retrocompatibility from Solr Connector point of view ?
> From the user point of view, a Job will be selected with the Tika Conenctor
> in the pipeline, so we are providing the same identical feature.
> One way can be to make the Tika Processor Connector by default in the
> pipeline, and someone will be able to deactivate it only if needed.
> 
> Cheers
> 
> 
> 
> 2014-06-18 14:32 GMT+01:00 Karl Wright <da...@gmail.com>:
> 
>> Hi Alessandro,
>> What is your concrete proposal to change the Solr connector?  Bear in mind
>> that we do need to maintain backwards compatibility.  If you list your
>> specific changes, not in any huge detail, but with enough detail that we
>> understand your proposal, that would help.  What happens to the UI?  What
>> happens to the internals?
>> 
>> Thanks,
>> Karl
>> 
>> 
>> 
>> On Wed, Jun 18, 2014 at 9:21 AM, Alessandro Benedetti <
>> benedetti.alex85@gmail.com> wrote:
>> 
>>> But guys, why not simply pass to a classic SolrJ SolrDocument creation
>> and
>>> ingestion in the Solr Server ? Easy and Straighforward !
>>> 
>>> In the end at that point the RepositoryDocument will me only a Map of
>>> metadata and values.
>>> Content will be part of that, so I guess the conversion to a SolrDocument
>>> will be immediate.
>>> 
>>> Cheers
>>> 
>>> 
>>> 2014-06-18 3:26 GMT+01:00 Karl Wright <da...@gmail.com>:
>>> 
>>>> Hi Abe-san,
>>>> 
>>>> Near as I can tell, the major consumer of disk space is the Maven
>> target
>>>> directories.  This is generating many tens of megabytes of temporary
>> disk
>>>> usage for every connector.  Luckily if you use ant, this is not a
>>> problem.
>>>> 
>>>> Karl
>>>> 
>>>> 
>>>> On Tue, Jun 17, 2014 at 9:55 PM, Karl Wright <da...@gmail.com>
>> wrote:
>>>> 
>>>>> Hi Abe-san,
>>>>> 
>>>>> Tika jars are not very big:
>>>>> 
>>>>> C:\wip\mcf\trunk\lib>dir tika*
>>>>> Volume in drive C has no label.
>>>>> Volume Serial Number is 002E-D1F0
>>>>> 
>>>>> Directory of C:\wip\mcf\trunk\lib
>>>>> 
>>>>> 06/05/2014  08:21 AM           493,374 tika-core.jar
>>>>> 06/05/2014  08:21 AM           523,677 tika-parsers.jar
>>>>>               2 File(s)      1,017,051 bytes
>>>>>               0 Dir(s)  140,792,315,904 bytes free
>>>>> 
>>>>> The entire lib directory is 85M:
>>>>> 
>>>>> 85,156,330 bytes
>>>>> 
>>>>> The built binary image is still about 185Mb, I believe.  So I don't
>>> know
>>>>> why you think it is >1Gb?  Temporary class files?  I don't think we
>> can
>>>>> avoid those.
>>>>> 
>>>>> I'd rather not make things more complicated than they need to be by
>>>> adding
>>>>> a new required service - even though it would fit naturally with the
>>>>> connector arrangement.
>>>>> 
>>>>> Karl
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> On Tue, Jun 17, 2014 at 9:42 PM, Shinichiro Abe <
>>>>> shinichiro.abe.1@gmail.com> wrote:
>>>>> 
>>>>>> Hi Karl,
>>>>>> 
>>>>>> Okay, I assumed Tika connector outputs files.
>>>>>> If we post character data metadata got from Tika, "/update/extract"
>>>>>> handler
>>>>>> can handle this(provides params:
>>>>>> literal.content=value&literal.metaField=foobar
>>>>>> with using NullInputStream for binary data like CONNECTORS-936).
>>>>>> 
>>>>>> BTW, now trunk built size is too big(1G+). Maybe because CloudSearch
>>>>>> connector uses Tika jars.
>>>>>> Tika connector and CloudSearch connector should extract text via
>>>>>> tika-server[1]
>>>>>> and MCF should not have many Tika jars, do you think?
>>>>>> 
>>>>>> [1]
>>>>>> http://wiki.apache.org/tika/TikaJAXRS
>>>>>> 
>>>>>> Thanks,
>>>>>> Shinichiro Abe
>>>>>> 
>>>>>> On 2014/06/18, at 9:45, Karl Wright <da...@gmail.com> wrote:
>>>>>> 
>>>>>>> Hi Abe-san,
>>>>>>> 
>>>>>>> It sounds like you might be thinking that transformation
>> connectors
>>>> are
>>>>>>> like output connectors.  Just so we are clear, transformation
>>>>>> connectors in
>>>>>>> 1.7 receive a RepositoryDocument as input, and then pass a
>>>>>>> RepositoryDocument on to the next connector in the chain.  So I
>>> don't
>>>>>> know
>>>>>>> why .xml files would be involved.  I'd expect the Tika connector
>> to
>>>>>> read a
>>>>>>> binary file from one RepositoryDocument object and convert its
>>>> contents
>>>>>> to
>>>>>>> another RepositoryDocument object which would have character data
>>> and
>>>>>>> metadata only.  Would this work for your case, do you think?
>>>>>>> 
>>>>>>> Karl
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> On Tue, Jun 17, 2014 at 8:38 PM, Shinichiro Abe <
>>>>>> shinichiro.abe.1@gmail.com>
>>>>>>> wrote:
>>>>>>> 
>>>>>>>> Hi Karl,
>>>>>>>> 
>>>>>>>> Yes. I thought the standard update handler met that requirement.
>>>>>>>> For instance, Tika extractor transformation connector creates two
>>>>>> files.
>>>>>>>> 1. addtoSolr.xml for add and update
>>>>>>>> 2. deletetoSolr.xml for delete
>>>>>>>> File connector ingests these xml files, then Solr connector posts
>>>> these
>>>>>>>> files by "/update" handler.
>>>>>>>> 
>>>>>>>> In the the Solr Connector, other function as to update handler
>>>>>>>> might not be necessary except for  "/update" handler.
>>>>>>>> 
>>>>>>>> Thanks,
>>>>>>>> Shinichiro Abe
>>>>>>>> 
>>>>>>>> On 2014/06/18, at 8:02, Karl Wright <da...@gmail.com> wrote:
>>>>>>>> 
>>>>>>>>> Hi Abe-san,
>>>>>>>>> 
>>>>>>>>> So just to be sure -- you believe that no changes at all are
>>>> required
>>>>>> to
>>>>>>>>> the Solr Connector as it stands now, other than to use the
>> update
>>>>>> handler
>>>>>>>>> rather than the /update/extract handler?
>>>>>>>>> 
>>>>>>>>> Karl
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> On Tue, Jun 17, 2014 at 5:14 PM, Shinichiro Abe <
>>>>>>>> shinichiro.abe.1@gmail.com>
>>>>>>>>> wrote:
>>>>>>>>> 
>>>>>>>>>>> As for changing the Solr connector so that it doesn't go to
>> the
>>>>>>>> extracting
>>>>>>>>>> update handler
>>>>>>>>>> 
>>>>>>>>>> I don't think it needs to change Solr connector with new
>> checkbox
>>>>>>>> because
>>>>>>>>>> currently we can change "/update/extract" into "/update" at
>>> 'Update
>>>>>>>>>> Handler' at Paths tab in Solr connector UI. I confirmed I could
>>>> post
>>>>>>>> CSV,
>>>>>>>>>> JSON and XML files to Solr by changing that and using File
>>>> connector.
>>>>>>>> So I
>>>>>>>>>> wish we allow Tika extractor transformation connector to create
>>> XML
>>>>>>>> files
>>>>>>>>>> that Solr expects to see.
>>>>>>>>>> 
>>>>>>>>>> Regards,
>>>>>>>>>> Shinichiro Abe
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> 2014-06-18 2:55 GMT+09:00 Karl Wright <da...@gmail.com>:
>>>>>>>>>> 
>>>>>>>>>>> The pipeline code itself is now "complete" in trunk.  Zaizi
>> said
>>>>>> they'd
>>>>>>>>>>> contribute a Tika extractor transformation connector - and if
>>> they
>>>>>>>> don't
>>>>>>>>>>> get around to that in a month or so, I may take a crack at it
>>>>>> myself.
>>>>>>>>>>> 
>>>>>>>>>>> As for changing the Solr connector so that it doesn't go to
>> the
>>>>>>>>>> extracting
>>>>>>>>>>> update handler, it would be great if:
>>>>>>>>>>> (1) Someone created a ticket for this, and
>>>>>>>>>>> (2) A patch was provided that maintains backwards
>> compatibility
>>>> with
>>>>>>>>>>> previous versions of the connector (so a checkbox would
>> probably
>>>>>> need
>>>>>>>> to
>>>>>>>>>> go
>>>>>>>>>>> into the UI somewhere).  Do either of you want to start this
>>>>>> process?
>>>>>>>>>>> 
>>>>>>>>>>> Thanks!
>>>>>>>>>>> Karl
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> On Mon, Jun 16, 2014 at 12:37 PM, Karl Wright <
>>> daddywri@gmail.com
>>>>> 
>>>>>>>>>> wrote:
>>>>>>>>>>> 
>>>>>>>>>>>> Hi guys,
>>>>>>>>>>>> 
>>>>>>>>>>>> You folks may not have looked at 1.7 yet, but it has a full
>>>>>> pipeline,
>>>>>>>>>> and
>>>>>>>>>>>> is expected to have a Tika extractor as a transformation
>>>> connector.
>>>>>>>>>>>> 
>>>>>>>>>>>> Karl
>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>> On Mon, Jun 16, 2014 at 11:14 AM, Matteo Grolla <
>>>>>>>>>>> m.grolla@sourcesense.com>
>>>>>>>>>>>> wrote:
>>>>>>>>>>>> 
>>>>>>>>>>>>> Thanks Alessandro,
>>>>>>>>>>>>>      that explains the situation clearly.
>>>>>>>>>>>>> And I agree that sending all the metadata as get parameter
>> can
>>>> be
>>>>>>>>>>>>> problematic
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Cheers
>>>>>>>>>>>>> 
>>>>>>>>>>>>> --
>>>>>>>>>>>>> Matteo Grolla
>>>>>>>>>>>>> Sourcesense - making sense of Open Source
>>>>>>>>>>>>> http://www.sourcesense.com
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Il giorno 16/giu/2014, alle ore 17:09, Alessandro Benedetti
>> ha
>>>>>>>>>> scritto:
>>>>>>>>>>>>> 
>>>>>>>>>>>>>> mmmm the point is that right now ManifoldCF has no
>>> extractors.
>>>>>>>>>>>>>> The Repository connectors extracts directly the binary and
>>>> there
>>>>>> is
>>>>>>>>>> no
>>>>>>>>>>>>>> "Extractor Processor" yet.
>>>>>>>>>>>>>> But recently a pipe-line processor architecture has been
>>>> thought
>>>>>> (
>>>>>>>>>>>>>> https://issues.apache.org/jira/browse/CONNECTORS-959)
>>>>>>>>>>>>>> So can fit there.
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> Cheers
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> 2014-06-16 15:59 GMT+01:00 Matteo Grolla <
>>>>>> m.grolla@sourcesense.com
>>>>>>>>>>> :
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> Since Solr extracting request handler takes the binary and
>>>>>> extracts
>>>>>>>>>>>>> text
>>>>>>>>>>>>>>> what is the point of not using Manifold extractor and send
>>>> text
>>>>>> and
>>>>>>>>>>>>>>> binaries to solr?
>>>>>>>>>>>>>>> I mean the end result is the same solr indexes text and
>>> stores
>>>>>> text
>>>>>>>>>>>>>>> So if manifold supports text extraction it seems me this
>> is
>>>> the
>>>>>>>>>> place
>>>>>>>>>>>>>>> where it should be done
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> --
>>>>>>>>>>>>>>> Matteo Grolla
>>>>>>>>>>>>>>> Sourcesense - making sense of Open Source
>>>>>>>>>>>>>>> http://www.sourcesense.com
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> Il giorno 16/giu/2014, alle ore 16:51, Antonio David Perez
>>>>>> Morales
>>>>>>>>>> ha
>>>>>>>>>>>>>>> scritto:
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> Hi Matteo
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> Manifold already handles the extraction, but the only way
>>> to
>>>>>> send
>>>>>>>>>>>>> binary
>>>>>>>>>>>>>>>> content and document metadata to Solr is using the
>>>>>> update/extract
>>>>>>>>>>>>>>> handler,
>>>>>>>>>>>>>>>> where the metadata is sent as query parameters and the
>>> binary
>>>>>>>>>>> content
>>>>>>>>>>>>> is
>>>>>>>>>>>>>>>> sent in the body of the requests, allowing Solr to use
>> Tika
>>>> to
>>>>>>>>>>> obtain
>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>> raw content to be stored in Solr.
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> Regards
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> On Mon, Jun 16, 2014 at 4:35 PM, Matteo Grolla <
>>>>>>>>>>>>> m.grolla@sourcesense.com
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> Hi During my first indexing I noticed that manifold uses
>>>> Solr
>>>>>>>>>>>>> extracting
>>>>>>>>>>>>>>>>> request handler to extract the content of an xml file
>>>>>>>>>>>>>>>>> For performance reasons it would be better if Manifold
>>>> handled
>>>>>>>>>> the
>>>>>>>>>>>>>>>>> extraction letting Solr do the search engine
>>>>>>>>>>>>>>>>> Is this because of the connector design, framework
>> design
>>> or
>>>>>> just
>>>>>>>>>>> to
>>>>>>>>>>>>> be
>>>>>>>>>>>>>>>>> done?
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> --
>>>>>>>>>>>>>>>>> Matteo Grolla
>>>>>>>>>>>>>>>>> Sourcesense - making sense of Open Source
>>>>>>>>>>>>>>>>> http://www.sourcesense.com
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> --
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> ------------------------------
>>>>>>>>>>>>>>>> This message should be regarded as confidential. If you
>>> have
>>>>>>>>>>> received
>>>>>>>>>>>>>>> this
>>>>>>>>>>>>>>>> email in error please notify the sender and destroy it
>>>>>>>>>> immediately.
>>>>>>>>>>>>>>>> Statements of intent shall only become binding when
>>> confirmed
>>>>>> in
>>>>>>>>>>> hard
>>>>>>>>>>>>>>> copy
>>>>>>>>>>>>>>>> by an authorised signatory.
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> Zaizi Ltd is registered in England and Wales with the
>>>>>> registration
>>>>>>>>>>>>> number
>>>>>>>>>>>>>>>> 6440931. The Registered Office is Brook House, 229
>>> Shepherds
>>>>>> Bush
>>>>>>>>>>>>> Road,
>>>>>>>>>>>>>>>> London W6 7AN.
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> --
>>>>>>>>>>>>>> --------------------------
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> Benedetti Alessandro
>>>>>>>>>>>>>> Visiting card : http://about.me/alessandro_benedetti
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> "Tyger, tyger burning bright
>>>>>>>>>>>>>> In the forests of the night,
>>>>>>>>>>>>>> What immortal hand or eye
>>>>>>>>>>>>>> Could frame thy fearful symmetry?"
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> William Blake - Songs of Experience -1794 England
>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> --
>>>>>>>>>> - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
>>>>>>>>>> Shinichiro Abe
>>>>>>>>>> 阿部 慎一朗
>>>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>> 
>>>>>> 
>>>>> 
>>>> 
>>> 
>>> 
>>> 
>>> --
>>> --------------------------
>>> 
>>> Benedetti Alessandro
>>> Visiting card : http://about.me/alessandro_benedetti
>>> 
>>> "Tyger, tyger burning bright
>>> In the forests of the night,
>>> What immortal hand or eye
>>> Could frame thy fearful symmetry?"
>>> 
>>> William Blake - Songs of Experience -1794 England
>>> 
>> 
> 
> 
> 
> -- 
> --------------------------
> 
> Benedetti Alessandro
> Visiting card : http://about.me/alessandro_benedetti
> 
> "Tyger, tyger burning bright
> In the forests of the night,
> What immortal hand or eye
> Could frame thy fearful symmetry?"
> 
> William Blake - Songs of Experience -1794 England


Re: Solr Extracting request handler

Posted by Alessandro Benedetti <be...@gmail.com>.
Hello Karl,
What I was thinking is:
assuming we have the Tika Connector, the responsibility to extract content
will pass from Solr to the Tika processor.

So we can change the part of the Solr Connector that builds the request
sent to the extract update handler.
In particular, that part would switch to the classic approach: build a
SolrInputDocument in SolrJ and then add it to the SolrServer.
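
To sketch what that would look like on the SolrJ side (just an illustration against the SolrJ 4.x API; the URL, field names and helper method are made up for the example):

import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

// Hypothetical sketch: in the real connector the SolrServer instance would be
// created once from the connection configuration and reused, not built per document.
public static void indexDocument(String documentURI, String extractedText) throws Exception {
    SolrServer server = new HttpSolrServer("http://localhost:8983/solr/collection1");

    SolrInputDocument doc = new SolrInputDocument();
    doc.addField("id", documentURI);        // identifier coming from the framework
    doc.addField("content", extractedText); // character data already produced by the Tika step
    // ...one addField call per metadata name/value from the RepositoryDocument...

    server.add(doc);
    server.commit();
}

For the add/update path this would replace the multipart POST to the extract handler; the metadata would travel inside the document instead of as query parameters.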

Why should we need to keep backwards compatibility from the Solr Connector's
point of view? From the user's point of view, a Job will be configured with
the Tika Connector in the pipeline, so we would be providing exactly the
same feature.
One option would be to make the Tika Processor Connector part of the
pipeline by default, and let users deactivate it only if needed.

Cheers



2014-06-18 14:32 GMT+01:00 Karl Wright <da...@gmail.com>:

> Hi Alessandro,
> What is your concrete proposal to change the Solr connector?  Bear in mind
> that we do need to maintain backwards compatibility.  If you list your
> specific changes, not in any huge detail, but with enough detail that we
> understand your proposal, that would help.  What happens to the UI?  What
> happens to the internals?
>
> Thanks,
> Karl
>
>
>
> On Wed, Jun 18, 2014 at 9:21 AM, Alessandro Benedetti <
> benedetti.alex85@gmail.com> wrote:
>
> > But guys, why not simply pass to a classic SolrJ SolrDocument creation
> and
> > ingestion in the Solr Server ? Easy and Straighforward !
> >
> > In the end at that point the RepositoryDocument will me only a Map of
> > metadata and values.
> > Content will be part of that, so I guess the conversion to a SolrDocument
> > will be immediate.
> >
> > Cheers
> >
> >
> > 2014-06-18 3:26 GMT+01:00 Karl Wright <da...@gmail.com>:
> >
> > > Hi Abe-san,
> > >
> > > Near as I can tell, the major consumer of disk space is the Maven
> target
> > > directories.  This is generating many tens of megabytes of temporary
> disk
> > > usage for every connector.  Luckily if you use ant, this is not a
> > problem.
> > >
> > > Karl
> > >
> > >
> > > On Tue, Jun 17, 2014 at 9:55 PM, Karl Wright <da...@gmail.com>
> wrote:
> > >
> > > > Hi Abe-san,
> > > >
> > > > Tika jars are not very big:
> > > >
> > > > C:\wip\mcf\trunk\lib>dir tika*
> > > >  Volume in drive C has no label.
> > > >  Volume Serial Number is 002E-D1F0
> > > >
> > > >  Directory of C:\wip\mcf\trunk\lib
> > > >
> > > > 06/05/2014  08:21 AM           493,374 tika-core.jar
> > > > 06/05/2014  08:21 AM           523,677 tika-parsers.jar
> > > >                2 File(s)      1,017,051 bytes
> > > >                0 Dir(s)  140,792,315,904 bytes free
> > > >
> > > > The entire lib directory is 85M:
> > > >
> > > > 85,156,330 bytes
> > > >
> > > > The built binary image is still about 185Mb, I believe.  So I don't
> > know
> > > > why you think it is >1Gb?  Temporary class files?  I don't think we
> can
> > > > avoid those.
> > > >
> > > > I'd rather not make things more complicated than they need to be by
> > > adding
> > > > a new required service - even though it would fit naturally with the
> > > > connector arrangement.
> > > >
> > > > Karl
> > > >
> > > >
> > > >
> > > >
> > > >
> > > > On Tue, Jun 17, 2014 at 9:42 PM, Shinichiro Abe <
> > > > shinichiro.abe.1@gmail.com> wrote:
> > > >
> > > >> Hi Karl,
> > > >>
> > > >> Okay, I assumed Tika connector outputs files.
> > > >> If we post character data metadata got from Tika, "/update/extract"
> > > >> handler
> > > >> can handle this(provides params:
> > > >> literal.content=value&literal.metaField=foobar
> > > >> with using NullInputStream for binary data like CONNECTORS-936).
> > > >>
> > > >> BTW, now trunk built size is too big(1G+). Maybe because CloudSearch
> > > >> connector uses Tika jars.
> > > >> Tika connector and CloudSearch connector should extract text via
> > > >> tika-server[1]
> > > >> and MCF should not have many Tika jars, do you think?
> > > >>
> > > >> [1]
> > > >> http://wiki.apache.org/tika/TikaJAXRS
> > > >>
> > > >> Thanks,
> > > >> Shinichiro Abe
> > > >>
> > > >> On 2014/06/18, at 9:45, Karl Wright <da...@gmail.com> wrote:
> > > >>
> > > >> > Hi Abe-san,
> > > >> >
> > > >> > It sounds like you might be thinking that transformation
> connectors
> > > are
> > > >> > like output connectors.  Just so we are clear, transformation
> > > >> connectors in
> > > >> > 1.7 receive a RepositoryDocument as input, and then pass a
> > > >> > RepositoryDocument on to the next connector in the chain.  So I
> > don't
> > > >> know
> > > >> > why .xml files would be involved.  I'd expect the Tika connector
> to
> > > >> read a
> > > >> > binary file from one RepositoryDocument object and convert its
> > > contents
> > > >> to
> > > >> > another RepositoryDocument object which would have character data
> > and
> > > >> > metadata only.  Would this work for your case, do you think?
> > > >> >
> > > >> > Karl
> > > >> >
> > > >> >
> > > >> >
> > > >> > On Tue, Jun 17, 2014 at 8:38 PM, Shinichiro Abe <
> > > >> shinichiro.abe.1@gmail.com>
> > > >> > wrote:
> > > >> >
> > > >> >> Hi Karl,
> > > >> >>
> > > >> >> Yes. I thought the standard update handler met that requirement.
> > > >> >> For instance, Tika extractor transformation connector creates two
> > > >> files.
> > > >> >> 1. addtoSolr.xml for add and update
> > > >> >> 2. deletetoSolr.xml for delete
> > > >> >> File connector ingests these xml files, then Solr connector posts
> > > these
> > > >> >> files by "/update" handler.
> > > >> >>
> > > >> >> In the the Solr Connector, other function as to update handler
> > > >> >> might not be necessary except for  "/update" handler.
> > > >> >>
> > > >> >> Thanks,
> > > >> >> Shinichiro Abe
> > > >> >>
> > > >> >> On 2014/06/18, at 8:02, Karl Wright <da...@gmail.com> wrote:
> > > >> >>
> > > >> >>> Hi Abe-san,
> > > >> >>>
> > > >> >>> So just to be sure -- you believe that no changes at all are
> > > required
> > > >> to
> > > >> >>> the Solr Connector as it stands now, other than to use the
> update
> > > >> handler
> > > >> >>> rather than the /update/extract handler?
> > > >> >>>
> > > >> >>> Karl
> > > >> >>>
> > > >> >>>
> > > >> >>>
> > > >> >>>
> > > >> >>>
> > > >> >>> On Tue, Jun 17, 2014 at 5:14 PM, Shinichiro Abe <
> > > >> >> shinichiro.abe.1@gmail.com>
> > > >> >>> wrote:
> > > >> >>>
> > > >> >>>>> As for changing the Solr connector so that it doesn't go to
> the
> > > >> >> extracting
> > > >> >>>> update handler
> > > >> >>>>
> > > >> >>>> I don't think it needs to change Solr connector with new
> checkbox
> > > >> >> because
> > > >> >>>> currently we can change "/update/extract" into "/update" at
> > 'Update
> > > >> >>>> Handler' at Paths tab in Solr connector UI. I confirmed I could
> > > post
> > > >> >> CSV,
> > > >> >>>> JSON and XML files to Solr by changing that and using File
> > > connector.
> > > >> >> So I
> > > >> >>>> wish we allow Tika extractor transformation connector to create
> > XML
> > > >> >> files
> > > >> >>>> that Solr expects to see.
> > > >> >>>>
> > > >> >>>> Regards,
> > > >> >>>> Shinichiro Abe
> > > >> >>>>
> > > >> >>>>
> > > >> >>>> 2014-06-18 2:55 GMT+09:00 Karl Wright <da...@gmail.com>:
> > > >> >>>>
> > > >> >>>>> The pipeline code itself is now "complete" in trunk.  Zaizi
> said
> > > >> they'd
> > > >> >>>>> contribute a Tika extractor transformation connector - and if
> > they
> > > >> >> don't
> > > >> >>>>> get around to that in a month or so, I may take a crack at it
> > > >> myself.
> > > >> >>>>>
> > > >> >>>>> As for changing the Solr connector so that it doesn't go to
> the
> > > >> >>>> extracting
> > > >> >>>>> update handler, it would be great if:
> > > >> >>>>> (1) Someone created a ticket for this, and
> > > >> >>>>> (2) A patch was provided that maintains backwards
> compatibility
> > > with
> > > >> >>>>> previous versions of the connector (so a checkbox would
> probably
> > > >> need
> > > >> >> to
> > > >> >>>> go
> > > >> >>>>> into the UI somewhere).  Do either of you want to start this
> > > >> process?
> > > >> >>>>>
> > > >> >>>>> Thanks!
> > > >> >>>>> Karl
> > > >> >>>>>
> > > >> >>>>>
> > > >> >>>>>
> > > >> >>>>> On Mon, Jun 16, 2014 at 12:37 PM, Karl Wright <
> > daddywri@gmail.com
> > > >
> > > >> >>>> wrote:
> > > >> >>>>>
> > > >> >>>>>> Hi guys,
> > > >> >>>>>>
> > > >> >>>>>> You folks may not have looked at 1.7 yet, but it has a full
> > > >> pipeline,
> > > >> >>>> and
> > > >> >>>>>> is expected to have a Tika extractor as a transformation
> > > connector.
> > > >> >>>>>>
> > > >> >>>>>> Karl
> > > >> >>>>>>
> > > >> >>>>>>
> > > >> >>>>>>
> > > >> >>>>>> On Mon, Jun 16, 2014 at 11:14 AM, Matteo Grolla <
> > > >> >>>>> m.grolla@sourcesense.com>
> > > >> >>>>>> wrote:
> > > >> >>>>>>
> > > >> >>>>>>> Thanks Alessandro,
> > > >> >>>>>>>       that explains the situation clearly.
> > > >> >>>>>>> And I agree that sending all the metadata as get parameter
> can
> > > be
> > > >> >>>>>>> problematic
> > > >> >>>>>>>
> > > >> >>>>>>> Cheers
> > > >> >>>>>>>
> > > >> >>>>>>> --
> > > >> >>>>>>> Matteo Grolla
> > > >> >>>>>>> Sourcesense - making sense of Open Source
> > > >> >>>>>>> http://www.sourcesense.com
> > > >> >>>>>>>
> > > >> >>>>>>> Il giorno 16/giu/2014, alle ore 17:09, Alessandro Benedetti
> ha
> > > >> >>>> scritto:
> > > >> >>>>>>>
> > > >> >>>>>>>> mmmm the point is that right now ManifoldCF has no
> > extractors.
> > > >> >>>>>>>> The Repository connectors extracts directly the binary and
> > > there
> > > >> is
> > > >> >>>> no
> > > >> >>>>>>>> "Extractor Processor" yet.
> > > >> >>>>>>>> But recently a pipe-line processor architecture has been
> > > thought
> > > >> (
> > > >> >>>>>>>> https://issues.apache.org/jira/browse/CONNECTORS-959)
> > > >> >>>>>>>> So can fit there.
> > > >> >>>>>>>>
> > > >> >>>>>>>> Cheers
> > > >> >>>>>>>>
> > > >> >>>>>>>>
> > > >> >>>>>>>> 2014-06-16 15:59 GMT+01:00 Matteo Grolla <
> > > >> m.grolla@sourcesense.com
> > > >> >>>>> :
> > > >> >>>>>>>>
> > > >> >>>>>>>>> Since Solr extracting request handler takes the binary and
> > > >> extracts
> > > >> >>>>>>> text
> > > >> >>>>>>>>> what is the point of not using Manifold extractor and send
> > > text
> > > >> and
> > > >> >>>>>>>>> binaries to solr?
> > > >> >>>>>>>>> I mean the end result is the same solr indexes text and
> > stores
> > > >> text
> > > >> >>>>>>>>> So if manifold supports text extraction it seems me this
> is
> > > the
> > > >> >>>> place
> > > >> >>>>>>>>> where it should be done
> > > >> >>>>>>>>>
> > > >> >>>>>>>>> --
> > > >> >>>>>>>>> Matteo Grolla
> > > >> >>>>>>>>> Sourcesense - making sense of Open Source
> > > >> >>>>>>>>> http://www.sourcesense.com
> > > >> >>>>>>>>>
> > > >> >>>>>>>>> Il giorno 16/giu/2014, alle ore 16:51, Antonio David Perez
> > > >> Morales
> > > >> >>>> ha
> > > >> >>>>>>>>> scritto:
> > > >> >>>>>>>>>
> > > >> >>>>>>>>>> Hi Matteo
> > > >> >>>>>>>>>>
> > > >> >>>>>>>>>> Manifold already handles the extraction, but the only way
> > to
> > > >> send
> > > >> >>>>>>> binary
> > > >> >>>>>>>>>> content and document metadata to Solr is using the
> > > >> update/extract
> > > >> >>>>>>>>> handler,
> > > >> >>>>>>>>>> where the metadata is sent as query parameters and the
> > binary
> > > >> >>>>> content
> > > >> >>>>>>> is
> > > >> >>>>>>>>>> sent in the body of the requests, allowing Solr to use
> Tika
> > > to
> > > >> >>>>> obtain
> > > >> >>>>>>> the
> > > >> >>>>>>>>>> raw content to be stored in Solr.
> > > >> >>>>>>>>>>
> > > >> >>>>>>>>>> Regards
> > > >> >>>>>>>>>>
> > > >> >>>>>>>>>>
> > > >> >>>>>>>>>> On Mon, Jun 16, 2014 at 4:35 PM, Matteo Grolla <
> > > >> >>>>>>> m.grolla@sourcesense.com
> > > >> >>>>>>>>>>
> > > >> >>>>>>>>>> wrote:
> > > >> >>>>>>>>>>
> > > >> >>>>>>>>>>> Hi During my first indexing I noticed that manifold uses
> > > Solr
> > > >> >>>>>>> extracting
> > > >> >>>>>>>>>>> request handler to extract the content of an xml file
> > > >> >>>>>>>>>>> For performance reasons it would be better if Manifold
> > > handled
> > > >> >>>> the
> > > >> >>>>>>>>>>> extraction letting Solr do the search engine
> > > >> >>>>>>>>>>> Is this because of the connector design, framework
> design
> > or
> > > >> just
> > > >> >>>>> to
> > > >> >>>>>>> be
> > > >> >>>>>>>>>>> done?
> > > >> >>>>>>>>>>>
> > > >> >>>>>>>>>>> --
> > > >> >>>>>>>>>>> Matteo Grolla
> > > >> >>>>>>>>>>> Sourcesense - making sense of Open Source
> > > >> >>>>>>>>>>> http://www.sourcesense.com
> > > >> >>>>>>>>>>>
> > > >> >>>>>>>>>>>
> > > >> >>>>>>>>>>
> > > >> >>>>>>>>>> --
> > > >> >>>>>>>>>>
> > > >> >>>>>>>>>> ------------------------------
> > > >> >>>>>>>>>> This message should be regarded as confidential. If you
> > have
> > > >> >>>>> received
> > > >> >>>>>>>>> this
> > > >> >>>>>>>>>> email in error please notify the sender and destroy it
> > > >> >>>> immediately.
> > > >> >>>>>>>>>> Statements of intent shall only become binding when
> > confirmed
> > > >> in
> > > >> >>>>> hard
> > > >> >>>>>>>>> copy
> > > >> >>>>>>>>>> by an authorised signatory.
> > > >> >>>>>>>>>>
> > > >> >>>>>>>>>> Zaizi Ltd is registered in England and Wales with the
> > > >> registration
> > > >> >>>>>>> number
> > > >> >>>>>>>>>> 6440931. The Registered Office is Brook House, 229
> > Shepherds
> > > >> Bush
> > > >> >>>>>>> Road,
> > > >> >>>>>>>>>> London W6 7AN.
> > > >> >>>>>>>>>
> > > >> >>>>>>>>>
> > > >> >>>>>>>>
> > > >> >>>>>>>>
> > > >> >>>>>>>> --
> > > >> >>>>>>>> --------------------------
> > > >> >>>>>>>>
> > > >> >>>>>>>> Benedetti Alessandro
> > > >> >>>>>>>> Visiting card : http://about.me/alessandro_benedetti
> > > >> >>>>>>>>
> > > >> >>>>>>>> "Tyger, tyger burning bright
> > > >> >>>>>>>> In the forests of the night,
> > > >> >>>>>>>> What immortal hand or eye
> > > >> >>>>>>>> Could frame thy fearful symmetry?"
> > > >> >>>>>>>>
> > > >> >>>>>>>> William Blake - Songs of Experience -1794 England
> > > >> >>>>>>>
> > > >> >>>>>>>
> > > >> >>>>>>
> > > >> >>>>>
> > > >> >>>>
> > > >> >>>>
> > > >> >>>>
> > > >> >>>> --
> > > >> >>>> - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
> > > >> >>>> Shinichiro Abe
> > > >> >>>> 阿部 慎一朗
> > > >> >>>>
> > > >> >>
> > > >> >>
> > > >>
> > > >>
> > > >
> > >
> >
> >
> >
> > --
> > --------------------------
> >
> > Benedetti Alessandro
> > Visiting card : http://about.me/alessandro_benedetti
> >
> > "Tyger, tyger burning bright
> > In the forests of the night,
> > What immortal hand or eye
> > Could frame thy fearful symmetry?"
> >
> > William Blake - Songs of Experience -1794 England
> >
>



-- 
--------------------------

Benedetti Alessandro
Visiting card : http://about.me/alessandro_benedetti

"Tyger, tyger burning bright
In the forests of the night,
What immortal hand or eye
Could frame thy fearful symmetry?"

William Blake - Songs of Experience -1794 England

Re: Solr Extracting request handler

Posted by Karl Wright <da...@gmail.com>.
Hi Alessandro,
What is your concrete proposal to change the Solr connector?  Bear in mind
that we do need to maintain backwards compatibility.  If you list your
specific changes, not in any huge detail, but with enough detail that we
understand your proposal, that would help.  What happens to the UI?  What
happens to the internals?

Thanks,
Karl



On Wed, Jun 18, 2014 at 9:21 AM, Alessandro Benedetti <
benedetti.alex85@gmail.com> wrote:

> But guys, why not simply pass to a classic SolrJ SolrDocument creation and
> ingestion in the Solr Server ? Easy and Straighforward !
>
> In the end at that point the RepositoryDocument will me only a Map of
> metadata and values.
> Content will be part of that, so I guess the conversion to a SolrDocument
> will be immediate.
>
> Cheers
>
>
> 2014-06-18 3:26 GMT+01:00 Karl Wright <da...@gmail.com>:
>
> > Hi Abe-san,
> >
> > Near as I can tell, the major consumer of disk space is the Maven target
> > directories.  This is generating many tens of megabytes of temporary disk
> > usage for every connector.  Luckily if you use ant, this is not a
> problem.
> >
> > Karl
> >
> >
> > On Tue, Jun 17, 2014 at 9:55 PM, Karl Wright <da...@gmail.com> wrote:
> >
> > > Hi Abe-san,
> > >
> > > Tika jars are not very big:
> > >
> > > C:\wip\mcf\trunk\lib>dir tika*
> > >  Volume in drive C has no label.
> > >  Volume Serial Number is 002E-D1F0
> > >
> > >  Directory of C:\wip\mcf\trunk\lib
> > >
> > > 06/05/2014  08:21 AM           493,374 tika-core.jar
> > > 06/05/2014  08:21 AM           523,677 tika-parsers.jar
> > >                2 File(s)      1,017,051 bytes
> > >                0 Dir(s)  140,792,315,904 bytes free
> > >
> > > The entire lib directory is 85M:
> > >
> > > 85,156,330 bytes
> > >
> > > The built binary image is still about 185Mb, I believe.  So I don't
> know
> > > why you think it is >1Gb?  Temporary class files?  I don't think we can
> > > avoid those.
> > >
> > > I'd rather not make things more complicated than they need to be by
> > adding
> > > a new required service - even though it would fit naturally with the
> > > connector arrangement.
> > >
> > > Karl
> > >
> > >
> > >
> > >
> > >
> > > On Tue, Jun 17, 2014 at 9:42 PM, Shinichiro Abe <
> > > shinichiro.abe.1@gmail.com> wrote:
> > >
> > >> Hi Karl,
> > >>
> > >> Okay, I assumed Tika connector outputs files.
> > >> If we post character data metadata got from Tika, "/update/extract"
> > >> handler
> > >> can handle this(provides params:
> > >> literal.content=value&literal.metaField=foobar
> > >> with using NullInputStream for binary data like CONNECTORS-936).
> > >>
> > >> BTW, now trunk built size is too big(1G+). Maybe because CloudSearch
> > >> connector uses Tika jars.
> > >> Tika connector and CloudSearch connector should extract text via
> > >> tika-server[1]
> > >> and MCF should not have many Tika jars, do you think?
> > >>
> > >> [1]
> > >> http://wiki.apache.org/tika/TikaJAXRS
> > >>
> > >> Thanks,
> > >> Shinichiro Abe
> > >>
> > >> On 2014/06/18, at 9:45, Karl Wright <da...@gmail.com> wrote:
> > >>
> > >> > Hi Abe-san,
> > >> >
> > >> > It sounds like you might be thinking that transformation connectors
> > are
> > >> > like output connectors.  Just so we are clear, transformation
> > >> connectors in
> > >> > 1.7 receive a RepositoryDocument as input, and then pass a
> > >> > RepositoryDocument on to the next connector in the chain.  So I
> don't
> > >> know
> > >> > why .xml files would be involved.  I'd expect the Tika connector to
> > >> read a
> > >> > binary file from one RepositoryDocument object and convert its
> > contents
> > >> to
> > >> > another RepositoryDocument object which would have character data
> and
> > >> > metadata only.  Would this work for your case, do you think?
> > >> >
> > >> > Karl
> > >> >
> > >> >
> > >> >
> > >> > On Tue, Jun 17, 2014 at 8:38 PM, Shinichiro Abe <
> > >> shinichiro.abe.1@gmail.com>
> > >> > wrote:
> > >> >
> > >> >> Hi Karl,
> > >> >>
> > >> >> Yes. I thought the standard update handler met that requirement.
> > >> >> For instance, Tika extractor transformation connector creates two
> > >> files.
> > >> >> 1. addtoSolr.xml for add and update
> > >> >> 2. deletetoSolr.xml for delete
> > >> >> File connector ingests these xml files, then Solr connector posts
> > these
> > >> >> files by "/update" handler.
> > >> >>
> > >> >> In the the Solr Connector, other function as to update handler
> > >> >> might not be necessary except for  "/update" handler.
> > >> >>
> > >> >> Thanks,
> > >> >> Shinichiro Abe
> > >> >>
> > >> >> On 2014/06/18, at 8:02, Karl Wright <da...@gmail.com> wrote:
> > >> >>
> > >> >>> Hi Abe-san,
> > >> >>>
> > >> >>> So just to be sure -- you believe that no changes at all are
> > required
> > >> to
> > >> >>> the Solr Connector as it stands now, other than to use the update
> > >> handler
> > >> >>> rather than the /update/extract handler?
> > >> >>>
> > >> >>> Karl
> > >> >>>
> > >> >>>
> > >> >>>
> > >> >>>
> > >> >>>
> > >> >>> On Tue, Jun 17, 2014 at 5:14 PM, Shinichiro Abe <
> > >> >> shinichiro.abe.1@gmail.com>
> > >> >>> wrote:
> > >> >>>
> > >> >>>>> As for changing the Solr connector so that it doesn't go to the
> > >> >> extracting
> > >> >>>> update handler
> > >> >>>>
> > >> >>>> I don't think it needs to change Solr connector with new checkbox
> > >> >> because
> > >> >>>> currently we can change "/update/extract" into "/update" at
> 'Update
> > >> >>>> Handler' at Paths tab in Solr connector UI. I confirmed I could
> > post
> > >> >> CSV,
> > >> >>>> JSON and XML files to Solr by changing that and using File
> > connector.
> > >> >> So I
> > >> >>>> wish we allow Tika extractor transformation connector to create
> XML
> > >> >> files
> > >> >>>> that Solr expects to see.
> > >> >>>>
> > >> >>>> Regards,
> > >> >>>> Shinichiro Abe
> > >> >>>>
> > >> >>>>
> > >> >>>> 2014-06-18 2:55 GMT+09:00 Karl Wright <da...@gmail.com>:
> > >> >>>>
> > >> >>>>> The pipeline code itself is now "complete" in trunk.  Zaizi said
> > >> they'd
> > >> >>>>> contribute a Tika extractor transformation connector - and if
> they
> > >> >> don't
> > >> >>>>> get around to that in a month or so, I may take a crack at it
> > >> myself.
> > >> >>>>>
> > >> >>>>> As for changing the Solr connector so that it doesn't go to the
> > >> >>>> extracting
> > >> >>>>> update handler, it would be great if:
> > >> >>>>> (1) Someone created a ticket for this, and
> > >> >>>>> (2) A patch was provided that maintains backwards compatibility
> > with
> > >> >>>>> previous versions of the connector (so a checkbox would probably
> > >> need
> > >> >> to
> > >> >>>> go
> > >> >>>>> into the UI somewhere).  Do either of you want to start this
> > >> process?
> > >> >>>>>
> > >> >>>>> Thanks!
> > >> >>>>> Karl
> > >> >>>>>
> > >> >>>>>
> > >> >>>>>
> > >> >>>>> On Mon, Jun 16, 2014 at 12:37 PM, Karl Wright <
> daddywri@gmail.com
> > >
> > >> >>>> wrote:
> > >> >>>>>
> > >> >>>>>> Hi guys,
> > >> >>>>>>
> > >> >>>>>> You folks may not have looked at 1.7 yet, but it has a full
> > >> pipeline,
> > >> >>>> and
> > >> >>>>>> is expected to have a Tika extractor as a transformation
> > connector.
> > >> >>>>>>
> > >> >>>>>> Karl
> > >> >>>>>>
> > >> >>>>>>
> > >> >>>>>>
> > >> >>>>>> On Mon, Jun 16, 2014 at 11:14 AM, Matteo Grolla <
> > >> >>>>> m.grolla@sourcesense.com>
> > >> >>>>>> wrote:
> > >> >>>>>>
> > >> >>>>>>> Thanks Alessandro,
> > >> >>>>>>>       that explains the situation clearly.
> > >> >>>>>>> And I agree that sending all the metadata as get parameter can
> > be
> > >> >>>>>>> problematic
> > >> >>>>>>>
> > >> >>>>>>> Cheers
> > >> >>>>>>>
> > >> >>>>>>> --
> > >> >>>>>>> Matteo Grolla
> > >> >>>>>>> Sourcesense - making sense of Open Source
> > >> >>>>>>> http://www.sourcesense.com
> > >> >>>>>>>
> > >> >>>>>>> Il giorno 16/giu/2014, alle ore 17:09, Alessandro Benedetti ha
> > >> >>>> scritto:
> > >> >>>>>>>
> > >> >>>>>>>> mmmm the point is that right now ManifoldCF has no
> extractors.
> > >> >>>>>>>> The Repository connectors extracts directly the binary and
> > there
> > >> is
> > >> >>>> no
> > >> >>>>>>>> "Extractor Processor" yet.
> > >> >>>>>>>> But recently a pipe-line processor architecture has been
> > thought
> > >> (
> > >> >>>>>>>> https://issues.apache.org/jira/browse/CONNECTORS-959)
> > >> >>>>>>>> So can fit there.
> > >> >>>>>>>>
> > >> >>>>>>>> Cheers
> > >> >>>>>>>>
> > >> >>>>>>>>
> > >> >>>>>>>> 2014-06-16 15:59 GMT+01:00 Matteo Grolla <
> > >> m.grolla@sourcesense.com
> > >> >>>>> :
> > >> >>>>>>>>
> > >> >>>>>>>>> Since Solr extracting request handler takes the binary and
> > >> extracts
> > >> >>>>>>> text
> > >> >>>>>>>>> what is the point of not using Manifold extractor and send
> > text
> > >> and
> > >> >>>>>>>>> binaries to solr?
> > >> >>>>>>>>> I mean the end result is the same solr indexes text and
> stores
> > >> text
> > >> >>>>>>>>> So if manifold supports text extraction it seems me this is
> > the
> > >> >>>> place
> > >> >>>>>>>>> where it should be done
> > >> >>>>>>>>>
> > >> >>>>>>>>> --
> > >> >>>>>>>>> Matteo Grolla
> > >> >>>>>>>>> Sourcesense - making sense of Open Source
> > >> >>>>>>>>> http://www.sourcesense.com
> > >> >>>>>>>>>
> > >> >>>>>>>>> Il giorno 16/giu/2014, alle ore 16:51, Antonio David Perez
> > >> Morales
> > >> >>>> ha
> > >> >>>>>>>>> scritto:
> > >> >>>>>>>>>
> > >> >>>>>>>>>> Hi Matteo
> > >> >>>>>>>>>>
> > >> >>>>>>>>>> Manifold already handles the extraction, but the only way
> to
> > >> send
> > >> >>>>>>> binary
> > >> >>>>>>>>>> content and document metadata to Solr is using the
> > >> update/extract
> > >> >>>>>>>>> handler,
> > >> >>>>>>>>>> where the metadata is sent as query parameters and the
> binary
> > >> >>>>> content
> > >> >>>>>>> is
> > >> >>>>>>>>>> sent in the body of the requests, allowing Solr to use Tika
> > to
> > >> >>>>> obtain
> > >> >>>>>>> the
> > >> >>>>>>>>>> raw content to be stored in Solr.
> > >> >>>>>>>>>>
> > >> >>>>>>>>>> Regards
> > >> >>>>>>>>>>
> > >> >>>>>>>>>>
> > >> >>>>>>>>>> On Mon, Jun 16, 2014 at 4:35 PM, Matteo Grolla <
> > >> >>>>>>> m.grolla@sourcesense.com
> > >> >>>>>>>>>>
> > >> >>>>>>>>>> wrote:
> > >> >>>>>>>>>>
> > >> >>>>>>>>>>> Hi During my first indexing I noticed that manifold uses
> > Solr
> > >> >>>>>>> extracting
> > >> >>>>>>>>>>> request handler to extract the content of an xml file
> > >> >>>>>>>>>>> For performance reasons it would be better if Manifold
> > handled
> > >> >>>> the
> > >> >>>>>>>>>>> extraction letting Solr do the search engine
> > >> >>>>>>>>>>> Is this because of the connector design, framework design
> or
> > >> just
> > >> >>>>> to
> > >> >>>>>>> be
> > >> >>>>>>>>>>> done?
> > >> >>>>>>>>>>>
> > >> >>>>>>>>>>> --
> > >> >>>>>>>>>>> Matteo Grolla
> > >> >>>>>>>>>>> Sourcesense - making sense of Open Source
> > >> >>>>>>>>>>> http://www.sourcesense.com
> > >> >>>>>>>>>>>
> > >> >>>>>>>>>>>
> > >> >>>>>>>>>>
> > >> >>>>>>>>>> --
> > >> >>>>>>>>>>
> > >> >>>>>>>>>> ------------------------------
> > >> >>>>>>>>>> This message should be regarded as confidential. If you
> have
> > >> >>>>> received
> > >> >>>>>>>>> this
> > >> >>>>>>>>>> email in error please notify the sender and destroy it
> > >> >>>> immediately.
> > >> >>>>>>>>>> Statements of intent shall only become binding when
> confirmed
> > >> in
> > >> >>>>> hard
> > >> >>>>>>>>> copy
> > >> >>>>>>>>>> by an authorised signatory.
> > >> >>>>>>>>>>
> > >> >>>>>>>>>> Zaizi Ltd is registered in England and Wales with the
> > >> registration
> > >> >>>>>>> number
> > >> >>>>>>>>>> 6440931. The Registered Office is Brook House, 229
> Shepherds
> > >> Bush
> > >> >>>>>>> Road,
> > >> >>>>>>>>>> London W6 7AN.
> > >> >>>>>>>>>
> > >> >>>>>>>>>
> > >> >>>>>>>>
> > >> >>>>>>>>
> > >> >>>>>>>> --
> > >> >>>>>>>> --------------------------
> > >> >>>>>>>>
> > >> >>>>>>>> Benedetti Alessandro
> > >> >>>>>>>> Visiting card : http://about.me/alessandro_benedetti
> > >> >>>>>>>>
> > >> >>>>>>>> "Tyger, tyger burning bright
> > >> >>>>>>>> In the forests of the night,
> > >> >>>>>>>> What immortal hand or eye
> > >> >>>>>>>> Could frame thy fearful symmetry?"
> > >> >>>>>>>>
> > >> >>>>>>>> William Blake - Songs of Experience -1794 England
> > >> >>>>>>>
> > >> >>>>>>>
> > >> >>>>>>
> > >> >>>>>
> > >> >>>>
> > >> >>>>
> > >> >>>>
> > >> >>>> --
> > >> >>>> - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
> > >> >>>> Shinichiro Abe
> > >> >>>> 阿部 慎一朗
> > >> >>>>
> > >> >>
> > >> >>
> > >>
> > >>
> > >
> >
>
>
>
> --
> --------------------------
>
> Benedetti Alessandro
> Visiting card : http://about.me/alessandro_benedetti
>
> "Tyger, tyger burning bright
> In the forests of the night,
> What immortal hand or eye
> Could frame thy fearful symmetry?"
>
> William Blake - Songs of Experience -1794 England
>

Re: Solr Extracting request handler

Posted by Alessandro Benedetti <be...@gmail.com>.
But guys, why not simply switch to classic SolrJ SolrDocument creation and
ingestion into the Solr Server? Easy and straightforward!

In the end, at that point the RepositoryDocument will be only a Map of
metadata and values.
Content will be part of that, so I guess the conversion to a SolrDocument
will be immediate.

Cheers


2014-06-18 3:26 GMT+01:00 Karl Wright <da...@gmail.com>:

> Hi Abe-san,
>
> Near as I can tell, the major consumer of disk space is the Maven target
> directories.  This is generating many tens of megabytes of temporary disk
> usage for every connector.  Luckily if you use ant, this is not a problem.
>
> Karl
>
>
> On Tue, Jun 17, 2014 at 9:55 PM, Karl Wright <da...@gmail.com> wrote:
>
> > Hi Abe-san,
> >
> > Tika jars are not very big:
> >
> > C:\wip\mcf\trunk\lib>dir tika*
> >  Volume in drive C has no label.
> >  Volume Serial Number is 002E-D1F0
> >
> >  Directory of C:\wip\mcf\trunk\lib
> >
> > 06/05/2014  08:21 AM           493,374 tika-core.jar
> > 06/05/2014  08:21 AM           523,677 tika-parsers.jar
> >                2 File(s)      1,017,051 bytes
> >                0 Dir(s)  140,792,315,904 bytes free
> >
> > The entire lib directory is 85M:
> >
> > 85,156,330 bytes
> >
> > The built binary image is still about 185Mb, I believe.  So I don't know
> > why you think it is >1Gb?  Temporary class files?  I don't think we can
> > avoid those.
> >
> > I'd rather not make things more complicated than they need to be by
> adding
> > a new required service - even though it would fit naturally with the
> > connector arrangement.
> >
> > Karl
> >
> >
> >
> >
> >
> > On Tue, Jun 17, 2014 at 9:42 PM, Shinichiro Abe <
> > shinichiro.abe.1@gmail.com> wrote:
> >
> >> Hi Karl,
> >>
> >> Okay, I assumed Tika connector outputs files.
> >> If we post character data metadata got from Tika, "/update/extract"
> >> handler
> >> can handle this(provides params:
> >> literal.content=value&literal.metaField=foobar
> >> with using NullInputStream for binary data like CONNECTORS-936).
> >>
> >> BTW, now trunk built size is too big(1G+). Maybe because CloudSearch
> >> connector uses Tika jars.
> >> Tika connector and CloudSearch connector should extract text via
> >> tika-server[1]
> >> and MCF should not have many Tika jars, do you think?
> >>
> >> [1]
> >> http://wiki.apache.org/tika/TikaJAXRS
> >>
> >> Thanks,
> >> Shinichiro Abe
> >>
> >> On 2014/06/18, at 9:45, Karl Wright <da...@gmail.com> wrote:
> >>
> >> > Hi Abe-san,
> >> >
> >> > It sounds like you might be thinking that transformation connectors
> are
> >> > like output connectors.  Just so we are clear, transformation
> >> connectors in
> >> > 1.7 receive a RepositoryDocument as input, and then pass a
> >> > RepositoryDocument on to the next connector in the chain.  So I don't
> >> know
> >> > why .xml files would be involved.  I'd expect the Tika connector to
> >> read a
> >> > binary file from one RepositoryDocument object and convert its
> contents
> >> to
> >> > another RepositoryDocument object which would have character data and
> >> > metadata only.  Would this work for your case, do you think?
> >> >
> >> > Karl
> >> >
> >> >
> >> >
> >> > On Tue, Jun 17, 2014 at 8:38 PM, Shinichiro Abe <
> >> shinichiro.abe.1@gmail.com>
> >> > wrote:
> >> >
> >> >> Hi Karl,
> >> >>
> >> >> Yes. I thought the standard update handler met that requirement.
> >> >> For instance, Tika extractor transformation connector creates two
> >> files.
> >> >> 1. addtoSolr.xml for add and update
> >> >> 2. deletetoSolr.xml for delete
> >> >> File connector ingests these xml files, then Solr connector posts
> these
> >> >> files by "/update" handler.
> >> >>
> >> >> In the the Solr Connector, other function as to update handler
> >> >> might not be necessary except for  "/update" handler.
> >> >>
> >> >> Thanks,
> >> >> Shinichiro Abe
> >> >>
> >> >> On 2014/06/18, at 8:02, Karl Wright <da...@gmail.com> wrote:
> >> >>
> >> >>> Hi Abe-san,
> >> >>>
> >> >>> So just to be sure -- you believe that no changes at all are
> required
> >> to
> >> >>> the Solr Connector as it stands now, other than to use the update
> >> handler
> >> >>> rather than the /update/extract handler?
> >> >>>
> >> >>> Karl
> >> >>>
> >> >>>
> >> >>>
> >> >>>
> >> >>>
> >> >>> On Tue, Jun 17, 2014 at 5:14 PM, Shinichiro Abe <
> >> >> shinichiro.abe.1@gmail.com>
> >> >>> wrote:
> >> >>>
> >> >>>>> As for changing the Solr connector so that it doesn't go to the
> >> >> extracting
> >> >>>> update handler
> >> >>>>
> >> >>>> I don't think it needs to change Solr connector with new checkbox
> >> >> because
> >> >>>> currently we can change "/update/extract" into "/update" at 'Update
> >> >>>> Handler' at Paths tab in Solr connector UI. I confirmed I could
> post
> >> >> CSV,
> >> >>>> JSON and XML files to Solr by changing that and using File
> connector.
> >> >> So I
> >> >>>> wish we allow Tika extractor transformation connector to create XML
> >> >> files
> >> >>>> that Solr expects to see.
> >> >>>>
> >> >>>> Regards,
> >> >>>> Shinichiro Abe
> >> >>>>
> >> >>>>
> >> >>>> 2014-06-18 2:55 GMT+09:00 Karl Wright <da...@gmail.com>:
> >> >>>>
> >> >>>>> The pipeline code itself is now "complete" in trunk.  Zaizi said
> >> they'd
> >> >>>>> contribute a Tika extractor transformation connector - and if they
> >> >> don't
> >> >>>>> get around to that in a month or so, I may take a crack at it
> >> myself.
> >> >>>>>
> >> >>>>> As for changing the Solr connector so that it doesn't go to the
> >> >>>> extracting
> >> >>>>> update handler, it would be great if:
> >> >>>>> (1) Someone created a ticket for this, and
> >> >>>>> (2) A patch was provided that maintains backwards compatibility
> with
> >> >>>>> previous versions of the connector (so a checkbox would probably
> >> need
> >> >> to
> >> >>>> go
> >> >>>>> into the UI somewhere).  Do either of you want to start this
> >> process?
> >> >>>>>
> >> >>>>> Thanks!
> >> >>>>> Karl
> >> >>>>>
> >> >>>>>
> >> >>>>>
> >> >>>>> On Mon, Jun 16, 2014 at 12:37 PM, Karl Wright <daddywri@gmail.com
> >
> >> >>>> wrote:
> >> >>>>>
> >> >>>>>> Hi guys,
> >> >>>>>>
> >> >>>>>> You folks may not have looked at 1.7 yet, but it has a full
> >> pipeline,
> >> >>>> and
> >> >>>>>> is expected to have a Tika extractor as a transformation
> connector.
> >> >>>>>>
> >> >>>>>> Karl
> >> >>>>>>
> >> >>>>>>
> >> >>>>>>
> >> >>>>>> On Mon, Jun 16, 2014 at 11:14 AM, Matteo Grolla <
> >> >>>>> m.grolla@sourcesense.com>
> >> >>>>>> wrote:
> >> >>>>>>
> >> >>>>>>> Thanks Alessandro,
> >> >>>>>>>       that explains the situation clearly.
> >> >>>>>>> And I agree that sending all the metadata as get parameter can
> be
> >> >>>>>>> problematic
> >> >>>>>>>
> >> >>>>>>> Cheers
> >> >>>>>>>
> >> >>>>>>> --
> >> >>>>>>> Matteo Grolla
> >> >>>>>>> Sourcesense - making sense of Open Source
> >> >>>>>>> http://www.sourcesense.com
> >> >>>>>>>
> >> >>>>>>> Il giorno 16/giu/2014, alle ore 17:09, Alessandro Benedetti ha
> >> >>>> scritto:
> >> >>>>>>>
> >> >>>>>>>> mmmm the point is that right now ManifoldCF has no extractors.
> >> >>>>>>>> The Repository connectors extracts directly the binary and
> there
> >> is
> >> >>>> no
> >> >>>>>>>> "Extractor Processor" yet.
> >> >>>>>>>> But recently a pipe-line processor architecture has been
> thought
> >> (
> >> >>>>>>>> https://issues.apache.org/jira/browse/CONNECTORS-959)
> >> >>>>>>>> So can fit there.
> >> >>>>>>>>
> >> >>>>>>>> Cheers
> >> >>>>>>>>
> >> >>>>>>>>
> >> >>>>>>>> 2014-06-16 15:59 GMT+01:00 Matteo Grolla <
> >> m.grolla@sourcesense.com
> >> >>>>> :
> >> >>>>>>>>
> >> >>>>>>>>> Since Solr extracting request handler takes the binary and
> >> extracts
> >> >>>>>>> text
> >> >>>>>>>>> what is the point of not using Manifold extractor and send
> text
> >> and
> >> >>>>>>>>> binaries to solr?
> >> >>>>>>>>> I mean the end result is the same solr indexes text and stores
> >> text
> >> >>>>>>>>> So if manifold supports text extraction it seems me this is
> the
> >> >>>> place
> >> >>>>>>>>> where it should be done
> >> >>>>>>>>>
> >> >>>>>>>>> --
> >> >>>>>>>>> Matteo Grolla
> >> >>>>>>>>> Sourcesense - making sense of Open Source
> >> >>>>>>>>> http://www.sourcesense.com
> >> >>>>>>>>>
> >> >>>>>>>>> Il giorno 16/giu/2014, alle ore 16:51, Antonio David Perez
> >> Morales
> >> >>>> ha
> >> >>>>>>>>> scritto:
> >> >>>>>>>>>
> >> >>>>>>>>>> Hi Matteo
> >> >>>>>>>>>>
> >> >>>>>>>>>> Manifold already handles the extraction, but the only way to
> >> send
> >> >>>>>>> binary
> >> >>>>>>>>>> content and document metadata to Solr is using the
> >> update/extract
> >> >>>>>>>>> handler,
> >> >>>>>>>>>> where the metadata is sent as query parameters and the binary
> >> >>>>> content
> >> >>>>>>> is
> >> >>>>>>>>>> sent in the body of the requests, allowing Solr to use Tika
> to
> >> >>>>> obtain
> >> >>>>>>> the
> >> >>>>>>>>>> raw content to be stored in Solr.
> >> >>>>>>>>>>
> >> >>>>>>>>>> Regards
> >> >>>>>>>>>>
> >> >>>>>>>>>>
> >> >>>>>>>>>> On Mon, Jun 16, 2014 at 4:35 PM, Matteo Grolla <
> >> >>>>>>> m.grolla@sourcesense.com
> >> >>>>>>>>>>
> >> >>>>>>>>>> wrote:
> >> >>>>>>>>>>
> >> >>>>>>>>>>> Hi During my first indexing I noticed that manifold uses
> Solr
> >> >>>>>>> extracting
> >> >>>>>>>>>>> request handler to extract the content of an xml file
> >> >>>>>>>>>>> For performance reasons it would be better if Manifold
> handled
> >> >>>> the
> >> >>>>>>>>>>> extraction letting Solr do the search engine
> >> >>>>>>>>>>> Is this because of the connector design, framework design or
> >> just
> >> >>>>> to
> >> >>>>>>> be
> >> >>>>>>>>>>> done?
> >> >>>>>>>>>>>
> >> >>>>>>>>>>> --
> >> >>>>>>>>>>> Matteo Grolla
> >> >>>>>>>>>>> Sourcesense - making sense of Open Source
> >> >>>>>>>>>>> http://www.sourcesense.com
> >> >>>>>>>>>>>
> >> >>>>>>>>>>>
> >> >>>>>>>>>>
> >> >>>>>>>>>> --
> >> >>>>>>>>>>
> >> >>>>>>>>>> ------------------------------
> >> >>>>>>>>>> This message should be regarded as confidential. If you have
> >> >>>>> received
> >> >>>>>>>>> this
> >> >>>>>>>>>> email in error please notify the sender and destroy it
> >> >>>> immediately.
> >> >>>>>>>>>> Statements of intent shall only become binding when confirmed
> >> in
> >> >>>>> hard
> >> >>>>>>>>> copy
> >> >>>>>>>>>> by an authorised signatory.
> >> >>>>>>>>>>
> >> >>>>>>>>>> Zaizi Ltd is registered in England and Wales with the
> >> registration
> >> >>>>>>> number
> >> >>>>>>>>>> 6440931. The Registered Office is Brook House, 229 Shepherds
> >> Bush
> >> >>>>>>> Road,
> >> >>>>>>>>>> London W6 7AN.
> >> >>>>>>>>>
> >> >>>>>>>>>
> >> >>>>>>>>
> >> >>>>>>>>
> >> >>>>>>>> --
> >> >>>>>>>> --------------------------
> >> >>>>>>>>
> >> >>>>>>>> Benedetti Alessandro
> >> >>>>>>>> Visiting card : http://about.me/alessandro_benedetti
> >> >>>>>>>>
> >> >>>>>>>> "Tyger, tyger burning bright
> >> >>>>>>>> In the forests of the night,
> >> >>>>>>>> What immortal hand or eye
> >> >>>>>>>> Could frame thy fearful symmetry?"
> >> >>>>>>>>
> >> >>>>>>>> William Blake - Songs of Experience -1794 England
> >> >>>>>>>
> >> >>>>>>>
> >> >>>>>>
> >> >>>>>
> >> >>>>
> >> >>>>
> >> >>>>
> >> >>>> --
> >> >>>> - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
> >> >>>> Shinichiro Abe
> >> >>>> 阿部 慎一朗
> >> >>>>
> >> >>
> >> >>
> >>
> >>
> >
>



-- 
--------------------------

Benedetti Alessandro
Visiting card : http://about.me/alessandro_benedetti

"Tyger, tyger burning bright
In the forests of the night,
What immortal hand or eye
Could frame thy fearful symmetry?"

William Blake - Songs of Experience -1794 England

Re: Solr Extracting request handler

Posted by Karl Wright <da...@gmail.com>.
Hi Abe-san,

As near as I can tell, the major consumer of disk space is the Maven target
directories.  These generate many tens of megabytes of temporary disk
usage for every connector.  Luckily, if you build with ant, this is not a problem.

Karl


On Tue, Jun 17, 2014 at 9:55 PM, Karl Wright <da...@gmail.com> wrote:

> Hi Abe-san,
>
> Tika jars are not very big:
>
> C:\wip\mcf\trunk\lib>dir tika*
>  Volume in drive C has no label.
>  Volume Serial Number is 002E-D1F0
>
>  Directory of C:\wip\mcf\trunk\lib
>
> 06/05/2014  08:21 AM           493,374 tika-core.jar
> 06/05/2014  08:21 AM           523,677 tika-parsers.jar
>                2 File(s)      1,017,051 bytes
>                0 Dir(s)  140,792,315,904 bytes free
>
> The entire lib directory is 85M:
>
> 85,156,330 bytes
>
> The built binary image is still about 185Mb, I believe.  So I don't know
> why you think it is >1Gb?  Temporary class files?  I don't think we can
> avoid those.
>
> I'd rather not make things more complicated than they need to be by adding
> a new required service - even though it would fit naturally with the
> connector arrangement.
>
> Karl
>
>
>
>
>
> On Tue, Jun 17, 2014 at 9:42 PM, Shinichiro Abe <
> shinichiro.abe.1@gmail.com> wrote:
>
>> Hi Karl,
>>
>> Okay, I assumed Tika connector outputs files.
>> If we post character data metadata got from Tika, "/update/extract"
>> handler
>> can handle this(provides params:
>> literal.content=value&literal.metaField=foobar
>> with using NullInputStream for binary data like CONNECTORS-936).
>>
>> BTW, now trunk built size is too big(1G+). Maybe because CloudSearch
>> connector uses Tika jars.
>> Tika connector and CloudSearch connector should extract text via
>> tika-server[1]
>> and MCF should not have many Tika jars, do you think?
>>
>> [1]
>> http://wiki.apache.org/tika/TikaJAXRS
>>
>> Thanks,
>> Shinichiro Abe
>>
>> On 2014/06/18, at 9:45, Karl Wright <da...@gmail.com> wrote:
>>
>> > Hi Abe-san,
>> >
>> > It sounds like you might be thinking that transformation connectors are
>> > like output connectors.  Just so we are clear, transformation
>> connectors in
>> > 1.7 receive a RepositoryDocument as input, and then pass a
>> > RepositoryDocument on to the next connector in the chain.  So I don't
>> know
>> > why .xml files would be involved.  I'd expect the Tika connector to
>> read a
>> > binary file from one RepositoryDocument object and convert its contents
>> to
>> > another RepositoryDocument object which would have character data and
>> > metadata only.  Would this work for your case, do you think?
>> >
>> > Karl
>> >
>> >
>> >
>> > On Tue, Jun 17, 2014 at 8:38 PM, Shinichiro Abe <
>> shinichiro.abe.1@gmail.com>
>> > wrote:
>> >
>> >> Hi Karl,
>> >>
>> >> Yes. I thought the standard update handler met that requirement.
>> >> For instance, Tika extractor transformation connector creates two
>> files.
>> >> 1. addtoSolr.xml for add and update
>> >> 2. deletetoSolr.xml for delete
>> >> File connector ingests these xml files, then Solr connector posts these
>> >> files by "/update" handler.
>> >>
>> >> In the the Solr Connector, other function as to update handler
>> >> might not be necessary except for  "/update" handler.
>> >>
>> >> Thanks,
>> >> Shinichiro Abe
>> >>
>> >> On 2014/06/18, at 8:02, Karl Wright <da...@gmail.com> wrote:
>> >>
>> >>> Hi Abe-san,
>> >>>
>> >>> So just to be sure -- you believe that no changes at all are required
>> to
>> >>> the Solr Connector as it stands now, other than to use the update
>> handler
>> >>> rather than the /update/extract handler?
>> >>>
>> >>> Karl
>> >>>
>> >>>
>> >>>
>> >>>
>> >>>
>> >>> On Tue, Jun 17, 2014 at 5:14 PM, Shinichiro Abe <
>> >> shinichiro.abe.1@gmail.com>
>> >>> wrote:
>> >>>
>> >>>>> As for changing the Solr connector so that it doesn't go to the
>> >> extracting
>> >>>> update handler
>> >>>>
>> >>>> I don't think it needs to change Solr connector with new checkbox
>> >> because
>> >>>> currently we can change "/update/extract" into "/update" at 'Update
>> >>>> Handler' at Paths tab in Solr connector UI. I confirmed I could post
>> >> CSV,
>> >>>> JSON and XML files to Solr by changing that and using File connector.
>> >> So I
>> >>>> wish we allow Tika extractor transformation connector to create XML
>> >> files
>> >>>> that Solr expects to see.
>> >>>>
>> >>>> Regards,
>> >>>> Shinichiro Abe
>> >>>>
>> >>>>
>> >>>> 2014-06-18 2:55 GMT+09:00 Karl Wright <da...@gmail.com>:
>> >>>>
>> >>>>> The pipeline code itself is now "complete" in trunk.  Zaizi said
>> they'd
>> >>>>> contribute a Tika extractor transformation connector - and if they
>> >> don't
>> >>>>> get around to that in a month or so, I may take a crack at it
>> myself.
>> >>>>>
>> >>>>> As for changing the Solr connector so that it doesn't go to the
>> >>>> extracting
>> >>>>> update handler, it would be great if:
>> >>>>> (1) Someone created a ticket for this, and
>> >>>>> (2) A patch was provided that maintains backwards compatibility with
>> >>>>> previous versions of the connector (so a checkbox would probably
>> need
>> >> to
>> >>>> go
>> >>>>> into the UI somewhere).  Do either of you want to start this
>> process?
>> >>>>>
>> >>>>> Thanks!
>> >>>>> Karl
>> >>>>>
>> >>>>>
>> >>>>>
>> >>>>> On Mon, Jun 16, 2014 at 12:37 PM, Karl Wright <da...@gmail.com>
>> >>>> wrote:
>> >>>>>
>> >>>>>> Hi guys,
>> >>>>>>
>> >>>>>> You folks may not have looked at 1.7 yet, but it has a full
>> pipeline,
>> >>>> and
>> >>>>>> is expected to have a Tika extractor as a transformation connector.
>> >>>>>>
>> >>>>>> Karl
>> >>>>>>
>> >>>>>>
>> >>>>>>
>> >>>>>> On Mon, Jun 16, 2014 at 11:14 AM, Matteo Grolla <
>> >>>>> m.grolla@sourcesense.com>
>> >>>>>> wrote:
>> >>>>>>
>> >>>>>>> Thanks Alessandro,
>> >>>>>>>       that explains the situation clearly.
>> >>>>>>> And I agree that sending all the metadata as get parameter can be
>> >>>>>>> problematic
>> >>>>>>>
>> >>>>>>> Cheers
>> >>>>>>>
>> >>>>>>> --
>> >>>>>>> Matteo Grolla
>> >>>>>>> Sourcesense - making sense of Open Source
>> >>>>>>> http://www.sourcesense.com
>> >>>>>>>
>> >>>>>>> Il giorno 16/giu/2014, alle ore 17:09, Alessandro Benedetti ha
>> >>>> scritto:
>> >>>>>>>
>> >>>>>>>> mmmm the point is that right now ManifoldCF has no extractors.
>> >>>>>>>> The Repository connectors extracts directly the binary and there
>> is
>> >>>> no
>> >>>>>>>> "Extractor Processor" yet.
>> >>>>>>>> But recently a pipe-line processor architecture has been thought
>> (
>> >>>>>>>> https://issues.apache.org/jira/browse/CONNECTORS-959)
>> >>>>>>>> So can fit there.
>> >>>>>>>>
>> >>>>>>>> Cheers
>> >>>>>>>>
>> >>>>>>>>
>> >>>>>>>> 2014-06-16 15:59 GMT+01:00 Matteo Grolla <
>> m.grolla@sourcesense.com
>> >>>>> :
>> >>>>>>>>
>> >>>>>>>>> Since Solr extracting request handler takes the binary and
>> extracts
>> >>>>>>> text
>> >>>>>>>>> what is the point of not using Manifold extractor and send text
>> and
>> >>>>>>>>> binaries to solr?
>> >>>>>>>>> I mean the end result is the same solr indexes text and stores
>> text
>> >>>>>>>>> So if manifold supports text extraction it seems me this is the
>> >>>> place
>> >>>>>>>>> where it should be done
>> >>>>>>>>>
>> >>>>>>>>> --
>> >>>>>>>>> Matteo Grolla
>> >>>>>>>>> Sourcesense - making sense of Open Source
>> >>>>>>>>> http://www.sourcesense.com
>> >>>>>>>>>
>> >>>>>>>>> Il giorno 16/giu/2014, alle ore 16:51, Antonio David Perez
>> Morales
>> >>>> ha
>> >>>>>>>>> scritto:
>> >>>>>>>>>
>> >>>>>>>>>> Hi Matteo
>> >>>>>>>>>>
>> >>>>>>>>>> Manifold already handles the extraction, but the only way to
>> send
>> >>>>>>> binary
>> >>>>>>>>>> content and document metadata to Solr is using the
>> update/extract
>> >>>>>>>>> handler,
>> >>>>>>>>>> where the metadata is sent as query parameters and the binary
>> >>>>> content
>> >>>>>>> is
>> >>>>>>>>>> sent in the body of the requests, allowing Solr to use Tika to
>> >>>>> obtain
>> >>>>>>> the
>> >>>>>>>>>> raw content to be stored in Solr.
>> >>>>>>>>>>
>> >>>>>>>>>> Regards
>> >>>>>>>>>>
>> >>>>>>>>>>
>> >>>>>>>>>> On Mon, Jun 16, 2014 at 4:35 PM, Matteo Grolla <
>> >>>>>>> m.grolla@sourcesense.com
>> >>>>>>>>>>
>> >>>>>>>>>> wrote:
>> >>>>>>>>>>
>> >>>>>>>>>>> Hi During my first indexing I noticed that manifold uses Solr
>> >>>>>>> extracting
>> >>>>>>>>>>> request handler to extract the content of an xml file
>> >>>>>>>>>>> For performance reasons it would be better if Manifold handled
>> >>>> the
>> >>>>>>>>>>> extraction letting Solr do the search engine
>> >>>>>>>>>>> Is this because of the connector design, framework design or
>> just
>> >>>>> to
>> >>>>>>> be
>> >>>>>>>>>>> done?
>> >>>>>>>>>>>
>> >>>>>>>>>>> --
>> >>>>>>>>>>> Matteo Grolla
>> >>>>>>>>>>> Sourcesense - making sense of Open Source
>> >>>>>>>>>>> http://www.sourcesense.com
>> >>>>>>>>>>>
>> >>>>>>>>>>>
>> >>>>>>>>>>
>> >>>>>>>>>> --
>> >>>>>>>>>>
>> >>>>>>>>>> ------------------------------
>> >>>>>>>>>> This message should be regarded as confidential. If you have
>> >>>>> received
>> >>>>>>>>> this
>> >>>>>>>>>> email in error please notify the sender and destroy it
>> >>>> immediately.
>> >>>>>>>>>> Statements of intent shall only become binding when confirmed
>> in
>> >>>>> hard
>> >>>>>>>>> copy
>> >>>>>>>>>> by an authorised signatory.
>> >>>>>>>>>>
>> >>>>>>>>>> Zaizi Ltd is registered in England and Wales with the
>> registration
>> >>>>>>> number
>> >>>>>>>>>> 6440931. The Registered Office is Brook House, 229 Shepherds
>> Bush
>> >>>>>>> Road,
>> >>>>>>>>>> London W6 7AN.
>> >>>>>>>>>
>> >>>>>>>>>
>> >>>>>>>>
>> >>>>>>>>
>> >>>>>>>> --
>> >>>>>>>> --------------------------
>> >>>>>>>>
>> >>>>>>>> Benedetti Alessandro
>> >>>>>>>> Visiting card : http://about.me/alessandro_benedetti
>> >>>>>>>>
>> >>>>>>>> "Tyger, tyger burning bright
>> >>>>>>>> In the forests of the night,
>> >>>>>>>> What immortal hand or eye
>> >>>>>>>> Could frame thy fearful symmetry?"
>> >>>>>>>>
>> >>>>>>>> William Blake - Songs of Experience -1794 England
>> >>>>>>>
>> >>>>>>>
>> >>>>>>
>> >>>>>
>> >>>>
>> >>>>
>> >>>>
>> >>>> --
>> >>>> - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
>> >>>> Shinichiro Abe
>> >>>> 阿部 慎一朗
>> >>>>
>> >>
>> >>
>>
>>
>

Re: Solr Extracting request handler

Posted by Shinichiro Abe <sh...@gmail.com>.
Hi Karl,

> The entire lib directory is 85M:
You are correct. My apologies: the trunk size only exceeded 1 GB because I had run 'ant javadoc', so there is no problem.

> I'd rather not make things more complicated than they need to be by adding
> a new required service
Ok. I understand.

Shinichiro Abe

On 2014/06/18, at 10:55, Karl Wright <da...@gmail.com> wrote:

> Hi Abe-san,
> 
> Tika jars are not very big:
> 
> C:\wip\mcf\trunk\lib>dir tika*
> Volume in drive C has no label.
> Volume Serial Number is 002E-D1F0
> 
> Directory of C:\wip\mcf\trunk\lib
> 
> 06/05/2014  08:21 AM           493,374 tika-core.jar
> 06/05/2014  08:21 AM           523,677 tika-parsers.jar
>               2 File(s)      1,017,051 bytes
>               0 Dir(s)  140,792,315,904 bytes free
> 
> The entire lib directory is 85M:
> 
> 85,156,330 bytes
> 
> The built binary image is still about 185Mb, I believe.  So I don't know
> why you think it is >1Gb?  Temporary class files?  I don't think we can
> avoid those.
> 
> I'd rather not make things more complicated than they need to be by adding
> a new required service - even though it would fit naturally with the
> connector arrangement.
> 
> Karl
> 
> 
> 
> 
> 
> On Tue, Jun 17, 2014 at 9:42 PM, Shinichiro Abe <sh...@gmail.com>
> wrote:
> 
>> Hi Karl,
>> 
>> Okay, I assumed Tika connector outputs files.
>> If we post character data metadata got from Tika, "/update/extract" handler
>> can handle this(provides params:
>> literal.content=value&literal.metaField=foobar
>> with using NullInputStream for binary data like CONNECTORS-936).
>> 
>> BTW, now trunk built size is too big(1G+). Maybe because CloudSearch
>> connector uses Tika jars.
>> Tika connector and CloudSearch connector should extract text via
>> tika-server[1]
>> and MCF should not have many Tika jars, do you think?
>> 
>> [1]
>> http://wiki.apache.org/tika/TikaJAXRS
>> 
>> Thanks,
>> Shinichiro Abe
>> 
>> On 2014/06/18, at 9:45, Karl Wright <da...@gmail.com> wrote:
>> 
>>> Hi Abe-san,
>>> 
>>> It sounds like you might be thinking that transformation connectors are
>>> like output connectors.  Just so we are clear, transformation connectors
>> in
>>> 1.7 receive a RepositoryDocument as input, and then pass a
>>> RepositoryDocument on to the next connector in the chain.  So I don't
>> know
>>> why .xml files would be involved.  I'd expect the Tika connector to read
>> a
>>> binary file from one RepositoryDocument object and convert its contents
>> to
>>> another RepositoryDocument object which would have character data and
>>> metadata only.  Would this work for your case, do you think?
>>> 
>>> Karl
>>> 
>>> 
>>> 
>>> On Tue, Jun 17, 2014 at 8:38 PM, Shinichiro Abe <
>> shinichiro.abe.1@gmail.com>
>>> wrote:
>>> 
>>>> Hi Karl,
>>>> 
>>>> Yes. I thought the standard update handler met that requirement.
>>>> For instance, Tika extractor transformation connector creates two files.
>>>> 1. addtoSolr.xml for add and update
>>>> 2. deletetoSolr.xml for delete
>>>> File connector ingests these xml files, then Solr connector posts these
>>>> files by "/update" handler.
>>>> 
>>>> In the the Solr Connector, other function as to update handler
>>>> might not be necessary except for  "/update" handler.
>>>> 
>>>> Thanks,
>>>> Shinichiro Abe
>>>> 
>>>> On 2014/06/18, at 8:02, Karl Wright <da...@gmail.com> wrote:
>>>> 
>>>>> Hi Abe-san,
>>>>> 
>>>>> So just to be sure -- you believe that no changes at all are required
>> to
>>>>> the Solr Connector as it stands now, other than to use the update
>> handler
>>>>> rather than the /update/extract handler?
>>>>> 
>>>>> Karl
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> On Tue, Jun 17, 2014 at 5:14 PM, Shinichiro Abe <
>>>> shinichiro.abe.1@gmail.com>
>>>>> wrote:
>>>>> 
>>>>>>> As for changing the Solr connector so that it doesn't go to the
>>>> extracting
>>>>>> update handler
>>>>>> 
>>>>>> I don't think it needs to change Solr connector with new checkbox
>>>> because
>>>>>> currently we can change "/update/extract" into "/update" at 'Update
>>>>>> Handler' at Paths tab in Solr connector UI. I confirmed I could post
>>>> CSV,
>>>>>> JSON and XML files to Solr by changing that and using File connector.
>>>> So I
>>>>>> wish we allow Tika extractor transformation connector to create XML
>>>> files
>>>>>> that Solr expects to see.
>>>>>> 
>>>>>> Regards,
>>>>>> Shinichiro Abe
>>>>>> 
>>>>>> 
>>>>>> 2014-06-18 2:55 GMT+09:00 Karl Wright <da...@gmail.com>:
>>>>>> 
>>>>>>> The pipeline code itself is now "complete" in trunk.  Zaizi said
>> they'd
>>>>>>> contribute a Tika extractor transformation connector - and if they
>>>> don't
>>>>>>> get around to that in a month or so, I may take a crack at it myself.
>>>>>>> 
>>>>>>> As for changing the Solr connector so that it doesn't go to the
>>>>>> extracting
>>>>>>> update handler, it would be great if:
>>>>>>> (1) Someone created a ticket for this, and
>>>>>>> (2) A patch was provided that maintains backwards compatibility with
>>>>>>> previous versions of the connector (so a checkbox would probably need
>>>> to
>>>>>> go
>>>>>>> into the UI somewhere).  Do either of you want to start this process?
>>>>>>> 
>>>>>>> Thanks!
>>>>>>> Karl
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> On Mon, Jun 16, 2014 at 12:37 PM, Karl Wright <da...@gmail.com>
>>>>>> wrote:
>>>>>>> 
>>>>>>>> Hi guys,
>>>>>>>> 
>>>>>>>> You folks may not have looked at 1.7 yet, but it has a full
>> pipeline,
>>>>>> and
>>>>>>>> is expected to have a Tika extractor as a transformation connector.
>>>>>>>> 
>>>>>>>> Karl
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> On Mon, Jun 16, 2014 at 11:14 AM, Matteo Grolla <
>>>>>>> m.grolla@sourcesense.com>
>>>>>>>> wrote:
>>>>>>>> 
>>>>>>>>> Thanks Alessandro,
>>>>>>>>>      that explains the situation clearly.
>>>>>>>>> And I agree that sending all the metadata as get parameter can be
>>>>>>>>> problematic
>>>>>>>>> 
>>>>>>>>> Cheers
>>>>>>>>> 
>>>>>>>>> --
>>>>>>>>> Matteo Grolla
>>>>>>>>> Sourcesense - making sense of Open Source
>>>>>>>>> http://www.sourcesense.com
>>>>>>>>> 
>>>>>>>>> Il giorno 16/giu/2014, alle ore 17:09, Alessandro Benedetti ha
>>>>>> scritto:
>>>>>>>>> 
>>>>>>>>>> mmmm the point is that right now ManifoldCF has no extractors.
>>>>>>>>>> The Repository connectors extracts directly the binary and there
>> is
>>>>>> no
>>>>>>>>>> "Extractor Processor" yet.
>>>>>>>>>> But recently a pipe-line processor architecture has been thought (
>>>>>>>>>> https://issues.apache.org/jira/browse/CONNECTORS-959)
>>>>>>>>>> So can fit there.
>>>>>>>>>> 
>>>>>>>>>> Cheers
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> 2014-06-16 15:59 GMT+01:00 Matteo Grolla <
>> m.grolla@sourcesense.com
>>>>>>> :
>>>>>>>>>> 
>>>>>>>>>>> Since Solr extracting request handler takes the binary and
>> extracts
>>>>>>>>> text
>>>>>>>>>>> what is the point of not using Manifold extractor and send text
>> and
>>>>>>>>>>> binaries to solr?
>>>>>>>>>>> I mean the end result is the same solr indexes text and stores
>> text
>>>>>>>>>>> So if manifold supports text extraction it seems me this is the
>>>>>> place
>>>>>>>>>>> where it should be done
>>>>>>>>>>> 
>>>>>>>>>>> --
>>>>>>>>>>> Matteo Grolla
>>>>>>>>>>> Sourcesense - making sense of Open Source
>>>>>>>>>>> http://www.sourcesense.com
>>>>>>>>>>> 
>>>>>>>>>>> Il giorno 16/giu/2014, alle ore 16:51, Antonio David Perez
>> Morales
>>>>>> ha
>>>>>>>>>>> scritto:
>>>>>>>>>>> 
>>>>>>>>>>>> Hi Matteo
>>>>>>>>>>>> 
>>>>>>>>>>>> Manifold already handles the extraction, but the only way to
>> send
>>>>>>>>> binary
>>>>>>>>>>>> content and document metadata to Solr is using the
>> update/extract
>>>>>>>>>>> handler,
>>>>>>>>>>>> where the metadata is sent as query parameters and the binary
>>>>>>> content
>>>>>>>>> is
>>>>>>>>>>>> sent in the body of the requests, allowing Solr to use Tika to
>>>>>>> obtain
>>>>>>>>> the
>>>>>>>>>>>> raw content to be stored in Solr.
>>>>>>>>>>>> 
>>>>>>>>>>>> Regards
>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>> On Mon, Jun 16, 2014 at 4:35 PM, Matteo Grolla <
>>>>>>>>> m.grolla@sourcesense.com
>>>>>>>>>>>> 
>>>>>>>>>>>> wrote:
>>>>>>>>>>>> 
>>>>>>>>>>>>> Hi During my first indexing I noticed that manifold uses Solr
>>>>>>>>> extracting
>>>>>>>>>>>>> request handler to extract the content of an xml file
>>>>>>>>>>>>> For performance reasons it would be better if Manifold handled
>>>>>> the
>>>>>>>>>>>>> extraction letting Solr do the search engine
>>>>>>>>>>>>> Is this because of the connector design, framework design or
>> just
>>>>>>> to
>>>>>>>>> be
>>>>>>>>>>>>> done?
>>>>>>>>>>>>> 
>>>>>>>>>>>>> --
>>>>>>>>>>>>> Matteo Grolla
>>>>>>>>>>>>> Sourcesense - making sense of Open Source
>>>>>>>>>>>>> http://www.sourcesense.com
>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>> --
>>>>>>>>>>>> 
>>>>>>>>>>>> ------------------------------
>>>>>>>>>>>> This message should be regarded as confidential. If you have
>>>>>>> received
>>>>>>>>>>> this
>>>>>>>>>>>> email in error please notify the sender and destroy it
>>>>>> immediately.
>>>>>>>>>>>> Statements of intent shall only become binding when confirmed in
>>>>>>> hard
>>>>>>>>>>> copy
>>>>>>>>>>>> by an authorised signatory.
>>>>>>>>>>>> 
>>>>>>>>>>>> Zaizi Ltd is registered in England and Wales with the
>> registration
>>>>>>>>> number
>>>>>>>>>>>> 6440931. The Registered Office is Brook House, 229 Shepherds
>> Bush
>>>>>>>>> Road,
>>>>>>>>>>>> London W6 7AN.
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> --
>>>>>>>>>> --------------------------
>>>>>>>>>> 
>>>>>>>>>> Benedetti Alessandro
>>>>>>>>>> Visiting card : http://about.me/alessandro_benedetti
>>>>>>>>>> 
>>>>>>>>>> "Tyger, tyger burning bright
>>>>>>>>>> In the forests of the night,
>>>>>>>>>> What immortal hand or eye
>>>>>>>>>> Could frame thy fearful symmetry?"
>>>>>>>>>> 
>>>>>>>>>> William Blake - Songs of Experience -1794 England
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>> 
>>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> --
>>>>>> - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
>>>>>> Shinichiro Abe
>>>>>> 阿部 慎一朗
>>>>>> 
>>>> 
>>>> 
>> 
>> 


Re: Solr Extracting request handler

Posted by Karl Wright <da...@gmail.com>.
Hi Abe-san,

Tika jars are not very big:

C:\wip\mcf\trunk\lib>dir tika*
 Volume in drive C has no label.
 Volume Serial Number is 002E-D1F0

 Directory of C:\wip\mcf\trunk\lib

06/05/2014  08:21 AM           493,374 tika-core.jar
06/05/2014  08:21 AM           523,677 tika-parsers.jar
               2 File(s)      1,017,051 bytes
               0 Dir(s)  140,792,315,904 bytes free

The entire lib directory is 85M:

85,156,330 bytes

The built binary image is still about 185Mb, I believe.  So I don't know
why you think it is >1Gb?  Temporary class files?  I don't think we can
avoid those.

I'd rather not make things more complicated than they need to be by adding
a new required service - even though it would fit naturally with the
connector arrangement.

Karl





On Tue, Jun 17, 2014 at 9:42 PM, Shinichiro Abe <sh...@gmail.com>
wrote:

> Hi Karl,
>
> Okay, I assumed Tika connector outputs files.
> If we post character data metadata got from Tika, "/update/extract" handler
> can handle this(provides params:
> literal.content=value&literal.metaField=foobar
> with using NullInputStream for binary data like CONNECTORS-936).
>
> BTW, now trunk built size is too big(1G+). Maybe because CloudSearch
> connector uses Tika jars.
> Tika connector and CloudSearch connector should extract text via
> tika-server[1]
> and MCF should not have many Tika jars, do you think?
>
> [1]
> http://wiki.apache.org/tika/TikaJAXRS
>
> Thanks,
> Shinichiro Abe
>
> On 2014/06/18, at 9:45, Karl Wright <da...@gmail.com> wrote:
>
> > Hi Abe-san,
> >
> > It sounds like you might be thinking that transformation connectors are
> > like output connectors.  Just so we are clear, transformation connectors
> in
> > 1.7 receive a RepositoryDocument as input, and then pass a
> > RepositoryDocument on to the next connector in the chain.  So I don't
> know
> > why .xml files would be involved.  I'd expect the Tika connector to read
> a
> > binary file from one RepositoryDocument object and convert its contents
> to
> > another RepositoryDocument object which would have character data and
> > metadata only.  Would this work for your case, do you think?
> >
> > Karl
> >
> >
> >
> > On Tue, Jun 17, 2014 at 8:38 PM, Shinichiro Abe <
> shinichiro.abe.1@gmail.com>
> > wrote:
> >
> >> Hi Karl,
> >>
> >> Yes. I thought the standard update handler met that requirement.
> >> For instance, Tika extractor transformation connector creates two files.
> >> 1. addtoSolr.xml for add and update
> >> 2. deletetoSolr.xml for delete
> >> File connector ingests these xml files, then Solr connector posts these
> >> files by "/update" handler.
> >>
> >> In the the Solr Connector, other function as to update handler
> >> might not be necessary except for  "/update" handler.
> >>
> >> Thanks,
> >> Shinichiro Abe
> >>
> >> On 2014/06/18, at 8:02, Karl Wright <da...@gmail.com> wrote:
> >>
> >>> Hi Abe-san,
> >>>
> >>> So just to be sure -- you believe that no changes at all are required
> to
> >>> the Solr Connector as it stands now, other than to use the update
> handler
> >>> rather than the /update/extract handler?
> >>>
> >>> Karl
> >>>
> >>>
> >>>
> >>>
> >>>
> >>> On Tue, Jun 17, 2014 at 5:14 PM, Shinichiro Abe <
> >> shinichiro.abe.1@gmail.com>
> >>> wrote:
> >>>
> >>>>> As for changing the Solr connector so that it doesn't go to the
> >> extracting
> >>>> update handler
> >>>>
> >>>> I don't think it needs to change Solr connector with new checkbox
> >> because
> >>>> currently we can change "/update/extract" into "/update" at 'Update
> >>>> Handler' at Paths tab in Solr connector UI. I confirmed I could post
> >> CSV,
> >>>> JSON and XML files to Solr by changing that and using File connector.
> >> So I
> >>>> wish we allow Tika extractor transformation connector to create XML
> >> files
> >>>> that Solr expects to see.
> >>>>
> >>>> Regards,
> >>>> Shinichiro Abe
> >>>>
> >>>>
> >>>> 2014-06-18 2:55 GMT+09:00 Karl Wright <da...@gmail.com>:
> >>>>
> >>>>> The pipeline code itself is now "complete" in trunk.  Zaizi said
> they'd
> >>>>> contribute a Tika extractor transformation connector - and if they
> >> don't
> >>>>> get around to that in a month or so, I may take a crack at it myself.
> >>>>>
> >>>>> As for changing the Solr connector so that it doesn't go to the
> >>>> extracting
> >>>>> update handler, it would be great if:
> >>>>> (1) Someone created a ticket for this, and
> >>>>> (2) A patch was provided that maintains backwards compatibility with
> >>>>> previous versions of the connector (so a checkbox would probably need
> >> to
> >>>> go
> >>>>> into the UI somewhere).  Do either of you want to start this process?
> >>>>>
> >>>>> Thanks!
> >>>>> Karl
> >>>>>
> >>>>>
> >>>>>
> >>>>> On Mon, Jun 16, 2014 at 12:37 PM, Karl Wright <da...@gmail.com>
> >>>> wrote:
> >>>>>
> >>>>>> Hi guys,
> >>>>>>
> >>>>>> You folks may not have looked at 1.7 yet, but it has a full
> pipeline,
> >>>> and
> >>>>>> is expected to have a Tika extractor as a transformation connector.
> >>>>>>
> >>>>>> Karl
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>> On Mon, Jun 16, 2014 at 11:14 AM, Matteo Grolla <
> >>>>> m.grolla@sourcesense.com>
> >>>>>> wrote:
> >>>>>>
> >>>>>>> Thanks Alessandro,
> >>>>>>>       that explains the situation clearly.
> >>>>>>> And I agree that sending all the metadata as get parameter can be
> >>>>>>> problematic
> >>>>>>>
> >>>>>>> Cheers
> >>>>>>>
> >>>>>>> --
> >>>>>>> Matteo Grolla
> >>>>>>> Sourcesense - making sense of Open Source
> >>>>>>> http://www.sourcesense.com
> >>>>>>>
> >>>>>>> Il giorno 16/giu/2014, alle ore 17:09, Alessandro Benedetti ha
> >>>> scritto:
> >>>>>>>
> >>>>>>>> mmmm the point is that right now ManifoldCF has no extractors.
> >>>>>>>> The Repository connectors extracts directly the binary and there
> is
> >>>> no
> >>>>>>>> "Extractor Processor" yet.
> >>>>>>>> But recently a pipe-line processor architecture has been thought (
> >>>>>>>> https://issues.apache.org/jira/browse/CONNECTORS-959)
> >>>>>>>> So can fit there.
> >>>>>>>>
> >>>>>>>> Cheers
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> 2014-06-16 15:59 GMT+01:00 Matteo Grolla <
> m.grolla@sourcesense.com
> >>>>> :
> >>>>>>>>
> >>>>>>>>> Since Solr extracting request handler takes the binary and
> extracts
> >>>>>>> text
> >>>>>>>>> what is the point of not using Manifold extractor and send text
> and
> >>>>>>>>> binaries to solr?
> >>>>>>>>> I mean the end result is the same solr indexes text and stores
> text
> >>>>>>>>> So if manifold supports text extraction it seems me this is the
> >>>> place
> >>>>>>>>> where it should be done
> >>>>>>>>>
> >>>>>>>>> --
> >>>>>>>>> Matteo Grolla
> >>>>>>>>> Sourcesense - making sense of Open Source
> >>>>>>>>> http://www.sourcesense.com
> >>>>>>>>>
> >>>>>>>>> Il giorno 16/giu/2014, alle ore 16:51, Antonio David Perez
> Morales
> >>>> ha
> >>>>>>>>> scritto:
> >>>>>>>>>
> >>>>>>>>>> Hi Matteo
> >>>>>>>>>>
> >>>>>>>>>> Manifold already handles the extraction, but the only way to
> send
> >>>>>>> binary
> >>>>>>>>>> content and document metadata to Solr is using the
> update/extract
> >>>>>>>>> handler,
> >>>>>>>>>> where the metadata is sent as query parameters and the binary
> >>>>> content
> >>>>>>> is
> >>>>>>>>>> sent in the body of the requests, allowing Solr to use Tika to
> >>>>> obtain
> >>>>>>> the
> >>>>>>>>>> raw content to be stored in Solr.
> >>>>>>>>>>
> >>>>>>>>>> Regards
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> On Mon, Jun 16, 2014 at 4:35 PM, Matteo Grolla <
> >>>>>>> m.grolla@sourcesense.com
> >>>>>>>>>>
> >>>>>>>>>> wrote:
> >>>>>>>>>>
> >>>>>>>>>>> Hi During my first indexing I noticed that manifold uses Solr
> >>>>>>> extracting
> >>>>>>>>>>> request handler to extract the content of an xml file
> >>>>>>>>>>> For performance reasons it would be better if Manifold handled
> >>>> the
> >>>>>>>>>>> extraction letting Solr do the search engine
> >>>>>>>>>>> Is this because of the connector design, framework design or
> just
> >>>>> to
> >>>>>>> be
> >>>>>>>>>>> done?
> >>>>>>>>>>>
> >>>>>>>>>>> --
> >>>>>>>>>>> Matteo Grolla
> >>>>>>>>>>> Sourcesense - making sense of Open Source
> >>>>>>>>>>> http://www.sourcesense.com
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> --
> >>>>>>>>>>
> >>>>>>>>>> ------------------------------
> >>>>>>>>>> This message should be regarded as confidential. If you have
> >>>>> received
> >>>>>>>>> this
> >>>>>>>>>> email in error please notify the sender and destroy it
> >>>> immediately.
> >>>>>>>>>> Statements of intent shall only become binding when confirmed in
> >>>>> hard
> >>>>>>>>> copy
> >>>>>>>>>> by an authorised signatory.
> >>>>>>>>>>
> >>>>>>>>>> Zaizi Ltd is registered in England and Wales with the
> registration
> >>>>>>> number
> >>>>>>>>>> 6440931. The Registered Office is Brook House, 229 Shepherds
> Bush
> >>>>>>> Road,
> >>>>>>>>>> London W6 7AN.
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> --
> >>>>>>>> --------------------------
> >>>>>>>>
> >>>>>>>> Benedetti Alessandro
> >>>>>>>> Visiting card : http://about.me/alessandro_benedetti
> >>>>>>>>
> >>>>>>>> "Tyger, tyger burning bright
> >>>>>>>> In the forests of the night,
> >>>>>>>> What immortal hand or eye
> >>>>>>>> Could frame thy fearful symmetry?"
> >>>>>>>>
> >>>>>>>> William Blake - Songs of Experience -1794 England
> >>>>>>>
> >>>>>>>
> >>>>>>
> >>>>>
> >>>>
> >>>>
> >>>>
> >>>> --
> >>>> - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
> >>>> Shinichiro Abe
> >>>> 阿部 慎一朗
> >>>>
> >>
> >>
>
>

Re: Solr Extracting request handler

Posted by Shinichiro Abe <sh...@gmail.com>.
Hi Karl,

Okay, I had assumed the Tika connector would output files.
If we post the character data and metadata obtained from Tika, the "/update/extract" handler
can handle this (by providing literal parameters such as
literal.content=value&literal.metaField=foobar,
and using a NullInputStream for the binary data, as in CONNECTORS-936).
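
Roughly, such a request could look like this SolrJ sketch (field names are
placeholders, and an empty content stream stands in for the NullInputStream):

    import org.apache.solr.client.solrj.SolrServer;
    import org.apache.solr.client.solrj.impl.HttpSolrServer;
    import org.apache.solr.client.solrj.request.ContentStreamUpdateRequest;
    import org.apache.solr.common.util.ContentStreamBase;

    // Sketch only: send already-extracted text and metadata to /update/extract
    // purely as literal.* parameters, with a zero-length body in place of the
    // binary stream.
    public class ExtractHandlerLiteralsSketch {
      public static void main(String[] args) throws Exception {
        SolrServer server = new HttpSolrServer("http://localhost:8983/solr/collection1");

        ContentStreamUpdateRequest req = new ContentStreamUpdateRequest("/update/extract");
        req.addContentStream(new ContentStreamBase.StringStream("")); // empty body
        req.setParam("literal.id", "http://example.com/some-document");
        req.setParam("literal.content", "text already extracted by Tika");
        req.setParam("literal.metaField", "foobar");
        req.setParam("commit", "true");

        server.request(req);
      }
    }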

BTW, the trunk build size is now too big (1 GB+), maybe because the CloudSearch
connector uses the Tika jars.
Should the Tika connector and the CloudSearch connector extract text via tika-server[1],
so that MCF does not have to bundle many Tika jars? What do you think?

[1]
http://wiki.apache.org/tika/TikaJAXRS

Thanks,
Shinichiro Abe

On 2014/06/18, at 9:45, Karl Wright <da...@gmail.com> wrote:

> Hi Abe-san,
> 
> It sounds like you might be thinking that transformation connectors are
> like output connectors.  Just so we are clear, transformation connectors in
> 1.7 receive a RepositoryDocument as input, and then pass a
> RepositoryDocument on to the next connector in the chain.  So I don't know
> why .xml files would be involved.  I'd expect the Tika connector to read a
> binary file from one RepositoryDocument object and convert its contents to
> another RepositoryDocument object which would have character data and
> metadata only.  Would this work for your case, do you think?
> 
> Karl
> 
> 
> 
> On Tue, Jun 17, 2014 at 8:38 PM, Shinichiro Abe <sh...@gmail.com>
> wrote:
> 
>> Hi Karl,
>> 
>> Yes. I thought the standard update handler met that requirement.
>> For instance, Tika extractor transformation connector creates two files.
>> 1. addtoSolr.xml for add and update
>> 2. deletetoSolr.xml for delete
>> File connector ingests these xml files, then Solr connector posts these
>> files by "/update" handler.
>> 
>> In the the Solr Connector, other function as to update handler
>> might not be necessary except for  "/update" handler.
>> 
>> Thanks,
>> Shinichiro Abe
>> 
>> On 2014/06/18, at 8:02, Karl Wright <da...@gmail.com> wrote:
>> 
>>> Hi Abe-san,
>>> 
>>> So just to be sure -- you believe that no changes at all are required to
>>> the Solr Connector as it stands now, other than to use the update handler
>>> rather than the /update/extract handler?
>>> 
>>> Karl
>>> 
>>> 
>>> 
>>> 
>>> 
>>> On Tue, Jun 17, 2014 at 5:14 PM, Shinichiro Abe <
>> shinichiro.abe.1@gmail.com>
>>> wrote:
>>> 
>>>>> As for changing the Solr connector so that it doesn't go to the
>> extracting
>>>> update handler
>>>> 
>>>> I don't think it needs to change Solr connector with new checkbox
>> because
>>>> currently we can change "/update/extract" into "/update" at 'Update
>>>> Handler' at Paths tab in Solr connector UI. I confirmed I could post
>> CSV,
>>>> JSON and XML files to Solr by changing that and using File connector.
>> So I
>>>> wish we allow Tika extractor transformation connector to create XML
>> files
>>>> that Solr expects to see.
>>>> 
>>>> Regards,
>>>> Shinichiro Abe
>>>> 
>>>> 
>>>> 2014-06-18 2:55 GMT+09:00 Karl Wright <da...@gmail.com>:
>>>> 
>>>>> The pipeline code itself is now "complete" in trunk.  Zaizi said they'd
>>>>> contribute a Tika extractor transformation connector - and if they
>> don't
>>>>> get around to that in a month or so, I may take a crack at it myself.
>>>>> 
>>>>> As for changing the Solr connector so that it doesn't go to the
>>>> extracting
>>>>> update handler, it would be great if:
>>>>> (1) Someone created a ticket for this, and
>>>>> (2) A patch was provided that maintains backwards compatibility with
>>>>> previous versions of the connector (so a checkbox would probably need
>> to
>>>> go
>>>>> into the UI somewhere).  Do either of you want to start this process?
>>>>> 
>>>>> Thanks!
>>>>> Karl
>>>>> 
>>>>> 
>>>>> 
>>>>> On Mon, Jun 16, 2014 at 12:37 PM, Karl Wright <da...@gmail.com>
>>>> wrote:
>>>>> 
>>>>>> Hi guys,
>>>>>> 
>>>>>> You folks may not have looked at 1.7 yet, but it has a full pipeline,
>>>> and
>>>>>> is expected to have a Tika extractor as a transformation connector.
>>>>>> 
>>>>>> Karl
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> On Mon, Jun 16, 2014 at 11:14 AM, Matteo Grolla <
>>>>> m.grolla@sourcesense.com>
>>>>>> wrote:
>>>>>> 
>>>>>>> Thanks Alessandro,
>>>>>>>       that explains the situation clearly.
>>>>>>> And I agree that sending all the metadata as get parameter can be
>>>>>>> problematic
>>>>>>> 
>>>>>>> Cheers
>>>>>>> 
>>>>>>> --
>>>>>>> Matteo Grolla
>>>>>>> Sourcesense - making sense of Open Source
>>>>>>> http://www.sourcesense.com
>>>>>>> 
>>>>>>> Il giorno 16/giu/2014, alle ore 17:09, Alessandro Benedetti ha
>>>> scritto:
>>>>>>> 
>>>>>>>> mmmm the point is that right now ManifoldCF has no extractors.
>>>>>>>> The Repository connectors extracts directly the binary and there is
>>>> no
>>>>>>>> "Extractor Processor" yet.
>>>>>>>> But recently a pipe-line processor architecture has been thought (
>>>>>>>> https://issues.apache.org/jira/browse/CONNECTORS-959)
>>>>>>>> So can fit there.
>>>>>>>> 
>>>>>>>> Cheers
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 2014-06-16 15:59 GMT+01:00 Matteo Grolla <m.grolla@sourcesense.com
>>>>> :
>>>>>>>> 
>>>>>>>>> Since Solr extracting request handler takes the binary and extracts
>>>>>>> text
>>>>>>>>> what is the point of not using Manifold extractor and send text and
>>>>>>>>> binaries to solr?
>>>>>>>>> I mean the end result is the same solr indexes text and stores text
>>>>>>>>> So if manifold supports text extraction it seems me this is the
>>>> place
>>>>>>>>> where it should be done
>>>>>>>>> 
>>>>>>>>> --
>>>>>>>>> Matteo Grolla
>>>>>>>>> Sourcesense - making sense of Open Source
>>>>>>>>> http://www.sourcesense.com
>>>>>>>>> 
>>>>>>>>> Il giorno 16/giu/2014, alle ore 16:51, Antonio David Perez Morales
>>>> ha
>>>>>>>>> scritto:
>>>>>>>>> 
>>>>>>>>>> Hi Matteo
>>>>>>>>>> 
>>>>>>>>>> Manifold already handles the extraction, but the only way to send
>>>>>>> binary
>>>>>>>>>> content and document metadata to Solr is using the update/extract
>>>>>>>>> handler,
>>>>>>>>>> where the metadata is sent as query parameters and the binary
>>>>> content
>>>>>>> is
>>>>>>>>>> sent in the body of the requests, allowing Solr to use Tika to
>>>>> obtain
>>>>>>> the
>>>>>>>>>> raw content to be stored in Solr.
>>>>>>>>>> 
>>>>>>>>>> Regards
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> On Mon, Jun 16, 2014 at 4:35 PM, Matteo Grolla <
>>>>>>> m.grolla@sourcesense.com
>>>>>>>>>> 
>>>>>>>>>> wrote:
>>>>>>>>>> 
>>>>>>>>>>> Hi During my first indexing I noticed that manifold uses Solr
>>>>>>> extracting
>>>>>>>>>>> request handler to extract the content of an xml file
>>>>>>>>>>> For performance reasons it would be better if Manifold handled
>>>> the
>>>>>>>>>>> extraction letting Solr do the search engine
>>>>>>>>>>> Is this because of the connector design, framework design or just
>>>>> to
>>>>>>> be
>>>>>>>>>>> done?
>>>>>>>>>>> 
>>>>>>>>>>> --
>>>>>>>>>>> Matteo Grolla
>>>>>>>>>>> Sourcesense - making sense of Open Source
>>>>>>>>>>> http://www.sourcesense.com
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> --
>>>>>>>>>> 
>>>>>>>>>> ------------------------------
>>>>>>>>>> This message should be regarded as confidential. If you have
>>>>> received
>>>>>>>>> this
>>>>>>>>>> email in error please notify the sender and destroy it
>>>> immediately.
>>>>>>>>>> Statements of intent shall only become binding when confirmed in
>>>>> hard
>>>>>>>>> copy
>>>>>>>>>> by an authorised signatory.
>>>>>>>>>> 
>>>>>>>>>> Zaizi Ltd is registered in England and Wales with the registration
>>>>>>> number
>>>>>>>>>> 6440931. The Registered Office is Brook House, 229 Shepherds Bush
>>>>>>> Road,
>>>>>>>>>> London W6 7AN.
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> --
>>>>>>>> --------------------------
>>>>>>>> 
>>>>>>>> Benedetti Alessandro
>>>>>>>> Visiting card : http://about.me/alessandro_benedetti
>>>>>>>> 
>>>>>>>> "Tyger, tyger burning bright
>>>>>>>> In the forests of the night,
>>>>>>>> What immortal hand or eye
>>>>>>>> Could frame thy fearful symmetry?"
>>>>>>>> 
>>>>>>>> William Blake - Songs of Experience -1794 England
>>>>>>> 
>>>>>>> 
>>>>>> 
>>>>> 
>>>> 
>>>> 
>>>> 
>>>> --
>>>> - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
>>>> Shinichiro Abe
>>>> 阿部 慎一朗
>>>> 
>> 
>> 


Re: Solr Extracting request handler

Posted by Karl Wright <da...@gmail.com>.
Hi Abe-san,

It sounds like you might be thinking that transformation connectors are
like output connectors.  Just so we are clear, transformation connectors in
1.7 receive a RepositoryDocument as input, and then pass a
RepositoryDocument on to the next connector in the chain.  So I don't know
why .xml files would be involved.  I'd expect the Tika connector to read a
binary file from one RepositoryDocument object and convert its contents to
another RepositoryDocument object which would have character data and
metadata only.  Would this work for your case, do you think?
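
For concreteness, a rough sketch of the kind of conversion I mean (the
ExtractedDoc holder is just a hypothetical stand-in for the outgoing
RepositoryDocument, not the actual MCF API):

    import java.io.InputStream;
    import java.util.HashMap;
    import java.util.Map;

    import org.apache.tika.metadata.Metadata;
    import org.apache.tika.parser.AutoDetectParser;
    import org.apache.tika.parser.ParseContext;
    import org.apache.tika.sax.BodyContentHandler;

    // Sketch only: run the incoming binary stream through Tika and keep just
    // the character data and metadata, which would then populate the
    // RepositoryDocument passed to the next connector in the chain.
    public class TikaExtractionSketch {

      public static ExtractedDoc extract(InputStream binary) throws Exception {
        BodyContentHandler handler = new BodyContentHandler(-1); // no write limit
        Metadata metadata = new Metadata();
        new AutoDetectParser().parse(binary, handler, metadata, new ParseContext());

        ExtractedDoc out = new ExtractedDoc();
        out.content = handler.toString();        // extracted character data
        for (String name : metadata.names()) {   // extracted metadata fields
          out.fields.put(name, metadata.get(name));
        }
        return out;
      }

      // Hypothetical holder standing in for the outgoing RepositoryDocument.
      public static class ExtractedDoc {
        public String content;
        public Map<String, String> fields = new HashMap<String, String>();
      }
    }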

Karl



On Tue, Jun 17, 2014 at 8:38 PM, Shinichiro Abe <sh...@gmail.com>
wrote:

> Hi Karl,
>
> Yes. I thought the standard update handler met that requirement.
> For instance, Tika extractor transformation connector creates two files.
> 1. addtoSolr.xml for add and update
> 2. deletetoSolr.xml for delete
> File connector ingests these xml files, then Solr connector posts these
> files by "/update" handler.
>
> In the the Solr Connector, other function as to update handler
> might not be necessary except for  "/update" handler.
>
> Thanks,
> Shinichiro Abe
>
> On 2014/06/18, at 8:02, Karl Wright <da...@gmail.com> wrote:
>
> > Hi Abe-san,
> >
> > So just to be sure -- you believe that no changes at all are required to
> > the Solr Connector as it stands now, other than to use the update handler
> > rather than the /update/extract handler?
> >
> > Karl
> >
> >
> >
> >
> >
> > On Tue, Jun 17, 2014 at 5:14 PM, Shinichiro Abe <
> shinichiro.abe.1@gmail.com>
> > wrote:
> >
> >>> As for changing the Solr connector so that it doesn't go to the
> extracting
> >> update handler
> >>
> >> I don't think it needs to change Solr connector with new checkbox
> because
> >> currently we can change "/update/extract" into "/update" at 'Update
> >> Handler' at Paths tab in Solr connector UI. I confirmed I could post
> CSV,
> >> JSON and XML files to Solr by changing that and using File connector.
> So I
> >> wish we allow Tika extractor transformation connector to create XML
> files
> >> that Solr expects to see.
> >>
> >> Regards,
> >> Shinichiro Abe
> >>
> >>
> >> 2014-06-18 2:55 GMT+09:00 Karl Wright <da...@gmail.com>:
> >>
> >>> The pipeline code itself is now "complete" in trunk.  Zaizi said they'd
> >>> contribute a Tika extractor transformation connector - and if they
> don't
> >>> get around to that in a month or so, I may take a crack at it myself.
> >>>
> >>> As for changing the Solr connector so that it doesn't go to the
> >> extracting
> >>> update handler, it would be great if:
> >>> (1) Someone created a ticket for this, and
> >>> (2) A patch was provided that maintains backwards compatibility with
> >>> previous versions of the connector (so a checkbox would probably need
> to
> >> go
> >>> into the UI somewhere).  Do either of you want to start this process?
> >>>
> >>> Thanks!
> >>> Karl
> >>>
> >>>
> >>>
> >>> On Mon, Jun 16, 2014 at 12:37 PM, Karl Wright <da...@gmail.com>
> >> wrote:
> >>>
> >>>> Hi guys,
> >>>>
> >>>> You folks may not have looked at 1.7 yet, but it has a full pipeline,
> >> and
> >>>> is expected to have a Tika extractor as a transformation connector.
> >>>>
> >>>> Karl
> >>>>
> >>>>
> >>>>
> >>>> On Mon, Jun 16, 2014 at 11:14 AM, Matteo Grolla <
> >>> m.grolla@sourcesense.com>
> >>>> wrote:
> >>>>
> >>>>> Thanks Alessandro,
> >>>>>        that explains the situation clearly.
> >>>>> And I agree that sending all the metadata as get parameter can be
> >>>>> problematic
> >>>>>
> >>>>> Cheers
> >>>>>
> >>>>> --
> >>>>> Matteo Grolla
> >>>>> Sourcesense - making sense of Open Source
> >>>>> http://www.sourcesense.com
> >>>>>
> >>>>> Il giorno 16/giu/2014, alle ore 17:09, Alessandro Benedetti ha
> >> scritto:
> >>>>>
> >>>>>> mmmm the point is that right now ManifoldCF has no extractors.
> >>>>>> The Repository connectors extracts directly the binary and there is
> >> no
> >>>>>> "Extractor Processor" yet.
> >>>>>> But recently a pipe-line processor architecture has been thought (
> >>>>>> https://issues.apache.org/jira/browse/CONNECTORS-959)
> >>>>>> So can fit there.
> >>>>>>
> >>>>>> Cheers
> >>>>>>
> >>>>>>
> >>>>>> 2014-06-16 15:59 GMT+01:00 Matteo Grolla <m.grolla@sourcesense.com
> >>> :
> >>>>>>
> >>>>>>> Since Solr extracting request handler takes the binary and extracts
> >>>>> text
> >>>>>>> what is the point of not using Manifold extractor and send text and
> >>>>>>> binaries to solr?
> >>>>>>> I mean the end result is the same solr indexes text and stores text
> >>>>>>> So if manifold supports text extraction it seems me this is the
> >> place
> >>>>>>> where it should be done
> >>>>>>>
> >>>>>>> --
> >>>>>>> Matteo Grolla
> >>>>>>> Sourcesense - making sense of Open Source
> >>>>>>> http://www.sourcesense.com
> >>>>>>>
> >>>>>>> Il giorno 16/giu/2014, alle ore 16:51, Antonio David Perez Morales
> >> ha
> >>>>>>> scritto:
> >>>>>>>
> >>>>>>>> Hi Matteo
> >>>>>>>>
> >>>>>>>> Manifold already handles the extraction, but the only way to send
> >>>>> binary
> >>>>>>>> content and document metadata to Solr is using the update/extract
> >>>>>>> handler,
> >>>>>>>> where the metadata is sent as query parameters and the binary
> >>> content
> >>>>> is
> >>>>>>>> sent in the body of the requests, allowing Solr to use Tika to
> >>> obtain
> >>>>> the
> >>>>>>>> raw content to be stored in Solr.
> >>>>>>>>
> >>>>>>>> Regards
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> On Mon, Jun 16, 2014 at 4:35 PM, Matteo Grolla <
> >>>>> m.grolla@sourcesense.com
> >>>>>>>>
> >>>>>>>> wrote:
> >>>>>>>>
> >>>>>>>>> Hi During my first indexing I noticed that manifold uses Solr
> >>>>> extracting
> >>>>>>>>> request handler to extract the content of an xml file
> >>>>>>>>> For performance reasons it would be better if Manifold handled
> >> the
> >>>>>>>>> extraction letting Solr do the search engine
> >>>>>>>>> Is this because of the connector design, framework design or just
> >>> to
> >>>>> be
> >>>>>>>>> done?
> >>>>>>>>>
> >>>>>>>>> --
> >>>>>>>>> Matteo Grolla
> >>>>>>>>> Sourcesense - making sense of Open Source
> >>>>>>>>> http://www.sourcesense.com
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>
> >>>>>>>> --
> >>>>>>>>
> >>>>>>>> ------------------------------
> >>>>>>>> This message should be regarded as confidential. If you have
> >>> received
> >>>>>>> this
> >>>>>>>> email in error please notify the sender and destroy it
> >> immediately.
> >>>>>>>> Statements of intent shall only become binding when confirmed in
> >>> hard
> >>>>>>> copy
> >>>>>>>> by an authorised signatory.
> >>>>>>>>
> >>>>>>>> Zaizi Ltd is registered in England and Wales with the registration
> >>>>> number
> >>>>>>>> 6440931. The Registered Office is Brook House, 229 Shepherds Bush
> >>>>> Road,
> >>>>>>>> London W6 7AN.
> >>>>>>>
> >>>>>>>
> >>>>>>
> >>>>>>
> >>>>>> --
> >>>>>> --------------------------
> >>>>>>
> >>>>>> Benedetti Alessandro
> >>>>>> Visiting card : http://about.me/alessandro_benedetti
> >>>>>>
> >>>>>> "Tyger, tyger burning bright
> >>>>>> In the forests of the night,
> >>>>>> What immortal hand or eye
> >>>>>> Could frame thy fearful symmetry?"
> >>>>>>
> >>>>>> William Blake - Songs of Experience -1794 England
> >>>>>
> >>>>>
> >>>>
> >>>
> >>
> >>
> >>
> >> --
> >> - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
> >> Shinichiro Abe
> >> 阿部 慎一朗
> >>
>
>

Re: Solr Extracting request handler

Posted by Shinichiro Abe <sh...@gmail.com>.
Hi Karl,

Yes. I thought the standard update handler would meet that requirement.
For instance, the Tika extractor transformation connector could create two files:
1. addtoSolr.xml for adds and updates
2. deletetoSolr.xml for deletes
The File connector ingests these XML files, and the Solr connector then posts them to the "/update" handler (see the sketch below).
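
Posting one of those files could then look roughly like this sketch (file name
and core URL are placeholders):

    import java.io.File;

    import org.apache.solr.client.solrj.SolrServer;
    import org.apache.solr.client.solrj.impl.HttpSolrServer;
    import org.apache.solr.client.solrj.request.ContentStreamUpdateRequest;

    // Sketch only: post a pre-built Solr XML update message to the standard
    // /update handler instead of the extracting handler.
    public class PostUpdateXmlSketch {
      public static void main(String[] args) throws Exception {
        SolrServer server = new HttpSolrServer("http://localhost:8983/solr/collection1");

        ContentStreamUpdateRequest req = new ContentStreamUpdateRequest("/update");
        req.addFile(new File("addtoSolr.xml"), "application/xml");
        req.setParam("commit", "true");

        server.request(req);
      }
    }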

In the Solr connector, then, support for update handlers other than the plain
"/update" handler might not be necessary.

Thanks,
Shinichiro Abe

On 2014/06/18, at 8:02, Karl Wright <da...@gmail.com> wrote:

> Hi Abe-san,
> 
> So just to be sure -- you believe that no changes at all are required to
> the Solr Connector as it stands now, other than to use the update handler
> rather than the /update/extract handler?
> 
> Karl
> 
> 
> 
> 
> 
> On Tue, Jun 17, 2014 at 5:14 PM, Shinichiro Abe <sh...@gmail.com>
> wrote:
> 
>>> As for changing the Solr connector so that it doesn't go to the extracting
>> update handler
>> 
>> I don't think it needs to change Solr connector with new checkbox because
>> currently we can change "/update/extract" into "/update" at 'Update
>> Handler' at Paths tab in Solr connector UI. I confirmed I could post CSV,
>> JSON and XML files to Solr by changing that and using File connector. So I
>> wish we allow Tika extractor transformation connector to create XML files
>> that Solr expects to see.
>> 
>> Regards,
>> Shinichiro Abe
>> 
>> 
>> 2014-06-18 2:55 GMT+09:00 Karl Wright <da...@gmail.com>:
>> 
>>> The pipeline code itself is now "complete" in trunk.  Zaizi said they'd
>>> contribute a Tika extractor transformation connector - and if they don't
>>> get around to that in a month or so, I may take a crack at it myself.
>>> 
>>> As for changing the Solr connector so that it doesn't go to the
>> extracting
>>> update handler, it would be great if:
>>> (1) Someone created a ticket for this, and
>>> (2) A patch was provided that maintains backwards compatibility with
>>> previous versions of the connector (so a checkbox would probably need to
>> go
>>> into the UI somewhere).  Do either of you want to start this process?
>>> 
>>> Thanks!
>>> Karl
>>> 
>>> 
>>> 
>>> On Mon, Jun 16, 2014 at 12:37 PM, Karl Wright <da...@gmail.com>
>> wrote:
>>> 
>>>> Hi guys,
>>>> 
>>>> You folks may not have looked at 1.7 yet, but it has a full pipeline,
>> and
>>>> is expected to have a Tika extractor as a transformation connector.
>>>> 
>>>> Karl
>>>> 
>>>> 
>>>> 
>>>> On Mon, Jun 16, 2014 at 11:14 AM, Matteo Grolla <
>>> m.grolla@sourcesense.com>
>>>> wrote:
>>>> 
>>>>> Thanks Alessandro,
>>>>>        that explains the situation clearly.
>>>>> And I agree that sending all the metadata as get parameter can be
>>>>> problematic
>>>>> 
>>>>> Cheers
>>>>> 
>>>>> --
>>>>> Matteo Grolla
>>>>> Sourcesense - making sense of Open Source
>>>>> http://www.sourcesense.com
>>>>> 
>>>>> Il giorno 16/giu/2014, alle ore 17:09, Alessandro Benedetti ha
>> scritto:
>>>>> 
>>>>>> mmmm the point is that right now ManifoldCF has no extractors.
>>>>>> The Repository connectors extracts directly the binary and there is
>> no
>>>>>> "Extractor Processor" yet.
>>>>>> But recently a pipe-line processor architecture has been thought (
>>>>>> https://issues.apache.org/jira/browse/CONNECTORS-959)
>>>>>> So can fit there.
>>>>>> 
>>>>>> Cheers
>>>>>> 
>>>>>> 
>>>>>> 2014-06-16 15:59 GMT+01:00 Matteo Grolla <m.grolla@sourcesense.com
>>> :
>>>>>> 
>>>>>>> Since Solr extracting request handler takes the binary and extracts
>>>>> text
>>>>>>> what is the point of not using Manifold extractor and send text and
>>>>>>> binaries to solr?
>>>>>>> I mean the end result is the same solr indexes text and stores text
>>>>>>> So if manifold supports text extraction it seems me this is the
>> place
>>>>>>> where it should be done
>>>>>>> 
>>>>>>> --
>>>>>>> Matteo Grolla
>>>>>>> Sourcesense - making sense of Open Source
>>>>>>> http://www.sourcesense.com
>>>>>>> 
>>>>>>> Il giorno 16/giu/2014, alle ore 16:51, Antonio David Perez Morales
>> ha
>>>>>>> scritto:
>>>>>>> 
>>>>>>>> Hi Matteo
>>>>>>>> 
>>>>>>>> Manifold already handles the extraction, but the only way to send
>>>>> binary
>>>>>>>> content and document metadata to Solr is using the update/extract
>>>>>>> handler,
>>>>>>>> where the metadata is sent as query parameters and the binary
>>> content
>>>>> is
>>>>>>>> sent in the body of the requests, allowing Solr to use Tika to
>>> obtain
>>>>> the
>>>>>>>> raw content to be stored in Solr.
>>>>>>>> 
>>>>>>>> Regards
>>>>>>>> 
>>>>>>>> 
>>>>>>>> On Mon, Jun 16, 2014 at 4:35 PM, Matteo Grolla <
>>>>> m.grolla@sourcesense.com
>>>>>>>> 
>>>>>>>> wrote:
>>>>>>>> 
>>>>>>>>> Hi During my first indexing I noticed that manifold uses Solr
>>>>> extracting
>>>>>>>>> request handler to extract the content of an xml file
>>>>>>>>> For performance reasons it would be better if Manifold handled
>> the
>>>>>>>>> extraction letting Solr do the search engine
>>>>>>>>> Is this because of the connector design, framework design or just
>>> to
>>>>> be
>>>>>>>>> done?
>>>>>>>>> 
>>>>>>>>> --
>>>>>>>>> Matteo Grolla
>>>>>>>>> Sourcesense - making sense of Open Source
>>>>>>>>> http://www.sourcesense.com
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>> 
>>>>>>>> --
>>>>>>>> 
>>>>>>>> ------------------------------
>>>>>>>> This message should be regarded as confidential. If you have
>>> received
>>>>>>> this
>>>>>>>> email in error please notify the sender and destroy it
>> immediately.
>>>>>>>> Statements of intent shall only become binding when confirmed in
>>> hard
>>>>>>> copy
>>>>>>>> by an authorised signatory.
>>>>>>>> 
>>>>>>>> Zaizi Ltd is registered in England and Wales with the registration
>>>>> number
>>>>>>>> 6440931. The Registered Office is Brook House, 229 Shepherds Bush
>>>>> Road,
>>>>>>>> London W6 7AN.
>>>>>>> 
>>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> --
>>>>>> --------------------------
>>>>>> 
>>>>>> Benedetti Alessandro
>>>>>> Visiting card : http://about.me/alessandro_benedetti
>>>>>> 
>>>>>> "Tyger, tyger burning bright
>>>>>> In the forests of the night,
>>>>>> What immortal hand or eye
>>>>>> Could frame thy fearful symmetry?"
>>>>>> 
>>>>>> William Blake - Songs of Experience -1794 England
>>>>> 
>>>>> 
>>>> 
>>> 
>> 
>> 
>> 
>> --
>> - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
>> Shinichiro Abe
>> 阿部 慎一朗
>> 


Re: Solr Extracting request handler

Posted by Karl Wright <da...@gmail.com>.
Hi Abe-san,

So just to be sure -- you believe that no changes at all are required to
the Solr Connector as it stands now, other than to use the update handler
rather than the /update/extract handler?

Karl





On Tue, Jun 17, 2014 at 5:14 PM, Shinichiro Abe <sh...@gmail.com>
wrote:

> >As for changing the Solr connector so that it doesn't go to the extracting
> update handler
>
> I don't think it needs to change Solr connector with new checkbox because
> currently we can change "/update/extract" into "/update" at 'Update
> Handler' at Paths tab in Solr connector UI. I confirmed I could post CSV,
> JSON and XML files to Solr by changing that and using File connector. So I
> wish we allow Tika extractor transformation connector to create XML files
> that Solr expects to see.
>
> Regards,
> Shinichiro Abe
>
>
> 2014-06-18 2:55 GMT+09:00 Karl Wright <da...@gmail.com>:
>
> > The pipeline code itself is now "complete" in trunk.  Zaizi said they'd
> > contribute a Tika extractor transformation connector - and if they don't
> > get around to that in a month or so, I may take a crack at it myself.
> >
> > As for changing the Solr connector so that it doesn't go to the
> extracting
> > update handler, it would be great if:
> > (1) Someone created a ticket for this, and
> > (2) A patch was provided that maintains backwards compatibility with
> > previous versions of the connector (so a checkbox would probably need to
> go
> > into the UI somewhere).  Do either of you want to start this process?
> >
> > Thanks!
> > Karl
> >
> >
> >
> > On Mon, Jun 16, 2014 at 12:37 PM, Karl Wright <da...@gmail.com>
> wrote:
> >
> > > Hi guys,
> > >
> > > You folks may not have looked at 1.7 yet, but it has a full pipeline,
> and
> > > is expected to have a Tika extractor as a transformation connector.
> > >
> > > Karl
> > >
> > >
> > >
> > > On Mon, Jun 16, 2014 at 11:14 AM, Matteo Grolla <
> > m.grolla@sourcesense.com>
> > > wrote:
> > >
> > >> Thanks Alessandro,
> > >>         that explains the situation clearly.
> > >> And I agree that sending all the metadata as get parameter can be
> > >> problematic
> > >>
> > >> Cheers
> > >>
> > >> --
> > >> Matteo Grolla
> > >> Sourcesense - making sense of Open Source
> > >> http://www.sourcesense.com
> > >>
> > >> Il giorno 16/giu/2014, alle ore 17:09, Alessandro Benedetti ha
> scritto:
> > >>
> > >> > mmmm the point is that right now ManifoldCF has no extractors.
> > >> > The Repository connectors extracts directly the binary and there is
> no
> > >> > "Extractor Processor" yet.
> > >> > But recently a pipe-line processor architecture has been thought (
> > >> > https://issues.apache.org/jira/browse/CONNECTORS-959)
> > >> > So can fit there.
> > >> >
> > >> > Cheers
> > >> >
> > >> >
> > >> > 2014-06-16 15:59 GMT+01:00 Matteo Grolla <m.grolla@sourcesense.com
> >:
> > >> >
> > >> >> Since Solr extracting request handler takes the binary and extracts
> > >> text
> > >> >> what is the point of not using Manifold extractor and send text and
> > >> >> binaries to solr?
> > >> >> I mean the end result is the same solr indexes text and stores text
> > >> >> So if manifold supports text extraction it seems me this is the
> place
> > >> >> where it should be done
> > >> >>
> > >> >> --
> > >> >> Matteo Grolla
> > >> >> Sourcesense - making sense of Open Source
> > >> >> http://www.sourcesense.com
> > >> >>
> > >> >> Il giorno 16/giu/2014, alle ore 16:51, Antonio David Perez Morales
> ha
> > >> >> scritto:
> > >> >>
> > >> >>> Hi Matteo
> > >> >>>
> > >> >>> Manifold already handles the extraction, but the only way to send
> > >> binary
> > >> >>> content and document metadata to Solr is using the update/extract
> > >> >> handler,
> > >> >>> where the metadata is sent as query parameters and the binary
> > content
> > >> is
> > >> >>> sent in the body of the requests, allowing Solr to use Tika to
> > obtain
> > >> the
> > >> >>> raw content to be stored in Solr.
> > >> >>>
> > >> >>> Regards
> > >> >>>
> > >> >>>
> > >> >>> On Mon, Jun 16, 2014 at 4:35 PM, Matteo Grolla <
> > >> m.grolla@sourcesense.com
> > >> >>>
> > >> >>> wrote:
> > >> >>>
> > >> >>>> Hi During my first indexing I noticed that manifold uses Solr
> > >> extracting
> > >> >>>> request handler to extract the content of an xml file
> > >> >>>> For performance reasons it would be better if Manifold handled
> the
> > >> >>>> extraction letting Solr do the search engine
> > >> >>>> Is this because of the connector design, framework design or just
> > to
> > >> be
> > >> >>>> done?
> > >> >>>>
> > >> >>>> --
> > >> >>>> Matteo Grolla
> > >> >>>> Sourcesense - making sense of Open Source
> > >> >>>> http://www.sourcesense.com
> > >> >>>>
> > >> >>>>
> > >> >>>
> > >> >>> --
> > >> >>>
> > >> >>> ------------------------------
> > >> >>> This message should be regarded as confidential. If you have
> > received
> > >> >> this
> > >> >>> email in error please notify the sender and destroy it
> immediately.
> > >> >>> Statements of intent shall only become binding when confirmed in
> > hard
> > >> >> copy
> > >> >>> by an authorised signatory.
> > >> >>>
> > >> >>> Zaizi Ltd is registered in England and Wales with the registration
> > >> number
> > >> >>> 6440931. The Registered Office is Brook House, 229 Shepherds Bush
> > >> Road,
> > >> >>> London W6 7AN.
> > >> >>
> > >> >>
> > >> >
> > >> >
> > >> > --
> > >> > --------------------------
> > >> >
> > >> > Benedetti Alessandro
> > >> > Visiting card : http://about.me/alessandro_benedetti
> > >> >
> > >> > "Tyger, tyger burning bright
> > >> > In the forests of the night,
> > >> > What immortal hand or eye
> > >> > Could frame thy fearful symmetry?"
> > >> >
> > >> > William Blake - Songs of Experience -1794 England
> > >>
> > >>
> > >
> >
>
>
>
> --
> - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
> Shinichiro Abe
> 阿部 慎一朗
>

Re: Solr Extracting request handler

Posted by Shinichiro Abe <sh...@gmail.com>.
>As for changing the Solr connector so that it doesn't go to the extracting
update handler

I don't think the Solr connector needs to change or needs a new checkbox, because
we can already change "/update/extract" into "/update" in the 'Update
Handler' field on the Paths tab of the Solr connector UI. I confirmed I could post CSV,
JSON and XML files to Solr by changing that and using the File connector. So I
would like the Tika extractor transformation connector to create XML files
in the form that Solr expects to see.

Regards,
Shinichiro Abe


2014-06-18 2:55 GMT+09:00 Karl Wright <da...@gmail.com>:

> The pipeline code itself is now "complete" in trunk.  Zaizi said they'd
> contribute a Tika extractor transformation connector - and if they don't
> get around to that in a month or so, I may take a crack at it myself.
>
> As for changing the Solr connector so that it doesn't go to the extracting
> update handler, it would be great if:
> (1) Someone created a ticket for this, and
> (2) A patch was provided that maintains backwards compatibility with
> previous versions of the connector (so a checkbox would probably need to go
> into the UI somewhere).  Do either of you want to start this process?
>
> Thanks!
> Karl
>
>
>
> On Mon, Jun 16, 2014 at 12:37 PM, Karl Wright <da...@gmail.com> wrote:
>
> > Hi guys,
> >
> > You folks may not have looked at 1.7 yet, but it has a full pipeline, and
> > is expected to have a Tika extractor as a transformation connector.
> >
> > Karl
> >
> >
> >
> > On Mon, Jun 16, 2014 at 11:14 AM, Matteo Grolla <
> m.grolla@sourcesense.com>
> > wrote:
> >
> >> Thanks Alessandro,
> >>         that explains the situation clearly.
> >> And I agree that sending all the metadata as get parameter can be
> >> problematic
> >>
> >> Cheers
> >>
> >> --
> >> Matteo Grolla
> >> Sourcesense - making sense of Open Source
> >> http://www.sourcesense.com
> >>
> >> Il giorno 16/giu/2014, alle ore 17:09, Alessandro Benedetti ha scritto:
> >>
> >> > mmmm the point is that right now ManifoldCF has no extractors.
> >> > The Repository connectors extracts directly the binary and there is no
> >> > "Extractor Processor" yet.
> >> > But recently a pipe-line processor architecture has been thought (
> >> > https://issues.apache.org/jira/browse/CONNECTORS-959)
> >> > So can fit there.
> >> >
> >> > Cheers
> >> >
> >> >
> >> > 2014-06-16 15:59 GMT+01:00 Matteo Grolla <m....@sourcesense.com>:
> >> >
> >> >> Since Solr extracting request handler takes the binary and extracts
> >> text
> >> >> what is the point of not using Manifold extractor and send text and
> >> >> binaries to solr?
> >> >> I mean the end result is the same solr indexes text and stores text
> >> >> So if manifold supports text extraction it seems me this is the place
> >> >> where it should be done
> >> >>
> >> >> --
> >> >> Matteo Grolla
> >> >> Sourcesense - making sense of Open Source
> >> >> http://www.sourcesense.com
> >> >>
> >> >> Il giorno 16/giu/2014, alle ore 16:51, Antonio David Perez Morales ha
> >> >> scritto:
> >> >>
> >> >>> Hi Matteo
> >> >>>
> >> >>> Manifold already handles the extraction, but the only way to send
> >> binary
> >> >>> content and document metadata to Solr is using the update/extract
> >> >> handler,
> >> >>> where the metadata is sent as query parameters and the binary
> content
> >> is
> >> >>> sent in the body of the requests, allowing Solr to use Tika to
> obtain
> >> the
> >> >>> raw content to be stored in Solr.
> >> >>>
> >> >>> Regards
> >> >>>
> >> >>>
> >> >>> On Mon, Jun 16, 2014 at 4:35 PM, Matteo Grolla <
> >> m.grolla@sourcesense.com
> >> >>>
> >> >>> wrote:
> >> >>>
> >> >>>> Hi During my first indexing I noticed that manifold uses Solr
> >> extracting
> >> >>>> request handler to extract the content of an xml file
> >> >>>> For performance reasons it would be better if Manifold handled the
> >> >>>> extraction letting Solr do the search engine
> >> >>>> Is this because of the connector design, framework design or just
> to
> >> be
> >> >>>> done?
> >> >>>>
> >> >>>> --
> >> >>>> Matteo Grolla
> >> >>>> Sourcesense - making sense of Open Source
> >> >>>> http://www.sourcesense.com
> >> >>>>
> >> >>>>
> >> >>>
> >> >>> --
> >> >>>
> >> >>> ------------------------------
> >> >>> This message should be regarded as confidential. If you have
> received
> >> >> this
> >> >>> email in error please notify the sender and destroy it immediately.
> >> >>> Statements of intent shall only become binding when confirmed in
> hard
> >> >> copy
> >> >>> by an authorised signatory.
> >> >>>
> >> >>> Zaizi Ltd is registered in England and Wales with the registration
> >> number
> >> >>> 6440931. The Registered Office is Brook House, 229 Shepherds Bush
> >> Road,
> >> >>> London W6 7AN.
> >> >>
> >> >>
> >> >
> >> >
> >> > --
> >> > --------------------------
> >> >
> >> > Benedetti Alessandro
> >> > Visiting card : http://about.me/alessandro_benedetti
> >> >
> >> > "Tyger, tyger burning bright
> >> > In the forests of the night,
> >> > What immortal hand or eye
> >> > Could frame thy fearful symmetry?"
> >> >
> >> > William Blake - Songs of Experience -1794 England
> >>
> >>
> >
>



-- 
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Shinichiro Abe
阿部 慎一朗

Re: Solr Extracting request handler

Posted by Karl Wright <da...@gmail.com>.
The pipeline code itself is now "complete" in trunk.  Zaizi said they'd
contribute a Tika extractor transformation connector - and if they don't
get around to that in a month or so, I may take a crack at it myself.

As for changing the Solr connector so that it doesn't go to the extracting
update handler, it would be great if:
(1) Someone created a ticket for this, and
(2) A patch was provided that maintains backwards compatibility with
previous versions of the connector (so a checkbox would probably need to go
into the UI somewhere).  Do either of you want to start this process?

Thanks!
Karl



On Mon, Jun 16, 2014 at 12:37 PM, Karl Wright <da...@gmail.com> wrote:

> Hi guys,
>
> You folks may not have looked at 1.7 yet, but it has a full pipeline, and
> is expected to have a Tika extractor as a transformation connector.
>
> Karl
>
>
>
> On Mon, Jun 16, 2014 at 11:14 AM, Matteo Grolla <m....@sourcesense.com>
> wrote:
>
>> Thanks Alessandro,
>>         that explains the situation clearly.
>> And I agree that sending all the metadata as get parameter can be
>> problematic
>>
>> Cheers
>>
>> --
>> Matteo Grolla
>> Sourcesense - making sense of Open Source
>> http://www.sourcesense.com
>>
>> Il giorno 16/giu/2014, alle ore 17:09, Alessandro Benedetti ha scritto:
>>
>> > mmmm the point is that right now ManifoldCF has no extractors.
>> > The Repository connectors extracts directly the binary and there is no
>> > "Extractor Processor" yet.
>> > But recently a pipe-line processor architecture has been thought (
>> > https://issues.apache.org/jira/browse/CONNECTORS-959)
>> > So can fit there.
>> >
>> > Cheers
>> >
>> >
>> > 2014-06-16 15:59 GMT+01:00 Matteo Grolla <m....@sourcesense.com>:
>> >
>> >> Since Solr extracting request handler takes the binary and extracts
>> text
>> >> what is the point of not using Manifold extractor and send text and
>> >> binaries to solr?
>> >> I mean the end result is the same solr indexes text and stores text
>> >> So if manifold supports text extraction it seems me this is the place
>> >> where it should be done
>> >>
>> >> --
>> >> Matteo Grolla
>> >> Sourcesense - making sense of Open Source
>> >> http://www.sourcesense.com
>> >>
>> >> Il giorno 16/giu/2014, alle ore 16:51, Antonio David Perez Morales ha
>> >> scritto:
>> >>
>> >>> Hi Matteo
>> >>>
>> >>> Manifold already handles the extraction, but the only way to send
>> binary
>> >>> content and document metadata to Solr is using the update/extract
>> >> handler,
>> >>> where the metadata is sent as query parameters and the binary content
>> is
>> >>> sent in the body of the requests, allowing Solr to use Tika to obtain
>> the
>> >>> raw content to be stored in Solr.
>> >>>
>> >>> Regards
>> >>>
>> >>>
>> >>> On Mon, Jun 16, 2014 at 4:35 PM, Matteo Grolla <
>> m.grolla@sourcesense.com
>> >>>
>> >>> wrote:
>> >>>
>> >>>> Hi During my first indexing I noticed that manifold uses Solr
>> extracting
>> >>>> request handler to extract the content of an xml file
>> >>>> For performance reasons it would be better if Manifold handled the
>> >>>> extraction letting Solr do the search engine
>> >>>> Is this because of the connector design, framework design or just to
>> be
>> >>>> done?
>> >>>>
>> >>>> --
>> >>>> Matteo Grolla
>> >>>> Sourcesense - making sense of Open Source
>> >>>> http://www.sourcesense.com
>> >>>>
>> >>>>
>> >>>
>> >>> --
>> >>>
>> >>> ------------------------------
>> >>> This message should be regarded as confidential. If you have received
>> >> this
>> >>> email in error please notify the sender and destroy it immediately.
>> >>> Statements of intent shall only become binding when confirmed in hard
>> >> copy
>> >>> by an authorised signatory.
>> >>>
>> >>> Zaizi Ltd is registered in England and Wales with the registration
>> number
>> >>> 6440931. The Registered Office is Brook House, 229 Shepherds Bush
>> Road,
>> >>> London W6 7AN.
>> >>
>> >>
>> >
>> >
>> > --
>> > --------------------------
>> >
>> > Benedetti Alessandro
>> > Visiting card : http://about.me/alessandro_benedetti
>> >
>> > "Tyger, tyger burning bright
>> > In the forests of the night,
>> > What immortal hand or eye
>> > Could frame thy fearful symmetry?"
>> >
>> > William Blake - Songs of Experience -1794 England
>>
>>
>

Re: Solr Extracting request handler

Posted by Karl Wright <da...@gmail.com>.
Hi guys,

You folks may not have looked at 1.7 yet, but it has a full pipeline, and
is expected to have a Tika extractor as a transformation connector.

Karl



On Mon, Jun 16, 2014 at 11:14 AM, Matteo Grolla <m....@sourcesense.com>
wrote:

> Thanks Alessandro,
>         that explains the situation clearly.
> And I agree that sending all the metadata as get parameter can be
> problematic
>
> Cheers
>
> --
> Matteo Grolla
> Sourcesense - making sense of Open Source
> http://www.sourcesense.com
>
> Il giorno 16/giu/2014, alle ore 17:09, Alessandro Benedetti ha scritto:
>
> > mmmm the point is that right now ManifoldCF has no extractors.
> > The Repository connectors extracts directly the binary and there is no
> > "Extractor Processor" yet.
> > But recently a pipe-line processor architecture has been thought (
> > https://issues.apache.org/jira/browse/CONNECTORS-959)
> > So can fit there.
> >
> > Cheers
> >
> >
> > 2014-06-16 15:59 GMT+01:00 Matteo Grolla <m....@sourcesense.com>:
> >
> >> Since Solr extracting request handler takes the binary and extracts text
> >> what is the point of not using Manifold extractor and send text and
> >> binaries to solr?
> >> I mean the end result is the same solr indexes text and stores text
> >> So if manifold supports text extraction it seems me this is the place
> >> where it should be done
> >>
> >> --
> >> Matteo Grolla
> >> Sourcesense - making sense of Open Source
> >> http://www.sourcesense.com
> >>
> >> Il giorno 16/giu/2014, alle ore 16:51, Antonio David Perez Morales ha
> >> scritto:
> >>
> >>> Hi Matteo
> >>>
> >>> Manifold already handles the extraction, but the only way to send
> binary
> >>> content and document metadata to Solr is using the update/extract
> >> handler,
> >>> where the metadata is sent as query parameters and the binary content
> is
> >>> sent in the body of the requests, allowing Solr to use Tika to obtain
> the
> >>> raw content to be stored in Solr.
> >>>
> >>> Regards
> >>>
> >>>
> >>> On Mon, Jun 16, 2014 at 4:35 PM, Matteo Grolla <
> m.grolla@sourcesense.com
> >>>
> >>> wrote:
> >>>
> >>>> Hi During my first indexing I noticed that manifold uses Solr
> extracting
> >>>> request handler to extract the content of an xml file
> >>>> For performance reasons it would be better if Manifold handled the
> >>>> extraction letting Solr do the search engine
> >>>> Is this because of the connector design, framework design or just to
> be
> >>>> done?
> >>>>
> >>>> --
> >>>> Matteo Grolla
> >>>> Sourcesense - making sense of Open Source
> >>>> http://www.sourcesense.com
> >>>>
> >>>>
> >>>
> >>> --
> >>>
> >>> ------------------------------
> >>> This message should be regarded as confidential. If you have received
> >> this
> >>> email in error please notify the sender and destroy it immediately.
> >>> Statements of intent shall only become binding when confirmed in hard
> >> copy
> >>> by an authorised signatory.
> >>>
> >>> Zaizi Ltd is registered in England and Wales with the registration
> number
> >>> 6440931. The Registered Office is Brook House, 229 Shepherds Bush Road,
> >>> London W6 7AN.
> >>
> >>
> >
> >
> > --
> > --------------------------
> >
> > Benedetti Alessandro
> > Visiting card : http://about.me/alessandro_benedetti
> >
> > "Tyger, tyger burning bright
> > In the forests of the night,
> > What immortal hand or eye
> > Could frame thy fearful symmetry?"
> >
> > William Blake - Songs of Experience -1794 England
>
>

Re: Solr Extracting request handler

Posted by Matteo Grolla <m....@sourcesense.com>.
Thanks Alessandro,
	that explains the situation clearly.
And I agree that sending all the metadata as query parameters can be problematic

Cheers 

-- 
Matteo Grolla
Sourcesense - making sense of Open Source
http://www.sourcesense.com

Il giorno 16/giu/2014, alle ore 17:09, Alessandro Benedetti ha scritto:

> mmmm the point is that right now ManifoldCF has no extractors.
> The Repository connectors extracts directly the binary and there is no
> "Extractor Processor" yet.
> But recently a pipe-line processor architecture has been thought (
> https://issues.apache.org/jira/browse/CONNECTORS-959)
> So can fit there.
> 
> Cheers
> 
> 
> 2014-06-16 15:59 GMT+01:00 Matteo Grolla <m....@sourcesense.com>:
> 
>> Since Solr extracting request handler takes the binary and extracts text
>> what is the point of not using Manifold extractor and send text and
>> binaries to solr?
>> I mean the end result is the same solr indexes text and stores text
>> So if manifold supports text extraction it seems me this is the place
>> where it should be done
>> 
>> --
>> Matteo Grolla
>> Sourcesense - making sense of Open Source
>> http://www.sourcesense.com
>> 
>> Il giorno 16/giu/2014, alle ore 16:51, Antonio David Perez Morales ha
>> scritto:
>> 
>>> Hi Matteo
>>> 
>>> Manifold already handles the extraction, but the only way to send binary
>>> content and document metadata to Solr is using the update/extract
>> handler,
>>> where the metadata is sent as query parameters and the binary content is
>>> sent in the body of the requests, allowing Solr to use Tika to obtain the
>>> raw content to be stored in Solr.
>>> 
>>> Regards
>>> 
>>> 
>>> On Mon, Jun 16, 2014 at 4:35 PM, Matteo Grolla <m.grolla@sourcesense.com
>>> 
>>> wrote:
>>> 
>>>> Hi During my first indexing I noticed that manifold uses Solr extracting
>>>> request handler to extract the content of an xml file
>>>> For performance reasons it would be better if Manifold handled the
>>>> extraction letting Solr do the search engine
>>>> Is this because of the connector design, framework design or just to be
>>>> done?
>>>> 
>>>> --
>>>> Matteo Grolla
>>>> Sourcesense - making sense of Open Source
>>>> http://www.sourcesense.com
>>>> 
>>>> 
>>> 
>>> --
>>> 
>>> ------------------------------
>>> This message should be regarded as confidential. If you have received
>> this
>>> email in error please notify the sender and destroy it immediately.
>>> Statements of intent shall only become binding when confirmed in hard
>> copy
>>> by an authorised signatory.
>>> 
>>> Zaizi Ltd is registered in England and Wales with the registration number
>>> 6440931. The Registered Office is Brook House, 229 Shepherds Bush Road,
>>> London W6 7AN.
>> 
>> 
> 
> 
> -- 
> --------------------------
> 
> Benedetti Alessandro
> Visiting card : http://about.me/alessandro_benedetti
> 
> "Tyger, tyger burning bright
> In the forests of the night,
> What immortal hand or eye
> Could frame thy fearful symmetry?"
> 
> William Blake - Songs of Experience -1794 England


Re: Solr Extracting request handler

Posted by Alessandro Benedetti <be...@gmail.com>.
Mmmm, the point is that right now ManifoldCF has no extractors.
The repository connectors fetch the binary directly, and there is no
"Extractor Processor" yet.
But a pipeline processor architecture has recently been proposed (
https://issues.apache.org/jira/browse/CONNECTORS-959),
so an extractor could fit there.

Cheers


2014-06-16 15:59 GMT+01:00 Matteo Grolla <m....@sourcesense.com>:

> Since Solr extracting request handler takes the binary and extracts text
> what is the point of not using Manifold extractor and send text and
> binaries to solr?
> I mean the end result is the same solr indexes text and stores text
> So if manifold supports text extraction it seems me this is the place
> where it should be done
>
> --
> Matteo Grolla
> Sourcesense - making sense of Open Source
> http://www.sourcesense.com
>
> Il giorno 16/giu/2014, alle ore 16:51, Antonio David Perez Morales ha
> scritto:
>
> > Hi Matteo
> >
> > Manifold already handles the extraction, but the only way to send binary
> > content and document metadata to Solr is using the update/extract
> handler,
> > where the metadata is sent as query parameters and the binary content is
> > sent in the body of the requests, allowing Solr to use Tika to obtain the
> > raw content to be stored in Solr.
> >
> > Regards
> >
> >
> > On Mon, Jun 16, 2014 at 4:35 PM, Matteo Grolla <m.grolla@sourcesense.com
> >
> > wrote:
> >
> >> Hi During my first indexing I noticed that manifold uses Solr extracting
> >> request handler to extract the content of an xml file
> >> For performance reasons it would be better if Manifold handled the
> >> extraction letting Solr do the search engine
> >> Is this because of the connector design, framework design or just to be
> >> done?
> >>
> >> --
> >> Matteo Grolla
> >> Sourcesense - making sense of Open Source
> >> http://www.sourcesense.com
> >>
> >>
> >
> > --
> >
> > ------------------------------
> > This message should be regarded as confidential. If you have received
> this
> > email in error please notify the sender and destroy it immediately.
> > Statements of intent shall only become binding when confirmed in hard
> copy
> > by an authorised signatory.
> >
> > Zaizi Ltd is registered in England and Wales with the registration number
> > 6440931. The Registered Office is Brook House, 229 Shepherds Bush Road,
> > London W6 7AN.
>
>


-- 
--------------------------

Benedetti Alessandro
Visiting card : http://about.me/alessandro_benedetti

"Tyger, tyger burning bright
In the forests of the night,
What immortal hand or eye
Could frame thy fearful symmetry?"

William Blake - Songs of Experience -1794 England

Re: Solr Extracting request handler

Posted by Matteo Grolla <m....@sourcesense.com>.
Since the Solr extracting request handler takes the binary and extracts the text,
what is the point of not using a Manifold extractor and sending the extracted text, rather than the binary, to Solr?
I mean, the end result is the same: Solr indexes the text and stores the text.
So if Manifold supports text extraction, it seems to me this is the place where it should be done.

-- 
Matteo Grolla
Sourcesense - making sense of Open Source
http://www.sourcesense.com

Il giorno 16/giu/2014, alle ore 16:51, Antonio David Perez Morales ha scritto:

> Hi Matteo
> 
> Manifold already handles the extraction, but the only way to send binary
> content and document metadata to Solr is using the update/extract handler,
> where the metadata is sent as query parameters and the binary content is
> sent in the body of the requests, allowing Solr to use Tika to obtain the
> raw content to be stored in Solr.
> 
> Regards
> 
> 
> On Mon, Jun 16, 2014 at 4:35 PM, Matteo Grolla <m....@sourcesense.com>
> wrote:
> 
>> Hi During my first indexing I noticed that manifold uses Solr extracting
>> request handler to extract the content of an xml file
>> For performance reasons it would be better if Manifold handled the
>> extraction letting Solr do the search engine
>> Is this because of the connector design, framework design or just to be
>> done?
>> 
>> --
>> Matteo Grolla
>> Sourcesense - making sense of Open Source
>> http://www.sourcesense.com
>> 
>> 
> 
> -- 
> 
> ------------------------------
> This message should be regarded as confidential. If you have received this 
> email in error please notify the sender and destroy it immediately. 
> Statements of intent shall only become binding when confirmed in hard copy 
> by an authorised signatory.
> 
> Zaizi Ltd is registered in England and Wales with the registration number 
> 6440931. The Registered Office is Brook House, 229 Shepherds Bush Road, 
> London W6 7AN. 


Re: Solr Extracting request handler

Posted by Alessandro Benedetti <be...@gmail.com>.
As the brilliant engineer who preceded me wrote, it was a design
choice.
In my opinion this is a strong limitation, as I would prefer to delegate the
extraction task to an intermediate processor instead of relying on Solr.
Furthermore, I don't like having to send all the metadata in the header
(and this can cause problems with the header size the server accepts if
too much metadata is extracted).
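For instance (an invented example, not an actual request from the connector), each
metadata value ends up as one more query parameter on the request line, something like

  POST /solr/update/extract?literal.id=...&literal.title=...&literal.author=...&literal.keywords=...

and common servlet containers only accept a few kilobytes (often 8 KB by default) for
the request line and headers combined, so a document with a lot of extracted metadata
can hit that limit.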

Cheers




2014-06-16 15:51 GMT+01:00 Antonio David Perez Morales <ap...@zaizi.com>:

> Hi Matteo
>
> Manifold already handles the extraction, but the only way to send binary
> content and document metadata to Solr is using the update/extract handler,
> where the metadata is sent as query parameters and the binary content is
> sent in the body of the requests, allowing Solr to use Tika to obtain the
> raw content to be stored in Solr.
>
> Regards
>
>
> On Mon, Jun 16, 2014 at 4:35 PM, Matteo Grolla <m....@sourcesense.com>
> wrote:
>
> > Hi During my first indexing I noticed that manifold uses Solr extracting
> > request handler to extract the content of an xml file
> > For performance reasons it would be better if Manifold handled the
> > extraction letting Solr do the search engine
> > Is this because of the connector design, framework design or just to be
> > done?
> >
> > --
> > Matteo Grolla
> > Sourcesense - making sense of Open Source
> > http://www.sourcesense.com
> >
> >
>
> --
>
> ------------------------------
> This message should be regarded as confidential. If you have received this
> email in error please notify the sender and destroy it immediately.
> Statements of intent shall only become binding when confirmed in hard copy
> by an authorised signatory.
>
> Zaizi Ltd is registered in England and Wales with the registration number
> 6440931. The Registered Office is Brook House, 229 Shepherds Bush Road,
> London W6 7AN.
>



-- 
--------------------------

Benedetti Alessandro
Visiting card : http://about.me/alessandro_benedetti

"Tyger, tyger burning bright
In the forests of the night,
What immortal hand or eye
Could frame thy fearful symmetry?"

William Blake - Songs of Experience -1794 England

Re: Solr Extracting request handler

Posted by Antonio David Perez Morales <ap...@zaizi.com>.
Hi Matteo

Manifold already handles the extraction, but the only way to send binary
content and document metadata to Solr is using the update/extract handler,
where the metadata is sent as query parameters and the binary content is
sent in the body of the request, allowing Solr to use Tika to obtain the
raw content to be stored in Solr.
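
Roughly, with SolrJ the mechanism looks like this (just a sketch with made-up field
names and URL, not the actual connector code):

  import java.io.File;

  import org.apache.solr.client.solrj.SolrServer;
  import org.apache.solr.client.solrj.impl.HttpSolrServer;
  import org.apache.solr.client.solrj.request.ContentStreamUpdateRequest;

  public class ExtractingPostSketch {
    public static void main(String[] args) throws Exception {
      SolrServer server = new HttpSolrServer("http://localhost:8983/solr/collection1");

      // The binary document travels in the body of the request...
      ContentStreamUpdateRequest req = new ContentStreamUpdateRequest("/update/extract");
      req.addFile(new File("report.pdf"), "application/pdf");

      // ...while each metadata value becomes a query parameter.
      req.setParam("literal.id", "file:///shared/docs/report.pdf");
      req.setParam("literal.author", "someone");
      req.setParam("commit", "true");

      server.request(req);
      server.shutdown();
    }
  }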

Regards


On Mon, Jun 16, 2014 at 4:35 PM, Matteo Grolla <m....@sourcesense.com>
wrote:

> Hi During my first indexing I noticed that manifold uses Solr extracting
> request handler to extract the content of an xml file
> For performance reasons it would be better if Manifold handled the
> extraction letting Solr do the search engine
> Is this because of the connector design, framework design or just to be
> done?
>
> --
> Matteo Grolla
> Sourcesense - making sense of Open Source
> http://www.sourcesense.com
>
>

-- 

------------------------------
This message should be regarded as confidential. If you have received this 
email in error please notify the sender and destroy it immediately. 
Statements of intent shall only become binding when confirmed in hard copy 
by an authorised signatory.

Zaizi Ltd is registered in England and Wales with the registration number 
6440931. The Registered Office is Brook House, 229 Shepherds Bush Road, 
London W6 7AN.