Posted to solr-user@lucene.apache.org by Jamie Johnson <je...@gmail.com> on 2013/09/05 23:28:09 UTC

Solr Cell Question

Is it possible to configure Solr Cell to extract and store only the body of
a document when indexing?  I'm currently doing the following, which I
thought would work:

ModifiableSolrParams params = new ModifiableSolrParams();
// Field that receives content when no other mapping applies.
params.set("defaultField", "content");
// Restrict extraction to nodes under the XHTML body.
params.set("xpath", "/xhtml:html/xhtml:body/descendant::node()");

ContentStreamUpdateRequest up = new ContentStreamUpdateRequest("/update/extract");
up.setParams(params);

FileStream f = new FileStream(new File(".."));
up.addContentStream(f);
up.setAction(ACTION.COMMIT, true, true);

solrServer.request(up);


But the resulting content field is as follows:

<arr name="content_mvtxt">
<str/>
<str>null</str>
<str>ISO-8859-1</str>
<str>text/plain; charset=ISO-8859-1</str>
<str>Just a little test</str>
</arr>


What I had hoped for was just:

<arr name="content_mvtxt">
<str>Just a little test</str>
</arr>

Re: Solr Cell Question

Posted by Jamie Johnson <je...@gmail.com>.
Thanks, Erick. This is how I was doing it, but when I saw the Solr Cell
stuff I figured I'd give it a go. What I ended up doing is the following:

ModifiableSolrParams params = indexer.index(artifact);
// Map the extracted "content" field onto my own field.
params.add("fmap.content", "my_custom_field");
params.add("extractFormat", "text");

ContentStreamUpdateRequest up = new ContentStreamUpdateRequest("/update/extract");
up.setParams(params);

FileStream f = new FileStream(new File(""));
up.addContentStream(f);
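
For anyone not going through SolrJ, here is a minimal sketch of the same two
parameters expressed as a query string for a plain HTTP POST to the extract
handler. Only the parameter names and values come from the snippet above; the
base URL, the class name, and the buildQuery helper are assumptions made for
illustration, so adjust them to your own setup.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

public class ExtractRequestUrl {

    // Hypothetical base URL; adjust host, port, and core to your setup.
    static final String BASE_URL = "http://localhost:8983/solr/update/extract";

    // Joins the parameters into a URL-encoded query string, keeping insertion order.
    // Uses the URLEncoder.encode(String, Charset) overload, which needs Java 10+.
    static String buildQuery(Map<String, String> params) {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<String, String> e : params.entrySet()) {
            if (sb.length() > 0) {
                sb.append('&');
            }
            sb.append(URLEncoder.encode(e.getKey(), StandardCharsets.UTF_8))
              .append('=')
              .append(URLEncoder.encode(e.getValue(), StandardCharsets.UTF_8));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        Map<String, String> params = new LinkedHashMap<>();
        params.put("fmap.content", "my_custom_field");
        params.put("extractFormat", "text");
        // Print the full request URL; the file itself would go in the POST body.
        System.out.println(BASE_URL + "?" + buildQuery(params));
    }
}
```

The helper only builds the URL. Sending the document is left out on purpose,
since how you stream the file depends on the HTTP client you use.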



Re: Solr Cell Question

Posted by Erick Erickson <er...@gmail.com>.
It's always frustrating when someone replies with "Why not do it
a completely different way?".  But I will anyway :).

There's no requirement at all that you send things to Solr to make
Solr Cell (aka Tika) do its tricks. Since you're already in SolrJ
anyway, why not just parse on the client? This has the advantage
of offloading the Tika processing, which can be quite expensive,
from Solr. You can use the same Tika jars that come with Solr or
download whichever version you want from the Tika project. That
way, you can exercise much better control over what's done.

Here's a skeletal program with indexing from a DB mixed in, but
it shouldn't be hard at all to pull the DB parts out.

http://searchhub.org/dev/2012/02/14/indexing-with-solrj/

FWIW,
Erick

