Posted to solr-user@lucene.apache.org by Liz Sommers <li...@gmail.com> on 2014/03/21 17:56:35 UTC

SolrCell and indexing HTML

I am trying to write a POC that indexes URLs with Solr using SolrJ and
Solr Cell.  (The code is written in Groovy.)

The relevant code is here:

ContentStreamUpdateRequest req = new ContentStreamUpdateRequest("/update/extract")

req.setParam("literal.id", p.id.toString())
req.setParam("extractOnly", "true")
URL url = new URL(p.url)
ContentStream stream = new ContentStreamBase.URLStream(url)
req.addContentStream(stream)

def result = server.request(req)
println "result: ${result}"


When I set extractOnly to true, I get everything at the URL back: all the
tags, all the stylesheets.  When I set it to false, I get a response that
has nothing in it except

result: {responseHeader={status=0,QTime=19}}

When I check with the admin tools, nothing from the URL has been indexed,
as far as I can tell.
I know I am doing something wrong with the params, but I haven't figured
out what.  Can somebody please help me?

Thanks
Liz Sommers
lizzysom@gmail.com
lizsworks@gmail.com

Re: SolrCell and indexing HTML

Posted by Greg Walters <gr...@answers.com>.

I've never tried indexing via Groovy or using Solr Cell, but I think you might be working at too low a level in SolrJ if you're just adding documents. You might try checking out https://wiki.apache.org/solr/Solrj#Adding_Data_to_Solr, though I might be way off base :)
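
For reference, the higher-level SolrJ route described on that wiki page looks roughly like the sketch below. It adds fields you supply yourself and does not run Solr Cell extraction, so it only fits if the page content is fetched and parsed on the client side. The server URL and the field names "id" and "url" are assumptions, and SolrJ 4.x is assumed to be on the classpath.

```groovy
import org.apache.solr.client.solrj.impl.HttpSolrServer
import org.apache.solr.common.SolrInputDocument

// Hypothetical server URL; substitute your own.
def server = new HttpSolrServer("http://localhost:8983/solr")

// Build the document field by field. "id" and "url" are assumed to be
// fields defined in your schema.
def doc = new SolrInputDocument()
doc.addField("id", "1")
doc.addField("url", "http://example.com/")

// Send the document, then commit so it becomes visible to searches.
server.add(doc)
server.commit()
```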

Thanks,
Greg

Re: SolrCell and indexing HTML

Posted by Jack Krupansky <ja...@basetechnology.com>.

The extractOnly option simply tells you what the raw metadata is, while
normal non-extractOnly mode indexes the metadata exactly as you have
requested it to be indexed. You haven't shown us any parameters that
describe how you want the metadata indexed. If you didn't specify any
mapping, it was probably all thrown away.

Read the tutorial on Solr Cell if you are not yet aware of how to map 
metadata:
https://cwiki.apache.org/confluence/display/solr/Uploading+Data+with+Solr+Cell+using+Apache+Tika

Or read that chapter in my e-book! It has lots of examples, especially for 
the various mapping parameters.
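
As a rough sketch of what such mapping parameters can look like from SolrJ (not taken from the tutorial or the e-book): the version below leaves extractOnly off, maps Tika's extracted body text to a field with fmap.content, gives unknown metadata fields a prefix with uprefix, and asks for a commit. The field name "text", the "attr_" prefix, the server URL, and the sample values in p are assumptions modelled on Solr's example schema; SolrJ 4.x is assumed to be on the classpath.

```groovy
import org.apache.solr.client.solrj.impl.HttpSolrServer
import org.apache.solr.client.solrj.request.AbstractUpdateRequest
import org.apache.solr.client.solrj.request.ContentStreamUpdateRequest
import org.apache.solr.common.util.ContentStreamBase

// Hypothetical server URL and page record; substitute your own.
def server = new HttpSolrServer("http://localhost:8983/solr")
def p = [id: 1, url: "http://example.com/"]

def req = new ContentStreamUpdateRequest("/update/extract")

// literal.<field> sets a field value directly on the indexed document.
req.setParam("literal.id", p.id.toString())

// fmap.<source>=<target> renames a Tika field; "content" is the extracted
// body text. "text" is an assumed target field that must exist in the schema.
req.setParam("fmap.content", "text")

// uprefix prefixes fields that are not defined in the schema. "attr_"
// assumes a matching dynamic field such as attr_* is defined.
req.setParam("uprefix", "attr_")

req.addContentStream(new ContentStreamBase.URLStream(new URL(p.url)))

// Commit so the document becomes searchable (assumes no autoCommit is
// configured on the server).
req.setAction(AbstractUpdateRequest.ACTION.COMMIT, true, true)

def result = server.request(req)
println "result: ${result}"
```

The response to an indexing request is still expected to be little more than a responseHeader, so a query on the id in the admin UI is one way to check whether the document actually went in.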

-- Jack Krupansky
