Posted to solr-user@lucene.apache.org by dhamu <dh...@gmail.com> on 2010/02/04 10:47:47 UTC
How to send web pages(urls) to solr cell via solrj?
Hi,
I am new to Solr and have been exploring it for the last few days.
I am using Solr Cell with Tika for parsing, indexing, and searching,
and I post rich-text documents via SolrJ.
My actual requirement is to index web pages by URL
(e.g. http://www.apache.org) instead of local documents (PDF, DOC, and DOCX).
For example, instead of
req.addFile(new File("docs/mailing_lists.html"));
I would like to write something like
req.url(new urlconnection("http://www.apache.org"))
Is there anything like this in SolrJ?
I am currently testing with curl, and it works fine:
curl "http://localhost:8983/solr/update/extract?literal.id=doc1&uprefix=attr_&fmap.content=attr_content&commit=true" \
  -F "stream.url=http://wiki.apache.org/solr/SolrConfigXml"
However, I need to do this without curl.
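For reference, the same request can be assembled in plain Java without SolrJ. The following is only a sketch under a few assumptions: the class and method names (ExtractUrlBuilder, buildExtractUrl, send) are made up for illustration, stream.url is passed as a query parameter rather than as a form field, and the server is assumed to accept the request over GET with remote streaming enabled (which the working curl command suggests).

```java
import java.io.IOException;
import java.io.UnsupportedEncodingException;
import java.net.HttpURLConnection;
import java.net.URL;
import java.net.URLEncoder;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of Solr or SolrJ.
class ExtractUrlBuilder {

    // Builds the /update/extract request URL with the same parameters as the curl command.
    static String buildExtractUrl(String solrBase, String docId, String pageUrl) {
        Map<String, String> params = new LinkedHashMap<String, String>();
        params.put("literal.id", docId);
        params.put("uprefix", "attr_");
        params.put("fmap.content", "attr_content");
        params.put("commit", "true");
        // Assumption: the server fetches this URL itself (remote streaming).
        params.put("stream.url", pageUrl);

        StringBuilder sb = new StringBuilder(solrBase).append("/update/extract");
        char separator = '?';
        for (Map.Entry<String, String> e : params.entrySet()) {
            sb.append(separator).append(encode(e.getKey())).append('=').append(encode(e.getValue()));
            separator = '&';
        }
        return sb.toString();
    }

    // URL-encodes a single parameter name or value as UTF-8.
    private static String encode(String value) {
        try {
            return URLEncoder.encode(value, "UTF-8");
        } catch (UnsupportedEncodingException ex) {
            throw new IllegalStateException("UTF-8 is always supported", ex);
        }
    }

    // Issues a GET request to the given URL and returns the HTTP status code.
    static int send(String requestUrl) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) new URL(requestUrl).openConnection();
        try {
            return conn.getResponseCode();
        } finally {
            conn.disconnect();
        }
    }

    public static void main(String[] args) {
        // Only prints the URL; call send(...) against a running Solr instance to index the page.
        System.out.println(buildExtractUrl(
                "http://localhost:8983/solr", "doc1", "http://wiki.apache.org/solr/SolrConfigXml"));
    }
}
```

Encoding each parameter separately keeps the "://" and "/" characters of the page URL from being confused with the request URL itself.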
The SolrJ code below works fine for indexing and searching local documents,
but I want to post URLs instead.
Here is my code:
// Connect to the local Solr instance
String url = "http://localhost:8983/solr";
SolrServer server = new CommonsHttpSolrServer(url);
// Send a local file to the Solr Cell (Tika) extract handler
ContentStreamUpdateRequest req = new ContentStreamUpdateRequest("/update/extract");
req.addFile(new File("docs/mailing_lists.html"));
req.setParam("literal.id", "index1");
req.setParam("uprefix", "attr_");
req.setParam("fmap.content", "attr_content");
// Commit immediately so the document is searchable
req.setAction(AbstractUpdateRequest.ACTION.COMMIT, true, true);
NamedList<Object> result = server.request(req);
assertNotNull("Couldn't upload mailing_lists.html", result);
// Verify that exactly one document is in the index
QueryResponse rsp = server.query(new SolrQuery("*:*"));
Assert.assertEquals(1, rsp.getResults().getNumFound());
Any suggestions or answers would be appreciated.
--
View this message in context: http://old.nabble.com/How-to-send-web-pages%28urls%29-to-solr-cell-via-solrj--tp27450083p27450083.html
Sent from the Solr - User mailing list archive at Nabble.com.
Re: How to send web pages(urls) to solr cell via solrj?
Posted by Jan Høydahl / Cominvent <ja...@cominvent.com>.
Hi,
I have not tried this, but could you not read the URL on the client side and pass it to SolrJ as a ContentStream?
ContentStream urlStream = new ContentStreamBase.URLStream(new URL("http://my.site/file.html"));
req.addContentStream(urlStream);
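Another way to read the URL on the client side, sketched here with only the JDK, is to download the page to a temporary file and then hand that file to the req.addFile(...) call from the original code. The class and method names (PageFetcher, downloadToTempFile) and the ".html" suffix are assumptions made for illustration; this has not been tested against Solr.

```java
import java.io.File;
import java.io.IOException;
import java.io.InputStream;
import java.net.URL;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Hypothetical helper, not part of Solr or SolrJ.
class PageFetcher {

    // Downloads the content behind pageUrl into a new temporary file and returns it.
    static File downloadToTempFile(URL pageUrl) throws IOException {
        // Assumption: the page is HTML, so the temp file gets an .html suffix.
        Path target = Files.createTempFile("solr-page-", ".html");
        try (InputStream in = pageUrl.openStream()) {
            // createTempFile already created the file, so overwrite it.
            Files.copy(in, target, StandardCopyOption.REPLACE_EXISTING);
        }
        return target.toFile();
    }

    public static void main(String[] args) throws IOException {
        File page = downloadToTempFile(new URL("http://www.apache.org"));
        System.out.println("Saved " + page.length() + " bytes to " + page);
        // The file could then be posted as in the original code:
        //   req.addFile(page);
    }
}
```

The trade-off is an extra copy on local disk, so the temporary file should be deleted once the update request has completed.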
--
Jan Høydahl - search architect
Cominvent AS - www.cominvent.com