Posted to solr-user@lucene.apache.org by Alexander Cougarman <ac...@bwc.org> on 2012/08/17 18:17:37 UTC

How to get raw text of a document

Hi. I asked this on the Tika group and the recommendation was to ask it here. I am using the following C# code to call Tika and would like it to return the raw text without any XML or JSON. So if the Word document contains "Hello World", this should return only that text and no XML or anything else to wrap it in -- just the raw text.

This code returns JSON of the XML, which in turn contains the text of the document. I need it to return the raw text only, no XML. Thanks.

var url = @"http://localhost:8983/solr/update/extract";

var client = new WebClient();
client.QueryString.Add("extractOnly","true");
client.QueryString.Add("wt","json");
var data = client.UploadFile(url, "input.txt"); 
var json = ASCIIEncoding.ASCII.GetString(data);



Sincerely,
Alex

Re: How to get raw text of a document

Posted by Jack Krupansky <ja...@basetechnology.com>.

The Javadocs should have the full list of the response writers included 
with Solr, but people can also write their own.
http://lucene.apache.org/solr/api-4_0_0-BETA/org/apache/solr/response/QueryResponseWriter.html

More info in general and wikis for specific response writers:
http://wiki.apache.org/solr/QueryResponseWriter#List_of_Writers_Available

-- Jack Krupansky

-----Original Message----- 
From: Alexander Cougarman
Sent: Saturday, August 18, 2012 11:02 AM
To: 'solr-user@lucene.apache.org'
Subject: RE: How to get raw text of a document

Thanks, Jack. Where can I get a list of the "response writers" available 
with Solr?

Sincerely,
Alex

-----Original Message-----
From: Jack Krupansky [mailto:jack@basetechnology.com]
Sent: 17 August 2012 7:45 PM
To: solr-user@lucene.apache.org
Subject: Re: How to get raw text of a document

You need a "response writer" that returns only text. The "wt" parameter 
selects the response writer. You specified "json", so that's what you got.
Maybe "csv" would be closer to what you want.

-- Jack Krupansky

RE: How to get raw text of a document

Posted by Alexander Cougarman <ac...@bwc.org>.

Thanks, Jack. Where can I get a list of the "response writers" available with Solr? 

Sincerely,
Alex 

-----Original Message-----
From: Jack Krupansky [mailto:jack@basetechnology.com] 
Sent: 17 August 2012 7:45 PM
To: solr-user@lucene.apache.org
Subject: Re: How to get raw text of a document

You need a "response writer" that returns only text. The "wt" parameter selects the response writer. You specified "json", so that's what you got. 
Maybe "csv" would be closer to what you want.

-- Jack Krupansky

Re: How to get raw text of a document

Posted by Jack Krupansky <ja...@basetechnology.com>.

You need a "response writer" that returns only text. The "wt" parameter 
selects the response writer. You specified "json", so that's what you got. 
Maybe "csv" would be closer to what you want.

-- Jack Krupansky
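
A minimal sketch of how the request parameters discussed in this thread fit together. It only builds the query string, so it can be checked without a running Solr. The class and method names (ExtractUrl, Build) are hypothetical. The "extractFormat" parameter is an assumption not mentioned in the thread: the extracting request handler documents it as an option for extract-only requests ("xml" by default, "text" for plain text), but check that your Solr version supports it. Even with it, the extracted text is probably still wrapped in whatever envelope the "wt" response writer produces, so the client would still have to pick the text out of that envelope.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class ExtractUrl
{
    // Builds an extract-only request URL.
    // "wt" selects the response writer for the outer envelope.
    // "extractFormat" is an assumed parameter (see the note above) that asks
    // for plain text instead of XHTML inside that envelope.
    public static string Build(string baseUrl, string wt, string extractFormat)
    {
        var query = new List<KeyValuePair<string, string>>
        {
            new KeyValuePair<string, string>("extractOnly", "true"),
            new KeyValuePair<string, string>("wt", wt),
            new KeyValuePair<string, string>("extractFormat", extractFormat),
        };

        // Escape each name and value so the URL stays valid for any input.
        var pairs = query.Select(p =>
            Uri.EscapeDataString(p.Key) + "=" + Uri.EscapeDataString(p.Value));

        return baseUrl + "?" + string.Join("&", pairs);
    }
}
```

With the WebClient code from the original post, the same effect would come from adding one more line, client.QueryString.Add("extractFormat","text"), next to the existing "extractOnly" and "wt" lines.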
