Posted to dev@lucene.apache.org by Chris Hostetter <ho...@fucit.org> on 2013/02/13 20:53:28 UTC

RE: Solrj/Tika question about content types

: questions still apply: since Tika apparently cares deeply about 
: content-type now, what content-type can I supply through SolrJ to tell 
: it 'please discover the document type on your own'?  And how do I do 
: that through SolrJ?

SolrJ sets the Content-Type header based on what is returned by the 
getContentType() method of the ContentStream -- the default is 
"application/octet-stream" if getContentType() returns null.
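
As a rough illustration of the fallback described above, here is a 
stand-alone sketch. It is not SolrJ's actual source: the Stream interface, 
the headerFor helper and the DEFAULT_TYPE constant are hypothetical 
stand-ins, and the "text/plain; charset=Shift_JIS" value is only an example 
of a type a caller might declare.

```java
// Stand-alone illustration; these types are hypothetical and are NOT SolrJ classes.
public class ContentTypeFallback {

    /** Minimal stand-in for a stream that may or may not know its content type. */
    interface Stream {
        String getContentType(); // may return null when the type is unknown
    }

    /** Fallback used when the stream does not declare a type. */
    static final String DEFAULT_TYPE = "application/octet-stream";

    /** Returns the Content-Type header value to send for the given stream. */
    static String headerFor(Stream stream) {
        String declared = stream.getContentType();
        return declared != null ? declared : DEFAULT_TYPE;
    }

    public static void main(String[] args) {
        Stream unknown = () -> null;                               // no declared type
        Stream sjisText = () -> "text/plain; charset=Shift_JIS";   // example declared type
        System.out.println(headerFor(unknown));
        System.out.println(headerFor(sjisText));
    }
}
```

With real SolrJ code the equivalent step would be to make sure the 
ContentStream you hand to the request returns a non-null value from 
getContentType() whenever the source knows the type.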

: (1) Does the getContentType() method actually even get used on Solrj?  
: When I looked at wire logging, it seemed that Solrj just posts a generic 
: "application/xml; charset=UTF-8" content type, and does not transmit 
: anything else.  It uses standard POST, not multipart/form POST, also.

Even in the case of a single ContentStream (so no multi-part) it still 
uses ContentStream.getContentType() ... can you provide a test case (or 
quick and dirty sample code) that demonstrates "application/xml; 
charset=UTF-8" getting sent over the wire even though you explicitly 
provide a different content type in the ContentStream?


-Hoss

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


RE: Solrj/Tika question about content types

Posted by ka...@nokia.com.
Wow, Hoss, this post was so long ago I barely remember writing it. ;-)

The problem we were having is not that the content type is not set in SolrJ; it's that SolrCell does not discover it the way it did when we used multipart posts and ran with Solr 3.6.  We still aren't sure which change broke Tika's content-type discovery, or whether that change is in Tika or in Solr, but we did set the content type in the content stream from the source, where possible, and that helped enormously.

The specific test case we had was an SJIS text file, which in Solr 3.6 is properly discovered to be SJIS, while in Solr 4.1 it is only discovered to be SJIS if we set a content type other than application/octet-stream.

Karl


