Posted to solr-user@lucene.apache.org by 122jxgcn <yw...@gmail.com> on 2012/08/27 13:00:05 UTC
Solr adding header and footer to streamed documents
Hello,
I'm trying to write a custom parser and add it to Tika, but I'm not having
much success so far.
I have a binary that converts a custom file type into an XML file, so
inside my custom parser I convert the custom file to an XML file and then
call XMLParser.
However, when I write the InputStream stream (inside the parse function) to
a File, it seems that Solr is adding a header and a footer that contain
metadata, so the file won't be converted properly.
(http://wiki.apache.org/solr/ExtractingRequestHandler#Metadata)
The following text is added as a header:
1 0000000: 2d2d 2d2d 2d2d 2d2d 2d2d 2d2d 2d2d 2d2d ----------------
2 0000010: 2d2d 2d2d 2d2d 2d2d 2d2d 2d2d 2d2d 3139 --------------19
3 0000020: 3230 3862 3937 3764 6637 0d0a 436f 6e74 208b977df7..Cont
4 0000030: 656e 742d 4469 7370 6f73 6974 696f 6e3a ent-Disposition:
5 0000040: 2066 6f72 6d2d 6461 7461 3b20 6e61 6d65 form-data; name
6 0000050: 3d22 6d79 6669 6c65 223b 2066 696c 656e ="myfile"; filen
7 0000060: 616d 653d 2268 7770 322e 6877 7022 0d0a ame="hwp2.hwp"..
8 0000070: 436f 6e74 656e 742d 5479 7065 3a20 6170 Content-Type: ap
9 0000080: 706c 6963 6174 696f 6e2f 6f63 7465 742d plication/octet-
10 0000090: 7374 7265 616d 0d0a 0d0a d0cf 11e0 a1b1 stream
The following text is added as a footer:
554 0002290: 0000 0000 0000 0000 0000 0d0a 2d2d 2d2d ............----
555 00022a0: 2d2d 2d2d 2d2d 2d2d 2d2d 2d2d 2d2d 2d2d ----------------
556 00022b0: 2d2d 2d2d 2d2d 2d2d 2d2d 3139 3230 3862 ----------19208b
557 00022c0: 3937 3764 6637 2d2d 0d0a 977df7--..
How can I prevent Solr from adding headers and footers?
Thank you.
Re: Solr adding header and footer to streamed documents
Posted by Chris Hostetter <ho...@fucit.org>.
: However, when I convert InputStream stream (inside parse function) to File,
: it seems that Solr is adding header and footer that contains Metadata so the
: file won't be converted properly.
...
It's not totally clear from your problem description, but I *think* you
are saying that you are using SolrJ to stream these special XML files you
created to Solr, and then you are using a custom parser registered with
Tika/ExtractingRequestHandler to parse them into documents. The output
you've pasted below appears to be a hex dump of the raw HTTP stream from
this communication.
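One way to check this reading is to decode the hex columns of the pasted header rows back into bytes. The following is only a minimal sketch in Python: the names HEADER_ROWS and decode_rows are hypothetical, and the hex values are copied from the ten header rows quoted in this thread, with the editor line numbers, offsets, and ASCII column dropped.

```python
# Hex columns copied from the ten "header" rows of the dump in this thread.
# HEADER_ROWS and decode_rows are hypothetical names used for illustration.
HEADER_ROWS = """
2d2d 2d2d 2d2d 2d2d 2d2d 2d2d 2d2d 2d2d
2d2d 2d2d 2d2d 2d2d 2d2d 2d2d 2d2d 3139
3230 3862 3937 3764 6637 0d0a 436f 6e74
656e 742d 4469 7370 6f73 6974 696f 6e3a
2066 6f72 6d2d 6461 7461 3b20 6e61 6d65
3d22 6d79 6669 6c65 223b 2066 696c 656e
616d 653d 2268 7770 322e 6877 7022 0d0a
436f 6e74 656e 742d 5479 7065 3a20 6170
706c 6963 6174 696f 6e2f 6f63 7465 742d
7374 7265 616d 0d0a 0d0a d0cf 11e0 a1b1
"""


def decode_rows(rows: str) -> bytes:
    """Join the hex groups and turn them back into raw bytes."""
    return bytes.fromhex("".join(rows.split()))


raw = decode_rows(HEADER_ROWS)

# A blank line (CRLF CRLF) ends the part headers; the file bytes follow it.
framing, _, file_start = raw.partition(b"\r\n\r\n")

for line in framing.split(b"\r\n"):
    print(line.decode("ascii"))
print("first file bytes:", file_start.hex())
```

The decoded text is a boundary line followed by Content-Disposition and Content-Type lines, which is the framing of a multipart/form-data request body rather than anything generated by Solr. The six bytes after the blank line (d0cf 11e0 a1b1) are where the uploaded file itself begins.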
Solr isn't adding any header/footer to your XML files. What you are seeing
are the normal MIME boundary lines and part headers that wrap a file when
it is sent in a multipart HTTP request (the usual way to send one or more
files). You may also occasionally notice "chunked encoding" markers used
to stream arbitrary amounts of data over HTTP w/o requiring the clients to
pre-calculate the total "Content-Length". This is all happening at the
HTTP protocol level, and will be dealt with by the HttpClient and Servlet
Container before Solr ever sees the InputStreams -- let alone hands them
to Tika -- so it should be completely transparent to you (unless you go
sniffing the wire like this).
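To make the framing concrete, here is a minimal sketch in Python of how a client wraps file bytes in a single multipart/form-data part and how the receiving side strips that wrapper again. It is not the code used by SolrJ, the servlet container, or Solr; the names build_multipart_body and extract_single_part are hypothetical, and the boundary, field name, and file name are taken from the dump in this thread.

```python
# Illustration only: hypothetical helpers, not Solr or servlet container code.
CRLF = b"\r\n"
# Boundary as it appears in the dump (the delimiter line adds a leading "--").
BOUNDARY = "-" * 28 + "19208b977df7"


def build_multipart_body(payload: bytes, field: str, filename: str,
                         boundary: str = BOUNDARY) -> bytes:
    """Wrap payload in one multipart/form-data part, as an HTTP client would."""
    head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="{field}"; filename="{filename}"\r\n'
        "Content-Type: application/octet-stream\r\n"
        "\r\n"
    ).encode("ascii")
    tail = f"\r\n--{boundary}--\r\n".encode("ascii")
    return head + payload + tail


def extract_single_part(body: bytes, boundary: str = BOUNDARY):
    """Strip the framing from a one-part body; return (headers, file bytes)."""
    delimiter = b"--" + boundary.encode("ascii")
    if not body.startswith(delimiter + CRLF):
        raise ValueError("body does not start with the expected delimiter")
    # The closing delimiter is CRLF + "--" + boundary + "--".
    end = body.rindex(CRLF + delimiter + b"--")
    part = body[len(delimiter) + len(CRLF):end]
    header_block, separator, content = part.partition(CRLF + CRLF)
    if not separator:
        raise ValueError("no blank line between part headers and content")
    headers = {}
    for line in header_block.split(CRLF):
        name, _, value = line.partition(b":")
        headers[name.decode("ascii").lower()] = value.strip().decode("ascii")
    return headers, content


if __name__ == "__main__":
    # Stand-in payload: the six file bytes visible in the dump plus padding.
    payload = bytes.fromhex("d0cf11e0a1b1") + b"\x00" * 10
    body = build_multipart_body(payload, "myfile", "hwp2.hwp")
    headers, content = extract_single_part(body)
    print(headers["content-disposition"])
    print("payload recovered unchanged:", content == payload)
```

With these inputs the payload starts at offset 0x9a of the body, which matches where the file bytes begin in the pasted dump. A custom parser that sees the boundary lines in its InputStream is therefore most likely reading the raw request body instead of the stream that comes out after this unwrapping step.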
If you are encountering an actual problem, then you need to give us a lot
more details about how you are using SolrJ/Solr, what servlet container
you are using, what your custom parser code looks like, and what kind of
errors you are getting, so someone can try to reproduce the problem.
: Following text is added as a header
:
: 1 0000000: 2d2d 2d2d 2d2d 2d2d 2d2d 2d2d 2d2d 2d2d ----------------
: 2 0000010: 2d2d 2d2d 2d2d 2d2d 2d2d 2d2d 2d2d 3139 --------------19
: 3 0000020: 3230 3862 3937 3764 6637 0d0a 436f 6e74 208b977df7..Cont
: 4 0000030: 656e 742d 4469 7370 6f73 6974 696f 6e3a ent-Disposition:
: 5 0000040: 2066 6f72 6d2d 6461 7461 3b20 6e61 6d65 form-data; name
: 6 0000050: 3d22 6d79 6669 6c65 223b 2066 696c 656e ="myfile"; filen
: 7 0000060: 616d 653d 2268 7770 322e 6877 7022 0d0a ame="hwp2.hwp"..
: 8 0000070: 436f 6e74 656e 742d 5479 7065 3a20 6170 Content-Type: ap
: 9 0000080: 706c 6963 6174 696f 6e2f 6f63 7465 742d plication/octet-
: 10 0000090: 7374 7265 616d 0d0a 0d0a d0cf 11e0 a1b1 stream
:
:
: Following text is added as a footer
:
: 554 0002290: 0000 0000 0000 0000 0000 0d0a 2d2d 2d2d ............----
: 555 00022a0: 2d2d 2d2d 2d2d 2d2d 2d2d 2d2d 2d2d 2d2d ----------------
: 556 00022b0: 2d2d 2d2d 2d2d 2d2d 2d2d 3139 3230 3862 ----------19208b
: 557 00022c0: 3937 3764 6637 2d2d 0d0a 977df7--..
-Hoss