Posted to solr-user@lucene.apache.org by neerajp <ne...@yahoo.com> on 2013/12/09 11:55:57 UTC

Indexing on plain text and binary data in a single HTTP POST request

Hi,
I am using Solr to search my email data. My application is written in C++, so
I am using the cURL library to POST the data to Solr for indexing. I post the
data in XML format; some of the XML fields are plain text and some are
binary. What should I do so that Solr can index both types of data (plain
text as well as binary) arriving in a single XML file?

For reference, my XML file looks like:
"<add><doc><field name=mailbox-id>1111</field><field
name=folder>INBOX</field><field name=from>solr solr
<so...@abc.com></field><field name=to>solr <so...@abc.com></field><field
name=email-body>HI I AM EMAIL BODY\r\n\r\nTHANKS</field><field
name=email-attachment>Some binary data</doc></add>"

I tried to use ExtractingUpdateProcessorFactory, but it seems that it is not
supported in Solr 4.5 (which I am using), or in any other released version of
Solr.

Also, I think I cannot use ExtractingRequestHandler for my problem, because
the document is in XML format and contains mixed data (text and binary). Am I
right? If yes, please suggest how to proceed; if no, how can I use
ExtractingRequestHandler to extract text from some of the binary fields?

Any help is highly appreciated.



--
View this message in context: http://lucene.472066.n3.nabble.com/Indexing-on-plain-text-and-binary-data-in-a-single-HTTP-POST-request-tp4105661.html
Sent from the Solr - User mailing list archive at Nabble.com.

Re: Indexing on plain text and binary data in a single HTTP POST request

Posted by Raymond Wiker <rw...@gmail.com>.

I would index all attachments separately, but with some sort of reference
back to the mail message. That way, I could use the update handler for the
text and metadata of the mail message, and the update/extract handler
for the binary attachment(s) and a restricted set of metadata (file name,
content type, reference back to the email message).

Note that if you're implementing some sort of connector for indexing your
content, you could handle the binary attachments on the connector side
instead.
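
A minimal Python sketch of that split, done on the connector side, assuming
the mail is available as a raw MIME string. The function name and the field
names (id, parent-id, file-name, content-type, content) are hypothetical and
not taken from any Solr schema. The message document would go to the update
handler, and each attachment's bytes to the update/extract handler.

```python
from email import message_from_string


def split_for_indexing(raw_email: str, message_id: str):
    """Return (message_doc, attachment_docs) so each can be indexed separately.

    All field names are illustrative assumptions; adapt them to your schema.
    """
    msg = message_from_string(raw_email)
    body_parts = []
    attachments = []
    for part in msg.walk():
        if part.is_multipart():
            continue  # container parts carry no payload of their own
        # decode=True undoes base64 / quoted-printable transfer encoding
        payload = part.get_payload(decode=True)
        filename = part.get_filename()
        if filename is None:
            if part.get_content_type() == "text/plain":
                charset = part.get_content_charset() or "utf-8"
                body_parts.append(payload.decode(charset, errors="replace"))
            continue
        attachments.append({
            "id": f"{message_id}-att{len(attachments)}",
            "parent-id": message_id,  # reference back to the mail message
            "file-name": filename,
            "content-type": part.get_content_type(),
            "content": payload,  # raw bytes, ready to send as a file stream
        })
    message_doc = {
        "id": message_id,
        "from": msg.get("From", ""),
        "to": msg.get("To", ""),
        "email-body": "\n".join(body_parts),
    }
    return message_doc, attachments
```

Searching inside attachments then becomes a query on the attachment
documents, with parent-id used to get back to the mail message.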

On Tue, Dec 10, 2013 at 8:41 AM, neerajp <ne...@yahoo.com> wrote:

> Please find my response inline:
>
> > Assuming that your binary fields are MIME attachments to email messages,
> > they will probably already be encoded as base64. Why not just leave
> > them that way in Solr too? You can't do much with them other than store
> > them, right? Or do you have some kind of image processing going on? You
> > can always decode them in your client when you pull them out.
>
> [Neeraj]: Yes, the binary fields are MIME attachments to email messages, but
> I want to index the attachments. For that, I need to decode the
> base64-encoded data back to binary on the Solr side and then extract the
> text from it by some means, so that the text can be indexed and I can
> search inside the attachments.

Re: Indexing on plain text and binary data in a single HTTP POST request

Posted by neerajp <ne...@yahoo.com>.

Please find my response inline:

> Assuming that your binary fields are MIME attachments to email messages,
> they will probably already be encoded as base64. Why not just leave
> them that way in Solr too? You can't do much with them other than store
> them, right? Or do you have some kind of image processing going on? You
> can always decode them in your client when you pull them out.

[Neeraj]: Yes, the binary fields are MIME attachments to email messages, but I
want to index the attachments. For that, I need to decode the base64-encoded
data back to binary on the Solr side and then extract the text from it by
some means, so that the text can be indexed and I can search inside the
attachments.




Re: Indexing on plain text and binary data in a single HTTP POST request

Posted by Michael Sokolov <ms...@safaribooksonline.com>.

On 12/9/2013 11:13 PM, neerajp wrote:
> Hi,
> Please find my response inline:
>
> > That said, the obvious alternative is to use /update/extract instead of
> > /update; this gives you a way of handling up to one binary stream in
> > addition to any number of fields that can be represented as text. In that
> > case, you need to construct a POST request that sends the binary content
> > as a file stream, and the other parameters as ordinary form data
> > (actually, it may be possible to send some/all of the other fields as URL
> > parameters, but that does not really simplify things).
>
> [Neeraj]: I thought about this approach, but it won't work in my case
> because there are a lot of text fields and their size is also significant.
> I am looking for some other suggestion.
>
Assuming that your binary fields are MIME attachments to email messages,
they will probably already be encoded as base64. Why not just leave
them that way in Solr too? You can't do much with them other than store
them, right? Or do you have some kind of image processing going on? You
can always decode them in your client when you pull them out.

-Mike

Re: Indexing on plain text and binary data in a single HTTP POST request

Posted by neerajp <ne...@yahoo.com>.

Hi,
Please find my response inline:

> That said, the obvious alternative is to use /update/extract instead of
> /update; this gives you a way of handling up to one binary stream in
> addition to any number of fields that can be represented as text. In that
> case, you need to construct a POST request that sends the binary content as
> a file stream, and the other parameters as ordinary form data (actually, it
> may be possible to send some/all of the other fields as URL parameters, but
> that does not really simplify things).

[Neeraj]: I thought about this approach, but it won't work in my case because
there are a lot of text fields and their size is also significant. I am
looking for some other suggestion.




Re: Indexing on plain text and binary data in a single HTTP POST request

Posted by Raymond Wiker <rw...@gmail.com>.

On 09 Dec 2013, at 17:20 , neerajp <ne...@yahoo.com> wrote:

> 
> 2) Your binary content is encoded in some way inside XML, right? Not just 
> random binary, which would make it invalid XML? Like base64 or something? 
> 
> [Neeraj]: I want to use raw binary (*not base64 encoded*) in some of the
> XML fields, inside a CDATA section, so that the XML does not become
> invalid. I hope I can do this.

You can't: there are byte values that are simply not acceptable in an XML
stream, even inside a CDATA section. Encoding the binary data (for example as
base64) is the canonical way around this.

That said, the obvious alternative is to use /update/extract instead of
/update. This gives you a way of handling up to one binary stream in addition
to any number of fields that can be represented as text. In that case, you
need to construct a POST request that sends the binary content as a file
stream and the other parameters as ordinary form data (it may also be
possible to send some or all of the other fields as URL parameters, but that
does not really simplify things).
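
Both points can be checked with a short Python sketch (an illustration under
assumptions, not a tested Solr client). The first part shows an XML parser
rejecting a control byte inside CDATA while accepting the same bytes as
base64. The second part builds an /update/extract URL that passes text fields
as literal.<fieldname> parameters; the host, port, and the field name "id"
are placeholders, and the actual POST of the file stream is left out.

```python
import base64
import xml.etree.ElementTree as ET
from urllib.parse import urlencode


def is_well_formed(xml_text: str) -> bool:
    """Return True if the XML parser accepts the text."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


def extract_url(solr_base: str, literals: dict) -> str:
    """Build an /update/extract URL carrying text fields as literal.* params."""
    params = {"literal." + name: value for name, value in literals.items()}
    return solr_base + "/update/extract?" + urlencode(params)


raw = bytes([0x01, 0x02, 0xFF])  # stand-in for attachment bytes

# Control characters such as 0x01 are not legal XML 1.0 characters,
# whether or not they are wrapped in CDATA.
cdata_xml = "<field><![CDATA[" + raw.decode("latin-1") + "]]></field>"
# base64 output uses only letters, digits, '+', '/' and '=', all safe in XML.
b64_xml = "<field>" + base64.b64encode(raw).decode("ascii") + "</field>"

print(is_well_formed(cdata_xml))
print(is_well_formed(b64_xml))
# Placeholder base URL; substitute your own Solr core or collection path.
print(extract_url("http://localhost:8983/solr", {"id": "1111-att0"}))
```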

Re: Indexing on plain text and binary data in a single HTTP POST request

Posted by neerajp <ne...@yahoo.com>.

Thanks, everybody, for sharing your ideas.

I now understand that XML cannot carry arbitrary binary data, so I will
encode the data in base64.
Yes, I can write a custom URP that converts the base64-encoded fields to
binary fields. At that point I have binary fields in my document. *My
question is: how can I convert those binary fields to text so that Solr can
index them?*




Re: Indexing on plain text and binary data in a single HTTP POST request

Posted by Shawn Heisey <so...@elyograg.org>.

On 12/9/2013 9:20 AM, neerajp wrote:
> I tried to use ExtractingUpdateProcessor, but soon found out that it has not
> been released in Solr 4.5.
> I am not sure how to use ExtractingRequestHandler for an XML document that
> has some fields in plain text and some in raw binary format. It seems to me
> that ExtractingRequestHandler is meant to extract text from a binary file
> input, but my input document is XML, not binary.

ExtractingRequestHandler is a contrib module.  It's not included in the 
Solr application war itself, but it IS in the download.  You can find 
the jars in contrib/extraction/lib in all 4.x versions, including 4.5, 
4.5.1, and 4.6.

Thanks,
Shawn

Re: Indexing on plain text and binary data in a single HTTP POST request

Posted by neerajp <ne...@yahoo.com>.

Hi Alexandre,
Thanks very much for responding to my post. Please find my response inline:

> 1) For your email address fields, you are escaping the brackets, right?
> Not just "solr solr <[hidden email]>" as you show, but the < and > escaped,
> right? Otherwise, those email addresses become part of XML markup and mess
> it all up

[Neeraj]: Yes, you are right. I used CDATA to escape < and > and any other
special characters in the XML.

> 2) Your binary content is encoded in some way inside XML, right? Not just
> random binary, which would make it invalid XML? Like base64 or something?

[Neeraj]: I want to use raw binary (*not base64 encoded*) in some of the
XML fields, inside a CDATA section, so that the XML does not become invalid.
I hope I can do this.

> 3) To decode base64 as first step and to feed it through whatever you want
> to process actually binary with as a second step. So, it might be a custom
> URP, with similar functionality to ExtractingRequestHandler with the
> difference that you already have a document object and you are mapping one
> - binary - field in it into a bunch of other fields with some conventions
> on names, overrides, etc.

[Neeraj]: My XML document now contains some fields in plain text and some
fields in raw binary format.

I tried to use ExtractingUpdateProcessor, but soon found out that it has not
been released in Solr 4.5.
I am not sure how to use ExtractingRequestHandler for an XML document that
has some fields in plain text and some in raw binary format. It seems to me
that ExtractingRequestHandler is meant to extract text from a binary file
input, but my input document is XML, not binary.

I am new to Solr, so I would value your suggestions.


Re: Indexing on plain text and binary data in a single HTTP POST request

Posted by Alexandre Rafalovitch <ar...@gmail.com>.

Not a solution, but a couple of thoughts:
1) For your email address fields, you are escaping the brackets, right?
Not just "solr solr <so...@abc.com>" as you show, but with the < and >
escaped? Otherwise, those email addresses become part of the XML markup and
mess it all up.
2) Your binary content is encoded in some way inside the XML, right? Not just
random binary, which would make it invalid XML? Like base64 or something?
3) I suspect you will need to use an UpdateRequestProcessor (URP) one way or
another: to decode the base64 as a first step, and to feed the result through
whatever you want to process the actual binary with as a second step. So, it
might be a custom URP with functionality similar to ExtractingRequestHandler,
with the difference that you already have a document object and you are
mapping one (binary) field in it into a bunch of other fields, with some
conventions on names, overrides, etc.
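
Points 1 and 2 can be illustrated on the client side with a small Python
sketch, assuming the field names from the original post. It lets an XML
library do the escaping and attribute quoting instead of concatenating
strings, and base64-encodes the attachment bytes. The function name
build_add_xml is hypothetical. This only makes the request valid XML; the
base64 text still has to be decoded and run through text extraction
somewhere (point 3) before the attachment content becomes searchable.

```python
import base64
import xml.etree.ElementTree as ET


def build_add_xml(fields: dict, attachment: bytes) -> str:
    """Build a Solr <add><doc> payload with escaped text and a base64 field."""
    add = ET.Element("add")
    doc = ET.SubElement(add, "doc")
    for name, value in fields.items():
        # ElementTree escapes <, > and & in text and quotes attribute values.
        ET.SubElement(doc, "field", name=name).text = value
    encoded = base64.b64encode(attachment).decode("ascii")
    ET.SubElement(doc, "field", name="email-attachment").text = encoded
    return ET.tostring(add, encoding="unicode")


payload = build_add_xml(
    {
        "mailbox-id": "1111",
        "folder": "INBOX",
        "from": "solr solr <solr@example.com>",  # placeholder address
    },
    b"\x00\x01\x02",  # stand-in for attachment bytes
)
print(payload)
```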

Regards,
   Alex.

Personal website: http://www.outerthoughts.com/
LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch
- Time is the quality of nature that keeps events from happening all at
once. Lately, it doesn't seem to be working.  (Anonymous  - via GTD book)

