Posted to solr-user@lucene.apache.org by neotorand <ne...@gmail.com> on 2018/06/20 15:05:59 UTC

Indexing part of Binary Documents and not the entire contents

Hi List,
I have a specific requirement where I need to index two things:

The metadata of any document
The parts of the document that match keywords I configure

The first part I am able to achieve through ERH or FileListEntityProcessor.

I am struggling with the second part. I am looking for an effective and smart
approach to handle this.
Can anyone give me a pointer or help with this?

Thanks in advance!


Regards
Neo




Re: Indexing part of Binary Documents and not the entire contents

Posted by neotorand <ne...@gmail.com>.
Thanks Shawn,

Yes, I agree that ERH is never suggested in production.
I am writing my own custom indexing program.
Any pointers on this?

What exactly I am looking for is a custom indexing program that compiles
precisely the information I need and sends it to Solr.
On the other hand, I see that the method below is very expensive if the
document is large:

    autoParser.parse(input, textHandler, metadata, context);

because the ContentHandler would hold the entire contents in memory.
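
What I have in mind is roughly a handler that keeps only a bounded window of
text and copies out a snippet whenever one of my configured keywords shows up.
This is just an untested sketch; the class name, keyword list and snippet size
are placeholders, and snippets near a buffer boundary may come out short:

import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.sax.ContentHandlerDecorator;

public class KeywordSnippetExtractor {

    // Placeholder configuration: replace with whatever keywords you configure.
    private static final List<String> KEYWORDS = Arrays.asList("invoice", "contract");
    private static final int SNIPPET_CHARS = 200;

    public static List<String> extract(String path) throws Exception {
        final List<String> snippets = new ArrayList<>();

        // Keeps only a bounded tail of the text stream instead of the whole body,
        // and copies out a snippet whenever a configured keyword appears.
        ContentHandlerDecorator handler = new ContentHandlerDecorator() {
            private final StringBuilder buffer = new StringBuilder();

            @Override
            public void characters(char[] ch, int start, int length) {
                buffer.append(ch, start, length);
                for (String kw : KEYWORDS) {
                    int idx;
                    while ((idx = buffer.toString().toLowerCase().indexOf(kw)) >= 0) {
                        int end = Math.min(buffer.length(), idx + SNIPPET_CHARS);
                        snippets.add(buffer.substring(idx, end));
                        buffer.delete(0, end); // don't match the same hit twice
                    }
                }
                // Trim the front so memory stays bounded even for huge documents.
                if (buffer.length() > SNIPPET_CHARS) {
                    buffer.delete(0, buffer.length() - SNIPPET_CHARS);
                }
            }
        };

        AutoDetectParser parser = new AutoDetectParser();
        Metadata metadata = new Metadata();
        try (InputStream stream = Files.newInputStream(Paths.get(path))) {
            parser.parse(stream, handler, metadata, new ParseContext());
        }
        return snippets;
    }
}
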
Any suggestions?

Regards
Neo




Re: Indexing part of Binary Documents and not the entire contents

Posted by neotorand <ne...@gmail.com>.
Gus,
No worries about the bias.
I explored JesterJ a bit. It looks quite promising.
I will keep you posted on my experience soon.

Regards
Neo





Re: Indexing part of Binary Documents and not the entire contents

Posted by Gus Heck <gu...@gmail.com>.
You might consider using a free tool like JesterJ (www.jesterj.org), which
can possibly also automate the acquisition of the documents and their
transmission to Solr, as well as provide a framework for massaging the
contents of the documents in between (including Tika processing).

(Disclaimer: I'm the primary author of JesterJ, so I'm slightly biased ;) )

-Gus

On Wed, Jun 27, 2018 at 5:08 AM, neotorand <ne...@gmail.com> wrote:

> Thanks Erick,
> I have already gone through the Tika example link you shared.
> Please look at the code in bold.
> I believe the entire contents are still pushed into memory via the handler
> object.
> Sorry for copying the lengthy code from the Tika site.
>
> Regards
> Neo
>
> *Streaming the plain text in chunks*
> Sometimes, you want to chunk the resulting text up, perhaps to output as
> you
> go minimising memory use, perhaps to output to HDFS files, or any other
> reason! With a small custom content handler, you can do that.
>
> public List<String> parseToPlainTextChunks() throws IOException,
> SAXException, TikaException {
>     final List<String> chunks = new ArrayList<>();
>     chunks.add("");
>     ContentHandlerDecorator handler = new ContentHandlerDecorator() {
>         @Override
>         public void characters(char[] ch, int start, int length) {
>             String lastChunk = chunks.get(chunks.size() - 1);
>             String thisStr = new String(ch, start, length);
>
>             if (lastChunk.length() + length > MAXIMUM_TEXT_CHUNK_SIZE) {
>                 chunks.add(thisStr);
>             } else {
>                 chunks.set(chunks.size() - 1, lastChunk + thisStr);
>             }
>         }
>     };
>
>     AutoDetectParser parser = new AutoDetectParser();
>     Metadata metadata = new Metadata();
>     try (InputStream stream =
> ContentHandlerExample.class.getResourceAsStream("test2.doc")) {
>         *parser.parse(stream, handler, metadata);*
>         return chunks;
>     }
> }
>
>
>
>



-- 
http://www.the111shift.com

Re: Indexing part of Binary Documents and not the entire contents

Posted by neotorand <ne...@gmail.com>.
Thanks Erick,
I have already gone through the Tika example link you shared.
Please look at the code in bold.
I believe the entire contents are still pushed into memory via the handler object.
Sorry for copying the lengthy code from the Tika site.

Regards
Neo

*Streaming the plain text in chunks*
Sometimes, you want to chunk the resulting text up, perhaps to output as you
go minimising memory use, perhaps to output to HDFS files, or any other
reason! With a small custom content handler, you can do that.

public List<String> parseToPlainTextChunks() throws IOException,
SAXException, TikaException {
    final List<String> chunks = new ArrayList<>();
    chunks.add("");
    ContentHandlerDecorator handler = new ContentHandlerDecorator() {
        @Override
        public void characters(char[] ch, int start, int length) {
            String lastChunk = chunks.get(chunks.size() - 1);
            String thisStr = new String(ch, start, length);
 
            if (lastChunk.length() + length > MAXIMUM_TEXT_CHUNK_SIZE) {
                chunks.add(thisStr);
            } else {
                chunks.set(chunks.size() - 1, lastChunk + thisStr);
            }
        }
    };
 
    AutoDetectParser parser = new AutoDetectParser();
    Metadata metadata = new Metadata();
    try (InputStream stream =
ContentHandlerExample.class.getResourceAsStream("test2.doc")) {
        *parser.parse(stream, handler, metadata);*
        return chunks;
    }
}




Re: Indexing part of Binary Documents and not the entire contents

Posted by Erick Erickson <er...@gmail.com>.
Well, if you were using ERH you'd have the same problem, since it uses
Tika. But if you run Tika on some client somewhere and you do have a
document that blows out memory or has some other problem, your client
can crash without taking Solr with it.

That's one of the reasons, in fact, that we don't recommend running ERH in prod.

And I should point out that this is not a flaw in Tika. Rather, the
problem Tika has to cope with is immense.

And even a cursory look at Tika shows a streaming interface, see:
https://tika.apache.org/1.8/examples.html#Streaming_the_plain_text_in_chunks
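
If you just want to cap memory rather than process chunks, another option is
the write limit on Tika's WriteOutContentHandler. A rough, untested sketch
against the 1.x API:

import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;

import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.sax.BodyContentHandler;
import org.apache.tika.sax.WriteOutContentHandler;
import org.xml.sax.SAXException;

public class LimitedExtract {
    public static String firstChars(String path, int maxChars) throws Exception {
        // The handler stops buffering once maxChars characters have been written,
        // so only a bounded amount of text is ever held in memory.
        WriteOutContentHandler limited = new WriteOutContentHandler(maxChars);
        AutoDetectParser parser = new AutoDetectParser();
        Metadata metadata = new Metadata();

        try (InputStream stream = Files.newInputStream(Paths.get(path))) {
            parser.parse(stream, new BodyContentHandler(limited), metadata,
                    new ParseContext());
        } catch (SAXException e) {
            // Hitting the write limit surfaces as a SAXException; that just means
            // the text was truncated, not that parsing failed.
            if (!limited.isWriteLimitReached(e)) {
                throw e;
            }
        }
        return limited.toString();
    }
}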

Best,
Erick

On Tue, Jun 26, 2018 at 6:28 AM, Shawn Heisey <ap...@elyograg.org> wrote:
> On 6/26/2018 7:13 AM, neotorand wrote:
>>
>> Don't you think the below method is very expensive?
>>
>> autoParser.parse(input, textHandler, metadata, context);
>>
>> If the document is large, it will need enough memory to hold the entire
>> document (i.e., the ContentHandler).
>> Any other alternative?
>
>
> I did find this:
>
> https://stackoverflow.com/questions/25043720/using-poi-or-tika-to-extract-text-stream-to-stream-without-loading-the-entire-f
>
> But I have no actual experience with Tika.  If you want to get a definitive
> answer, you will need to go to a Tika support resource.  Although Solr does
> incorporate Tika, we are not experts in its use.
>
> Thanks,
> Shawn
>

Re: Indexing part of Binary Documents and not the entire contents

Posted by Shawn Heisey <ap...@elyograg.org>.
On 6/26/2018 7:13 AM, neotorand wrote:
> Don't you think the below method is very expensive?
>
> autoParser.parse(input, textHandler, metadata, context);
>
> If the document is large, it will need enough memory to hold the entire
> document (i.e., the ContentHandler).
> Any other alternative?

I did find this:

https://stackoverflow.com/questions/25043720/using-poi-or-tika-to-extract-text-stream-to-stream-without-loading-the-entire-f

But I have no actual experience with Tika.  If you want to get a 
definitive answer, you will need to go to a Tika support resource.  
Although Solr does incorporate Tika, we are not experts in its use.

Thanks,
Shawn


Re: Indexing part of Binary Documents and not the entire contents

Posted by neotorand <ne...@gmail.com>.
Thanks Erick,

Though I have seen this article in several places, I never went through it
seriously.

Don't you think the below method is very expensive?

autoParser.parse(input, textHandler, metadata, context);

If the document is large, it will need enough memory to hold the entire
document (i.e., the ContentHandler).
Any other alternative?

Regards
Neo




Re: Indexing part of Binary Documents and not the entire contents

Posted by Erick Erickson <er...@gmail.com>.
This may help you get started:

https://lucidworks.com/2012/02/14/indexing-with-solrj/
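
The short version is a small client program that builds SolrInputDocuments
itself and pushes them with SolrJ. A bare-bones, untested sketch (the URL and
field names are placeholders, and it assumes you already have the Tika metadata
and the keyword snippets for one document):

import java.util.List;

import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrInputDocument;
import org.apache.tika.metadata.Metadata;

public class SnippetIndexer {

    public static void index(String id, Metadata metadata, List<String> snippets)
            throws Exception {
        // Placeholder collection URL: point this at your own Solr.
        try (SolrClient solr = new HttpSolrClient.Builder(
                "http://localhost:8983/solr/mycollection").build()) {

            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", id);

            // Copy whichever Tika metadata keys you care about into Solr fields.
            for (String name : metadata.names()) {
                doc.addField("meta_" + name, metadata.get(name));
            }

            // Index only the keyword-matching snippets, not the whole body.
            for (String snippet : snippets) {
                doc.addField("content_snippet", snippet);
            }

            solr.add(doc);
            solr.commit();
        }
    }
}

In a real program you would reuse one SolrClient, batch the adds, and wrap each
document's extraction in its own try/catch so a single bad file doesn't stop
the run.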

Best,
Erick

On Thu, Jun 21, 2018 at 8:11 AM, Shawn Heisey <ap...@elyograg.org> wrote:
> On 6/20/2018 9:05 AM, neotorand wrote:
>>
>> I have a specific requirement where I need to index two things:
>>
>> The metadata of any document
>> The parts of the document that match keywords I configure
>>
>> The first part I am able to achieve through ERH or
>> FileListEntityProcessor.
>>
>> I am struggling with the second part. I am looking for an effective and
>> smart approach to handle this.
>> Can anyone give me a pointer or help with this?
>
>
> Write a custom indexing program to compile precisely the information that
> you need and send that to Solr.
>
> Yes, that is a serious suggestion.  Solr itself is very capable, but it
> can't do everything that every user's specific business requirements
> dictate.  A large percentage of Solr users have written custom indexing
> programs.
>
> It is strongly recommended that the ExtractingRequestHandler never be used
> in production, because the Tika software it utilizes is prone to serious
> problems that might extend as far as an actual program crash.  If Tika
> crashes and it's running inside Solr, then Solr crashes too.  Running Tika
> in a custom indexing program instead is recommended, so that if it crashes,
> it's only the indexing program that dies, not Solr.
>
> Thanks,
> Shawn
>

Re: Indexing part of Binary Documents and not the entire contents

Posted by Shawn Heisey <ap...@elyograg.org>.
On 6/20/2018 9:05 AM, neotorand wrote:
> I have a specific requirement where I need to index two things:
>
> The metadata of any document
> The parts of the document that match keywords I configure
>
> The first part I am able to achieve through ERH or FileListEntityProcessor.
>
> I am struggling with the second part. I am looking for an effective and smart
> approach to handle this.
> Can anyone give me a pointer or help with this?

Write a custom indexing program to compile precisely the information 
that you need and send that to Solr.

Yes, that is a serious suggestion.  Solr itself is very capable, but it 
can't do everything that every user's specific business requirements 
dictate.  A large percentage of Solr users have written custom indexing 
programs.

It is strongly recommended that the ExtractingRequestHandler never be 
used in production, because the Tika software it utilizes is prone to 
serious problems that might extend as far as an actual program crash.  
If Tika crashes and it's running inside Solr, then Solr crashes too.  
Running Tika in a custom indexing program instead is recommended, so 
that if it crashes, it's only the indexing program that dies, not Solr.

Thanks,
Shawn