Posted to solr-user@lucene.apache.org by tstusr <ul...@gmail.com> on 2017/07/05 18:32:27 UTC

Best way to split text

We are working on a search application for large PDFs (~10-100 MB), which
are being indexed correctly.

However, we want to do some training as part of the pipeline, so we are
implementing some Spark MLlib algorithms.

Now, one of the requirements is to split the documents into either
paragraphs or pages. The alternatives we have found are to split via
Tika/PDFBox or to write a custom processor that catches words; a rough
sketch of the page-level approach is below.
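
For reference, this is roughly what we have in mind for the page-level
split. It is only a sketch, assuming PDFBox 2.x; the input file name and
the printout standing in for the pipeline hand-off are placeholders:

    import java.io.File;

    import org.apache.pdfbox.pdmodel.PDDocument;
    import org.apache.pdfbox.text.PDFTextStripper;

    public class PageSplitter {
        public static void main(String[] args) throws Exception {
            // Placeholder path; in our case this would be one of the large PDFs.
            try (PDDocument doc = PDDocument.load(new File("input.pdf"))) {
                PDFTextStripper stripper = new PDFTextStripper();
                for (int page = 1; page <= doc.getNumberOfPages(); page++) {
                    // Restrict extraction to a single page so each page
                    // becomes its own unit for indexing / MLlib.
                    stripper.setStartPage(page);
                    stripper.setEndPage(page);
                    String text = stripper.getText(doc);
                    // Stand-in for handing the page off to the pipeline.
                    System.out.printf("--- page %d (%d chars) ---%n", page, text.length());
                }
            }
        }
    }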

In terms of performance, which option is preferred: a custom Tika class
that extracts only paragraphs, or extracting the whole document and then
filtering the paragraphs that match our vocabulary? A sketch of the second
option follows.
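
The vocabulary-filter variant we have in mind is roughly the following.
Again just a sketch: the blank-line paragraph split and the hard-coded
vocabulary are assumptions, not what we have in production:

    import java.util.ArrayList;
    import java.util.Arrays;
    import java.util.HashSet;
    import java.util.List;
    import java.util.Set;

    public class ParagraphFilter {
        // Placeholder vocabulary; ours would be loaded from the training data.
        private static final Set<String> VOCABULARY =
                new HashSet<>(Arrays.asList("solr", "lucene", "index"));

        // Split extracted text on blank lines and keep only the
        // paragraphs that contain at least one vocabulary term.
        public static List<String> filterParagraphs(String text) {
            List<String> kept = new ArrayList<>();
            for (String paragraph : text.split("\\n\\s*\\n")) {
                String lower = paragraph.toLowerCase();
                for (String term : VOCABULARY) {
                    if (lower.contains(term)) {
                        kept.add(paragraph.trim());
                        break;
                    }
                }
            }
            return kept;
        }
    }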

Thanks for your advice.


