Posted to mapreduce-user@hadoop.apache.org by ph...@free.fr on 2014/01/16 13:03:12 UTC

Application to parse and index PDFs


Hello,

I would like to develop an application that would index place and people names, dates, numbers and monetary amounts, among other things, contained in thousands of PDFs. People and place names would be looked up in gazetteers (i.e., dictionaries), and dates, numbers and amounts would be normalized so as to be comparable (e.g., find all PDFs whose contents contain dates > 20010101 and < 20100101).
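
To make the normalization concrete, here is a rough sketch of the kind of date handling I have in mind (the class name is just a placeholder, and the list of input formats would of course be much longer):

import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Rough sketch: normalize free-text dates to a comparable yyyyMMdd integer.
public class DateNormalizer {

    // Candidate input formats; the real list would be much longer.
    private static final List<SimpleDateFormat> FORMATS = Arrays.asList(
            new SimpleDateFormat("dd/MM/yyyy", Locale.ENGLISH),
            new SimpleDateFormat("MMMM d, yyyy", Locale.ENGLISH),
            new SimpleDateFormat("yyyy-MM-dd", Locale.ENGLISH));

    private static final SimpleDateFormat CANONICAL =
            new SimpleDateFormat("yyyyMMdd", Locale.ENGLISH);

    // Returns e.g. 20051231 for "31/12/2005", or -1 if no format matches.
    public static int normalize(String raw) {
        for (SimpleDateFormat f : FORMATS) {
            try {
                return Integer.parseInt(CANONICAL.format(f.parse(raw.trim())));
            } catch (ParseException e) {
                // try the next format
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        System.out.println(normalize("31/12/2005"));       // 20051231
        System.out.println(normalize("January 5, 2009"));  // 20090105
    }
}

Amounts and plain numbers would be normalized in the same spirit, into a single numeric field per document.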

The indexed documents would then be sorted according to certain criteria, e.g., document name or date, so that user searches made through a search engine (e.g., SOLR) yield documents ordered according to those criteria.
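
For the search side, I imagine something along these lines (SolrJ; the core name and the field names "date_i" and "doc_name_s" are made up):

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrServerException;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.SolrDocument;

// Rough sketch: fetch PDFs whose normalized date falls in a range,
// ordered by document name.
public class SearchSketch {
    public static void main(String[] args) throws SolrServerException {
        HttpSolrServer solr = new HttpSolrServer("http://localhost:8983/solr/pdfs");

        SolrQuery q = new SolrQuery();
        q.setQuery("date_i:[20010101 TO 20100101]");   // normalized dates are plain integers
        q.addSort("doc_name_s", SolrQuery.ORDER.asc);  // results come back ordered by name

        QueryResponse rsp = solr.query(q);
        for (SolrDocument doc : rsp.getResults()) {
            System.out.println(doc.getFieldValue("doc_name_s"));
        }
    }
}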

The index would then be transferred to a search engine.

Can anyone tell me if Hadoop would be useful in any part of the development process of this application, e.g., the index storage and sorting part?
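
If it is, I picture the text extraction step as something like a map-only job over a list of PDF paths. The sketch below uses Tika for the PDF text extraction, but that is just my guess; the entity tagging (GATE, UIMA, OpenNLP) would plug in where the text is emitted:

import java.io.IOException;
import java.io.InputStream;

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.tika.Tika;
import org.apache.tika.exception.TikaException;

// Rough sketch of a map-only extraction step. The job input is a plain text
// file listing one HDFS path to a PDF per line; the mapper opens each PDF,
// pulls its text out with Tika, and emits (document name, extracted text).
public class PdfExtractMapper extends Mapper<LongWritable, Text, Text, Text> {

    private final Tika tika = new Tika();

    @Override
    protected void map(LongWritable offset, Text pdfPath, Context context)
            throws IOException, InterruptedException {
        Path path = new Path(pdfPath.toString().trim());
        FileSystem fs = path.getFileSystem(context.getConfiguration());

        try (InputStream in = fs.open(path)) {
            String text = tika.parseToString(in);
            context.write(new Text(path.getName()), new Text(text));
        } catch (TikaException e) {
            // skip PDFs that cannot be parsed
            context.getCounter("pdf", "parse_failures").increment(1);
        }
    }
}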

Furthermore, are Hadoop crawlers, such as Nutch, good substitutes for parsing and indexing tools such as GATE, UIMA, or OpenNLP? For instance, can the above token recognition tasks be done with such crawlers?

Many thanks.

Philippe