Posted to user@uima.apache.org by Philippe de Rochambeau <ph...@free.fr> on 2015/02/19 21:28:06 UTC

Analysing archive PDFs

Hello,

In the past few months, I have indexed tens of thousands of PDFs containing newspaper articles from 1887 until 1940 using SOLR for my company.

Every day, my colleagues in the Archive Department spend hours searching through the archives using SOLR, looking for articles that are potentially interesting from a social and historical point of view.

Can UIMA or OpenNLP be used to automate their work and/or to analyze patterns in the data?

Many thanks.

Philippe

Re: Analysing archive PDFs

Posted by Richard Eckart de Castilho <re...@apache.org>.
On 19.02.2015, at 21:28, Philippe de Rochambeau <ph...@free.fr> wrote:

> Hello,
> 
> In the past few months, I have indexed tens of thousands of PDFs containing newspaper articles from 1887 until 1940 using SOLR for my company.
> 
> Every day, my colleagues in the Archive Department spend hours searching through the archives using SOLR, looking for articles that are potentially interesting from a social and historical point of view.
> 
> Can UIMA or OpenNLP be used to automate their work and/or to analyze patterns in the data?

I'd say that depends quite a bit on what kind of information your colleagues search for.
UIMA itself is just a framework to support unstructured information analysis. It does not
actually analyze text - that is the job of UIMA components. There are many UIMA components
for various kinds of tasks, in particular for natural language processing tasks.

OpenNLP provides tools for basic linguistic analysis of text, such as part-of-speech tagging,
parsing, and named entity recognition, and it also ships some UIMA components. However, to use
OpenNLP effectively, you need to train models for it. Most models available for download from
the OpenNLP website give suboptimal results because they are trained only on small data sets.
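To make that a bit more concrete, here is a minimal sketch that uses OpenNLP's plain Java API
(not its UIMA wrappers) to find person names in one sentence. The model file name is one of the
downloadable models standing in for a model you would ideally train yourself:

import java.io.FileInputStream;
import java.io.InputStream;

import opennlp.tools.namefind.NameFinderME;
import opennlp.tools.namefind.TokenNameFinderModel;
import opennlp.tools.tokenize.SimpleTokenizer;
import opennlp.tools.util.Span;

public class NameFinderSketch {

    public static void main(String[] args) throws Exception {
        // Load a name finder model; "en-ner-person.bin" is one of the models from the OpenNLP site.
        try (InputStream modelIn = new FileInputStream("en-ner-person.bin")) {
            TokenNameFinderModel model = new TokenNameFinderModel(modelIn);
            NameFinderME finder = new NameFinderME(model);

            // Tokenize the text of one article (here just a single example sentence).
            String[] tokens = SimpleTokenizer.INSTANCE.tokenize(
                    "Georges Clemenceau returned to Paris on Tuesday.");

            // Find name spans over the token sequence and print their covered tokens.
            for (Span span : finder.find(tokens)) {
                StringBuilder name = new StringBuilder();
                for (int i = span.getStart(); i < span.getEnd(); i++) {
                    name.append(tokens[i]).append(' ');
                }
                System.out.println(span.getType() + ": " + name.toString().trim());
            }

            // Reset document-level adaptive data before processing the next article.
            finder.clearAdaptiveData();
        }
    }
}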

If you are looking for patterns, UIMA Ruta might help. You can implement rules to detect and
analyze certain kinds of information, e.g. bibliographic records or information from a CV.
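To illustrate the idea, here is a rough, hedged sketch of applying a single Ruta rule from Java
via Ruta.apply(). The custom "Year" type and the rule are invented for the example, and the
type-system setup (merging Ruta's BasicTypeSystem descriptor via uimaFIT with UIMA's
CasCreationUtils) may need adjusting to your Ruta version:

import java.util.Arrays;

import org.apache.uima.UIMAFramework;
import org.apache.uima.cas.CAS;
import org.apache.uima.cas.Type;
import org.apache.uima.cas.text.AnnotationFS;
import org.apache.uima.fit.factory.TypeSystemDescriptionFactory;
import org.apache.uima.resource.metadata.TypeSystemDescription;
import org.apache.uima.ruta.engine.Ruta;
import org.apache.uima.util.CasCreationUtils;

public class RutaYearSketch {

    public static void main(String[] args) throws Exception {
        // Declare a custom annotation type and merge it with Ruta's own basic types,
        // which have to be present in the CAS for the rule engine to work.
        TypeSystemDescription custom = UIMAFramework.getResourceSpecifierFactory()
                .createTypeSystemDescription();
        custom.addType("org.example.Year", "A four-digit year", "uima.tcas.Annotation");
        TypeSystemDescription ruta = TypeSystemDescriptionFactory
                .createTypeSystemDescription("org.apache.uima.ruta.engine.BasicTypeSystem");
        TypeSystemDescription merged =
                CasCreationUtils.mergeTypeSystems(Arrays.asList(custom, ruta));

        CAS cas = CasCreationUtils.createCas(merged, null, null);
        cas.setDocumentText("PARIS, 14 March 1923. The exposition opened before large crowds.");

        // One rule: every number token consisting of exactly four digits becomes a Year annotation.
        Ruta.apply(cas, "NUM{REGEXP(\"[0-9]{4}\") -> MARK(Year)};");

        // Print the matches.
        Type yearType = cas.getTypeSystem().getType("org.example.Year");
        for (AnnotationFS year : cas.getAnnotationIndex(yearType)) {
            System.out.println("Year: " + year.getCoveredText());
        }
    }
}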

Apart from what Apache UIMA has to offer, these pointers might also be interesting to you:

Topic modelling is a trending technique for sifting through large amounts of data and detecting
interesting things. There are many recent research publications on this topic.

I recently tweeted this video [1], so I might as well share it here.

A colleague of mine uses topic models to analyze historical school books [2]. As part of this,
we also built UIMA components in DKPro Core [3] to generate topic models using the Mallet library [4].
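Purely as an illustration of what that looks like when using Mallet's Java API directly (rather
than through the DKPro Core components mentioned above), here is a small sketch; the article
strings, the topic count, and the hyperparameters are placeholders:

import java.util.ArrayList;
import java.util.regex.Pattern;

import cc.mallet.pipe.CharSequence2TokenSequence;
import cc.mallet.pipe.CharSequenceLowercase;
import cc.mallet.pipe.Pipe;
import cc.mallet.pipe.SerialPipes;
import cc.mallet.pipe.TokenSequence2FeatureSequence;
import cc.mallet.pipe.iterator.ArrayIterator;
import cc.mallet.topics.ParallelTopicModel;
import cc.mallet.types.InstanceList;

public class TopicModelSketch {

    public static void main(String[] args) throws Exception {
        // Each array element stands for the extracted text of one article; in practice
        // you would feed the article texts from your archive here.
        String[] articles = {
            "The exposition opened in Paris to great crowds of visitors ...",
            "Grain prices fell sharply after the harvest reports from the provinces ...",
            "The new railway line between Lyon and Marseille was inaugurated ..."
        };

        // Standard Mallet preprocessing: lowercase, tokenize, map tokens to feature indices.
        // (A stopword-removal pipe would normally be added as well.)
        ArrayList<Pipe> pipes = new ArrayList<>();
        pipes.add(new CharSequenceLowercase());
        pipes.add(new CharSequence2TokenSequence(Pattern.compile("\\p{L}[\\p{L}\\p{P}]+\\p{L}")));
        pipes.add(new TokenSequence2FeatureSequence());

        InstanceList instances = new InstanceList(new SerialPipes(pipes));
        instances.addThruPipe(new ArrayIterator(articles));

        // Train an LDA model; topic count and hyperparameters are arbitrary for this sketch.
        ParallelTopicModel lda = new ParallelTopicModel(10, 1.0, 0.01);
        lda.addInstances(instances);
        lda.setNumThreads(2);
        lda.setNumIterations(200);
        lda.estimate();

        // Print the most probable words per topic.
        Object[][] topWords = lda.getTopWords(8);
        for (int topic = 0; topic < topWords.length; topic++) {
            StringBuilder sb = new StringBuilder("Topic " + topic + ":");
            for (Object word : topWords[topic]) {
                sb.append(' ').append(word);
            }
            System.out.println(sb);
        }
    }
}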

Cheers,

-- Richard

[1] http://nycdatascience.com/news/using-machine-learning-to-aid-journalism-at-the-new-york-times/
[2] https://www.ukp.tu-darmstadt.de/research/current-projects/welt-der-kinder/
[3] https://dkpro-core-asl.googlecode.com
[4] http://mallet.cs.umass.edu