You are viewing a plain text version of this content. The canonical link for it is here.

Posted to java-user@lucene.apache.org by "McGibbney, Lewis John" <Le...@gcu.ac.uk> on 2010/11/30 18:09:04 UTC

Keyword extraction from pdf to text

Hello list,

I am currently attempting to extract keywords from pdf documents, my aim is then to begin constructing a domain ontology using the words which are extracted. I do not need to index anything at this stage, but wish to extract and push the output as plain text into a text file. An example of input text from the pdf document would be as follows
________________________________
6.1.3 Calculating carbon dioxide emissions for the proposed dwelling
The second calculation involves establishing the carbon dioxide emissions
for the proposed dwelling (DER). To do this the values proposed for the
dwelling should be used in the methodology i.e. the U-values, air infiltration,
heating system, etc.
The exceptions to entering the dwelling specific values are:
a. it may be assumed that all glazing is orientated east/west;
b. average overshading may be assumed if not known. 'Very little' shading
should not be entered;
c. 2 sheltered sides should be assumed if not known. More than 2 sheltered
sides should not be entered;
d. where secondary heating is proposed, if a chimney or flue is present but
no appliance installed the worst case should be assumed i.e. a decorative
fuel-effect gas appliance with 20% efficiency. If there is no gas point, an
open fire with 37% efficiency should be assumed, burning solid mineral
fuel for dwellings outwith a smokeless zone and smokeless solid mineral
fuel for those that are within such a zone.
All other values can be varied, but before entering values into the
methodology, reference should be made to:
* the back-stop U-values identified in guidance to standard 6.2; and
* guidance on systems and equipment within standards 6.3 to 6.6.
________________________________
My requirements are as follows

* drop stop words

* be able to pick up Bi Grams or NGrams such as the following "U-Values", "super-insulated", "air infiltration" etc,

* lower case filter

I have currently been using Lucene 3.0.1 with a custom filter to achieve the above bullet points, then using Luke to pick up phrases and entities from text by looking into the generated index, however I found that this was very time consuming. My intention is to pass the pdf document as input and receive the above as output which I can then use to manually construct my ontology from entities and their relationships.

I previously posted this to the Tika list with no response, so again I apologise if this is not a problem for the Lucene java list. Can anyone suggest a possible solution to the problem.

Any help would be great ;0) Thanks

Lewis

Glasgow Caledonian University is a registered Scottish charity, number SC021474

Winner: Times Higher Education's Widening Participation Initiative of the Year 2009 and Herald Society's Education Initiative of the Year 2009
http://www.gcu.ac.uk/newsevents/news/bycategory/theuniversity/1/name,6219,en.html

Re: Keyword extraction from pdf to text

Posted by Ian Lea <ia...@gmail.com>.

If I've understood you correctly, you want to pump text into a lucene
Analyzer and grab the output and do something else with that.  If that
is right, you can use code based on something like this:

	for (String s : array-of-input-texts) {
	    Analyzer anl = new xxxAnalyzer(whatever);
	    TokenStream stream = anl.tokenStream("", new StringReader(s));
	    TermAttribute term = (TermAttribute)
stream.addAttribute(TermAttribute.class);
            while (stream.incrementToken()) {
	      System.out.println(" "+term.term());
	    }
	}

There is code in the second edition of Lucene In Action called
something like AnalyzerUtils that may have inspired the code above.
Reading a copy of that book is an excellent idea for any lucene
related topic.


--
Ian.


On Tue, Nov 30, 2010 at 5:09 PM, McGibbney, Lewis John
<Le...@gcu.ac.uk> wrote:
> Hello list,
>
> I am currently attempting to extract keywords from pdf documents, my aim is then to begin constructing a domain ontology using the words which are extracted. I do not need to index anything at this stage, but wish to extract and push the output as plain text into a text file. An example of input text from the pdf document would be as follows
> ________________________________
> 6.1.3 Calculating carbon dioxide emissions for the proposed dwelling
> The second calculation involves establishing the carbon dioxide emissions
> for the proposed dwelling (DER). To do this the values proposed for the
> dwelling should be used in the methodology i.e. the U-values, air infiltration,
> heating system, etc.
> The exceptions to entering the dwelling specific values are:
> a. it may be assumed that all glazing is orientated east/west;
> b. average overshading may be assumed if not known. 'Very little' shading
> should not be entered;
> c. 2 sheltered sides should be assumed if not known. More than 2 sheltered
> sides should not be entered;
> d. where secondary heating is proposed, if a chimney or flue is present but
> no appliance installed the worst case should be assumed i.e. a decorative
> fuel-effect gas appliance with 20% efficiency. If there is no gas point, an
> open fire with 37% efficiency should be assumed, burning solid mineral
> fuel for dwellings outwith a smokeless zone and smokeless solid mineral
> fuel for those that are within such a zone.
> All other values can be varied, but before entering values into the
> methodology, reference should be made to:
> * the back-stop U-values identified in guidance to standard 6.2; and
> * guidance on systems and equipment within standards 6.3 to 6.6.
> ________________________________
> My requirements are as follows
>
>
> *         drop stop words
>
> *         be able to pick up Bi Grams or NGrams such as the following "U-Values", "super-insulated", "air infiltration" etc,
>
> *         lower case filter
>
> I have currently been using Lucene 3.0.1 with a custom filter to achieve the above bullet points, then using Luke to pick up phrases and entities from text by looking into the generated index, however I found that this was very time consuming. My intention is to pass the pdf document as input and receive the above as output which I can then use to manually construct my ontology from entities and their relationships.
>
> I previously posted this to the Tika list with no response, so again I apologise if this is not a problem for the Lucene java list. Can anyone suggest a possible solution to the problem.
>
> Any help would be great ;0) Thanks
>
> Lewis
>
>
> Glasgow Caledonian University is a registered Scottish charity, number SC021474
>
> Winner: Times Higher Education's Widening Participation Initiative of the Year 2009 and Herald Society's Education Initiative of the Year 2009
> http://www.gcu.ac.uk/newsevents/news/bycategory/theuniversity/1/name,6219,en.html
>

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org