Posted to commits@jackrabbit.apache.org by Apache Wiki <wi...@apache.org> on 2010/12/10 12:07:07 UTC

[Jackrabbit Wiki] Update of "TextExtractorExamples" by JukkaZitting

Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Jackrabbit Wiki" for change notification.

The "TextExtractorExamples" page has been changed by JukkaZitting.
The comment on this change is: We use Tika now for type detection and text extraction.
http://wiki.apache.org/jackrabbit/TextExtractorExamples?action=diff&rev1=3&rev2=4

--------------------------------------------------

  ##language:en
  == Examples for writing your own TextExtractors ==
  
+ See [[http://tika.apache.org/|Apache Tika]]
- === Add Mime Types ===
- Extract "org\apache\jackrabbit\server\io\mimetypes.properties" from jackrabbit-jcr-server-*.jar and copy it to the same path, "org\apache\jackrabbit\server\io\mimetypes.properties", under your web project's classes folder. Then add the mime types declared by your text extractor classes to that file.
  
- {{{
- ...
- mht=message/rfc822
- msg=application/msoutlook
- csv=text/plain
- }}}
- 
- === Obtain Mime Type  ===
- To obtain the mime type from a file path, use {{{MimeResolver}}} when possible. Keep a single shared instance, because the constructor reads the mimetypes.properties file.
- 
- {{{
- public static MimeResolver mimeResolver = new MimeResolver();
- ...
- String contentType = mimeResolver.getMimeType(filePath);
- }}}
- 
- 
- === Ms Powerpoint ===
- To support text extraction from MS PowerPoint files, the code below uses Apache POI's HSLF component.
- 
- {{{
- /**
-  * Text extractor for Microsoft PowerPoint presentations.
-  */
- public class MsPowerPointTextExtractor extends AbstractTextExtractor {
- 
-     /**
-      * Force loading of dependent class.
-      */
-     static {
-         POIFSReader.class.getName();
-     }
- 
-     /**
-      * Creates a new <code>MsPowerPointTextExtractor</code> instance.
-      */
-     public MsPowerPointTextExtractor() {
-         super(new String[]{"application/vnd.ms-powerpoint",
-                            "application/mspowerpoint"});
-     }
- 
-     //-------------------------------------------------------< TextExtractor >
- 
-     /**
-      * {@inheritDoc}
-      */
-     public Reader extractText(InputStream stream,
-                               String type,
-                               String encoding) throws IOException {
-         try {
-         	
-         	CharArrayWriter writer = new CharArrayWriter();
-             SlideShow slideShow= new SlideShow(new HSLFSlideShow(stream));
-             Slide[] slides = slideShow.getSlides();
-             for (int i = 0; i < slides.length; i++) {
-             	Slide slide = slides[i];
-             	/* Optional */
-             	if(StringUtils.isNotEmpty(slide.getTitle())) {
-             		writer.append(slide.getTitle() + " ");
-             	}
-             	TextRun[] textRuns = slide.getTextRuns();
-             	for (int j = 0; j < textRuns.length; j++) {
-             		writer.append(textRuns[j].getText() + " ");
-             	}
-             }
-             
-             return new CharArrayReader(writer.toCharArray());
-             
-         } finally {
-             stream.close();
-         }
-     }
- }
- }}} 
- 
- === Ms Mhtml ===
- Mht files are actually based on "message/rfc822", so we could write {{{MsMHTMLTextExtractor}}} like this:
- 
- {{{
- public class MsMHTMLTextExtractor extends AbstractTextExtractor {
- 
- 	/**
- 	 * Creates a new <code>MsMHTMLTextExtractor</code> instance.
- 	 */
- 	public MsMHTMLTextExtractor() {
- 		super(new String[] { "message/rfc822" });
- 	}
- 
- 	// -------------------------------------------------------< TextExtractor >
- 
- 	/**
- 	 * {@inheritDoc}
- 	 */
- 	@SuppressWarnings("unchecked")
- 	public Reader extractText(InputStream stream, String type, String encoding)
- 			throws IOException {
- 		try {
- 			MimeMessage mm = new MimeMessage(null, stream);
- 			StringBuffer sb = new StringBuffer();
- 
- 			getMHTMLContent(mm, sb);
- 
- 			return new StringReader(sb.toString());
- 		} catch (Exception e) {
- 			return new StringReader("");
- 		} finally {
- 			stream.close();
- 		}
- 	}
- 
- 	/**
- 	 * Parse message/rfc822 part recursively
- 	 */
- 	public void getMHTMLContent(Part part, StringBuffer sb) throws Exception {
- 		
- 		if (part.isMimeType("text/plain")) {
- 			sb.append((String) part.getContent());
- 		} else if (part.isMimeType("text/html")) {
- 
- 			TransformerFactory factory = TransformerFactory.newInstance();
- 			Transformer transformer = factory.newTransformer();
- 			HTMLParser parser = new HTMLParser();
- 			SAXResult result = new SAXResult(new DefaultHandler());
- 
- 			SAXSource source = new SAXSource(parser, new InputSource(part
- 					.getInputStream()));
- 			transformer.transform(source, result);
- 
- 			sb.append(parser.getContents());
- 
- 		} else if (part.isMimeType("multipart/*")) {
- 			Multipart multipart = (Multipart) part.getContent();
- 			int counts = multipart.getCount();
- 			for (int i = 0; i < counts; i++) {
- 				getMHTMLContent(multipart.getBodyPart(i), sb);
- 			}
- 		} else if (part.isMimeType("message/rfc822")) {
- 			getMHTMLContent((Part) part.getContent(), sb);
- 		} else {
- 			//
- 		}
- 	}
- }
- }}}
- 
- === Ms Outlook ===
- Msg files are used by MS Outlook to save email content and are widely used in many organizations. Apache POI's HSMF component aims to support reading and writing msg files, but so far it can only read the plain text through the {{{MAPIMessage}}} utility class. If you need to extract attachments from msg files, the POI filesystem (POIFS) may help.
- 
- Todo: msg example here~
-