Posted to commits@tika.apache.org by Apache Wiki <wi...@apache.org> on 2014/06/02 19:02:16 UTC

[Tika Wiki] Update of "RecursiveMetadata" by NickBurch

Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Tika Wiki" for change notification.

The "RecursiveMetadata" page has been changed by NickBurch:
https://wiki.apache.org/tika/RecursiveMetadata?action=diff&rev1=6&rev2=7

Comment:
Show an example which tracks how far down you are

  === RecursiveMetadataParser Constructor ===
  {{{
     private static class RecursiveMetadataParser extends ParserDecorator {
- 
         public RecursiveMetadataParser(Parser parser) {
             super(parser);
         }
@@ -126, +125 @@

  
  By creating a new BodyContentHandler and passing that to {{{super.parse}}}, the text for each document is captured without mixing it with text from other documents.
  
+ = Tracking how far down the Rabbit Hole you have gone =
+ When using the code above on a container format that contains another container, you may wish to keep track of how deeply nested the current document is, and inside what. To do that, you'd want code something like:
+ 
+ {{{
+    private static class RecursiveMetadataParser extends ParserDecorator {
+        private String location;
+        private int unknownCount = 0;
+ 
+        public RecursiveMetadataParser(Parser parser, String location) {
+            super(parser);
+            this.location = location;
+            if (! this.location.endsWith("/")) {
+               this.location += "/";
+            }
+        }
+ 
+        @Override
+        public void parse(
+                InputStream stream, ContentHandler ignore,
+                Metadata metadata, ParseContext context)
+                throws IOException, SAXException, TikaException {
+            // Work out what this thing is
+            String objectName = null;
+            if (metadata.get(TikaMetadataKeys.RESOURCE_NAME_KEY) != null) {
+               objectName = metadata.get(TikaMetadataKeys.RESOURCE_NAME_KEY);
+            } else if (metadata.get(TikaMetadataKeys.EMBEDDED_RELATIONSHIP_ID) != null) {
+               objectName = metadata.get(TikaMetadataKeys.EMBEDDED_RELATIONSHIP_ID);
+            } else {
+               objectName = "embedded-" + (++unknownCount);
+            }
+            String objectLocation = this.location + objectName;
+ 
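+            // Fetch the contents, and recurse if possible
+            ContentHandler content = new BodyContentHandler();
+            // Remember the parser currently in the context, so that siblings of
+            //  this document are not reported underneath it once we return
+            Parser previous = context.get(Parser.class);
+            // The wrapped parser is private to ParserDecorator, so use the getter
+            context.set(Parser.class, new RecursiveMetadataParser(getWrappedParser(), objectLocation));
+            try {
+                super.parse(stream, content, metadata, context);
+            } finally {
+                context.set(Parser.class, previous);
+            }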
+ 
+            // Report what this one is
+            System.out.println("----");
+            System.out.println("Resource is " + objectLocation);
+            System.out.println("----");
+            System.out.println(metadata);
+            System.out.println("----");
+            System.out.println(content.toString());
+        }
+    }
+ }}}
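
The naming and path-building logic in the parser above can be tried out on its own, without Tika on the classpath. What follows is only a rough sketch of that logic: `LocationTracker`, `locate` and `descend` are hypothetical names made up for illustration, a plain `Map` stands in for Tika's `Metadata`, and the two key strings are assumed stand-ins for `RESOURCE_NAME_KEY` and `EMBEDDED_RELATIONSHIP_ID`.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical helper, not part of Tika: builds "container/child" style locations. */
final class LocationTracker {
    // Assumed stand-ins for the Tika metadata keys used in the parser above
    static final String RESOURCE_NAME = "resourceName";
    static final String RELATIONSHIP_ID = "embeddedRelationshipId";

    private final String location;  // always ends with "/"
    private int unknownCount = 0;   // numbers children that carry no usable name

    LocationTracker(String location) {
        this.location = location.endsWith("/") ? location : location + "/";
    }

    /** Picks a name for a child from its metadata, falling back to a counter. */
    String locate(Map<String, String> metadata) {
        String name = metadata.get(RESOURCE_NAME);
        if (name == null) {
            name = metadata.get(RELATIONSHIP_ID);
        }
        if (name == null) {
            name = "embedded-" + (++unknownCount);
        }
        return location + name;
    }

    /** Creates the tracker to use for the children of the given child location. */
    LocationTracker descend(String childLocation) {
        return new LocationTracker(childLocation);
    }
}
```

Starting from `new LocationTracker("/")`, a child whose map holds `resourceName=outer.zip` is located at `/outer.zip`. Calling `descend` with that location and then locating two unnamed children yields `/outer.zip/embedded-1` and `/outer.zip/embedded-2`. One tracker per nesting level is what keeps the counter for unnamed children separate at each level, just as each `RecursiveMetadataParser` keeps its own `unknownCount`.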
+ 
+ 
  = Surprise! Zips Have Text Too! =
  The great thing about AutoDetectParser is that it can parse and extract text from almost anything. In particular, it can parse zip, tar, tar.bz2, and other archives that contain documents. If you have a zip file with 100 text files in it, using Jukka's example code you can get the text and metadata for each file nested inside of the zip file. What you might not expect is that you also get metadata and body text for the zip file itself.