Posted to dev@tika.apache.org by Chris Mattmann <ma...@apache.org> on 2014/03/02 06:52:11 UTC
Re: CSCI ASSIGNMENT QUESTION

Hi Mohamed,

RE: #1 you are definitely headed in the right direction, but I can't
directly tell you if that's the "correct" number of files :)
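
For reference, here is a minimal sketch of the per-file counting described
in the quoted message below. It assumes case-insensitive substring matching
(the quoted code uses a case-sensitive String.contains), and the class name
KeywordStats and its members are hypothetical, not part of Tika. The string
passed to record() stands in for the text returned by handler.toString().

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper, not part of Tika: tallies keyword hits per file.
class KeywordStats {
    // Number of files in which each keyword appears at least once.
    final Map<String, Integer> keywordCounts = new LinkedHashMap<>();
    // Number of files that contain at least one of the keywords.
    int numFilesWithKeywords = 0;

    KeywordStats(List<String> keywords) {
        for (String keyword : keywords) {
            keywordCounts.put(keyword, 0);
        }
    }

    // Call once per file with the text extracted from that file.
    void record(String content) {
        // Assumption: matching ignores case; drop the toLowerCase calls
        // to reproduce the case-sensitive behaviour of the quoted code.
        String haystack = content.toLowerCase(Locale.ROOT);
        boolean matchedAny = false;
        for (Map.Entry<String, Integer> entry : keywordCounts.entrySet()) {
            if (haystack.contains(entry.getKey().toLowerCase(Locale.ROOT))) {
                entry.setValue(entry.getValue() + 1);
                matchedAny = true;
            }
        }
        // The file-level counter moves at most once per file,
        // no matter how many keywords matched.
        if (matchedAny) {
            numFilesWithKeywords++;
        }
    }

    public static void main(String[] args) {
        KeywordStats stats = new KeywordStats(
                Arrays.asList("UFO", "flying disc", "disc", "saucer"));
        stats.record("A flying disc, possibly a UFO, was reported.");
        stats.record("No relevant terms here.");
        System.out.println("Files containing keyword(s): "
                + stats.numFilesWithKeywords);
        System.out.println("Per-keyword file counts: " + stats.keywordCounts);
    }
}
```

Because this is plain substring matching, "disc" also matches inside
"flying disc", so the per-keyword counts can overlap.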

RE: #2, go for it on the log4j issue.
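
The wording of the warning suggests log4j 1.x, which is printing the
message itself and so is presumably already on the classpath, just without
any configuration. One commonly used fix, sketched here under the
assumption that a file named log4j.properties is placed on the classpath,
is a minimal console configuration. The appender name "console" and the
WARN threshold are illustrative choices, not requirements.

```properties
# Send messages at WARN level and above to the console.
log4j.rootLogger=WARN, console
log4j.appender.console=org.apache.log4j.ConsoleAppender
log4j.appender.console.layout=org.apache.log4j.PatternLayout
log4j.appender.console.layout.ConversionPattern=%d{ISO8601} %-5p %c - %m%n
```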

Cheers,
Chris



-----Original Message-----
From: Mohamed Mustafa Rafik Khimani <kh...@usc.edu>
Date: Saturday, March 1, 2014 9:43 PM
To: Chris Mattmann <Ch...@jpl.nasa.gov>
Subject: Re: CSCI ASSIGNMENT QUESTION

>Hello professor Mattmann,
>
>
>Thank you for answering my questions.
>
>
>I realized there was a small mistake in the above code. I was incrementing
>the count of files containing keywords once for every keyword matched in a
>file, instead of only once per file regardless of how many keywords
>matched.
>
>
>My output statistics were as follows:
>
>
>Keyword(s) used: UFO, flying disc, disc, saucer, extraterrestrial craft,
>flying saucer
>No of files processed: 2067
>No of files containing keyword(s): 1139
>
>
>No of occurrences of each keyword:
>----------------------------------
>UFO: 121
>flying disc: 6
>disc: 989
>saucer: 14
>extraterrestrial craft: 0
>flying saucer: 9
>
>
>
>Whereas, I think the correct output count should be as follows:
>
>
>Keyword(s) used: UFO, flying disc, disc, saucer, extraterrestrial craft,
>flying saucer
>No of files processed: 2067
>No of files containing keyword(s): 999
>
>
>No of occurrences of each keyword:
>----------------------------------
>UFO: 121
>flying disc: 6
>disc: 989
>saucer: 14
>extraterrestrial craft: 0
>flying saucer: 9
>
>
>
>Please let me know if my understanding is correct.
>
>
>I have yet to look at the TIKA-93 issue, but I have a couple of other
>questions:
>
>
>1. To check whether a keyword is present in the PDF file, I am using
>the "contains" method of the String class:
>
>
>// Use the parser to parse each PDF file
>parser.parse(is, handler, metadata, new ParseContext());
>
>// Get the content of the PDF file as a string
>String content = handler.toString();
>
>boolean contains = false;
>
>// For every keyword, check whether it is present in the file and
>// update keyword_counts and num_fileswithkeywords accordingly
>for (String keyword : keywords)
>{
>    if (content.contains(keyword))
>    {
>
>
>
>I wanted to know whether this is the correct way to do it, or whether I am
>missing something here.
>
>
>2. I get a log4j warning each time I run my program:
>
>
>log4j:WARN No appenders could be found for logger
>(org.apache.pdfbox.pdfparser.XrefTrailerResolver).
>log4j:WARN Please initialize the log4j system properly.
>
>
>
>I looked this up online and found a solution, but it would require me to
>include the log4j jar file.
>Should I go ahead with this, and if so, do I need to include the log4j jar
>file in my submission? I understand that the commands to compile and run
>the program will change slightly, and I will document them in the
>readme.txt file.
>
>
>I plan to resolve the above 2 points before I look at the "check OCR
>quality before proceeding" step.
>
>
>Thank you so much for your time.
>
>
>Sincerely,
>
>
>Mohamed Mustafa Khimani
>
>
>
>
>
>On Sat, Feb 22, 2014 at 5:34 PM, Mattmann, Chris A (3980)
><ch...@jpl.nasa.gov> wrote:
>
>Hi Mohamed,
>
>Thank you for your question. Your code below looks like
>it's accomplishing the basics, and the requirements of
>assignment #1. BTW, I'm CC'ing dev@tika.apache.org.
>
>The "(optional) check OCR quality" step refers to the fact that
>in Tika 1.5, we rely on PDF parsing code that doesn't always
>extract the text characters correctly from PDF files. The step
>suggests that if you can figure out a way to check the quality of
>the extracted text, you may want to do so. Alternatively, you may
>want to check out TIKA-93 [1] and Grant's work there and help
>me test it out.
>
>Cheers,
>Chris
>
>[1] https://issues.apache.org/jira/browse/TIKA-93
>
>
>
>
>-----Original Message-----
>From: Mohamed Mustafa Rafik Khimani <kh...@usc.edu>
>
>Date: Saturday, February 22, 2014 5:23 PM
>To: Chris Mattmann <ma...@apache.org>
>Subject: Re: CSCI ASSIGNMENT QUESTION
>
>>Hello Professor Mattmann,
>>
>>
>>Thank you for your reply.
>>
>>
>>Currently, I am doing the following:
>>
>>
>>InputStream is = new BufferedInputStream(new FileInputStream(f));
>>Parser parser = new PDFParser();
>>ContentHandler handler = new BodyContentHandler(-1);
>>Metadata metadata = new Metadata();
>>
>>// Use the parser to parse each PDF file
>>parser.parse(is, handler, metadata, new ParseContext());
>>
>>// Get the content of the PDF file as a string
>>String content = handler.toString();
>>
>>// For every keyword, check whether it is present in the file and
>>// update keyword_counts and num_fileswithkeywords accordingly
>>for (String keyword : keywords)
>>{
>>    if (content.contains(keyword))
>>    {
>>        updatelog(keyword, f.getName());
>>        int count = keyword_counts.get(keyword);
>>        count++;
>>        keyword_counts.put(keyword, count);
>>        num_fileswithkeywords++;
>>    }
>>}
>>
>>
>>
>>I do not understand what "(optional) check OCR quality before proceeding"
>>means. Could you point me to where I can read about this?
>>
>>
>>The above code processes all the PDF files and prints the output as
>>needed. Could you please let me know if anything else is needed for the
>>assignment, apart from the optional OCR quality check?
>>
>>
>>Sincerely,
>>
>>
>>Mohamed Mustafa Khimani
>>
>>
>>
>>
>>
>>On Wed, Feb 19, 2014 at 7:33 PM, Chris Mattmann
>><ma...@apache.org> wrote:
>>
>>Thanks for your question Mohamed, feel free to send these
>>types of questions to dev@tika.apache.org. It would be a
>>great place to ask them and tell your classmates too.
>>
>>I'm copying the list on this message.
>>
>>(BTW you can then find the mail in Google and other
>>mail archives after that)
>>
>>Sometimes the MIME type is incorrectly detected, and
>>the best bet is to file a JIRA issue here in Tika:
>>
>>https://issues.apache.org/jira/browse/TIKA
>>
>>and then attach the sample PDF file for testing.
>>
>>If you have to preprocess a file for your specific
>>assignment in CS572, that's fine too: you can just
>>call the PDF parser directly from your Java code
>>and thereby bypass that step.
>>
>>HTH!
>>
>>Cheers,
>>Chris
>>
>>
>>------------------------
>>Chris Mattmann
>>chris.mattmann@gmail.com
>>
>>
>>
>>
>>-----Original Message-----
>>From: Mohamed Mustafa Rafik Khimani <kh...@usc.edu>
>>Date: Wednesday, February 19, 2014 12:56 PM
>>To: Chris Mattmann <ch...@gmail.com>
>>Subject: CSCI ASSIGNMENT QUESTION
>>
>>>Hello Professor Mattmann,
>>>I have a question regarding the Tika assignment. I was trying to read one
>>>of the PDF files downloaded from the vault. I was unable to read the file
>>>using the Tika class and its parse method, which returned null for each
>>>line.
>>>
>>>When I used the detect method to check the MIME type of the file, it
>>>returned audio/mpeg.
>>>
>>>I tried one of the known PDF files, which returned the correct MIME
>>>type and parsed correctly.
>>>
>>>I wanted to confirm whether I need to pre-process the file in any way
>>>before I can extract the contents, or whether there might be an issue
>>>with the PDF files that I have downloaded, in which case I should perhaps
>>>consider re-downloading them.
>>>
>>>I am following the Tika in Action book. I have read the first 4 chapters
>>>and will be reading the content extraction chapter next. I was trying a
>>>few things while reading the text, so I thought I would ask whether this
>>>is expected or whether I am going wrong somewhere.
>>>
>>>Thank you for your time.
>>>
>>>Sincerely,
>>>
>>>Mohamed Mustafa Khimani
>>>