Posted to dev@tika.apache.org by "Cao, Renzhi (MU-Student)" <rc...@mail.missouri.edu> on 2015/10/21 14:45:25 UTC

Re: Questions about using the Tika

Dear all,
     I am interested in parsing information (such as name, skills, and
location) from PDF resumes, and it seems that Tika can do that. Could you please let me know whether this is possible, and point me to an example of how to use Tika to parse a resume? Thank you very much for your help!

Renzhi Cao
Graduate Research Assistant
Department of Computer Science
University of Missouri-Columbia
Columbia, MO 65211
Cell: 573-825-8874
Email : rcrg4@mail.missouri.edu
http://web.missouri.edu/~rcrg4/

________________________________________
From: Mattmann, Chris A (3980) <ch...@jpl.nasa.gov>
Sent: Wednesday, October 21, 2015 12:14 AM
To: Cao, Renzhi (MU-Student); dev-owner@tika.apache.org
Subject: Re: Questions about using the Tika

Please subscribe by sending email to dev-subscribe@tika.apache.org
and then once you are subscribed post the below to dev@tika.apache.org.

Cheers!

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data Systems Section (398)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 168-519, Mailstop: 168-527
Email: chris.a.mattmann@nasa.gov
WWW:  http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++





-----Original Message-----
From: "Cao, Renzhi (MU-Student)" <rc...@mail.missouri.edu>
Date: Tuesday, October 20, 2015 at 9:45 PM
To: "dev-owner@tika.apache.org" <de...@tika.apache.org>
Subject: Questions about using the Tika

>Dear editor of Tika project,
>     I am interested in parsing the information (like name, skill,
>location and etc) from the PDF resume, and I see that it seems Tika can
>do that. Could you please let me know if it is possible or any example of
>how to use Tika to parse the resume? Thank you very much for your help!
>
>Renzhi Cao
>Graduate Research Assistant
>Department of Computer Science
>University of Missouri-Columbia
>Columbia, MO 65211
>Cell: 573-825-8874
>Email : rcrg4@mail.missouri.edu
>http://web.missouri.edu/~rcrg4/

Re: Questions about using the Tika

Posted by Konstantin Gribov <gr...@gmail.com>.
I think you can take a look at GROBID[1] to see an approach to such
tagging problems. GROBID is designed to extract bibliographical data from
scientific publications (such as authors, their affiliations, abstracts,
bibliographical links, etc.).

[1]: https://github.com/kermitt2/grobid

On Wed, Oct 21, 2015 at 16:21, Allison, Timothy B. <ta...@mitre.org> wrote:

> Bouncing to user@tika...
>
> If the PDFs have fixed fields (AcroForm), then that should be easy enough
> to parse out of the xhtml that Tika produces, or you could go with straight
> PDFBox.
>
> If (as I suspect), these are free text resumes, then Tika can help pull
> out the text, but then you're on your own and off into the land of natural
> language processing (or some great regexes) to do the slot filling that
> you're looking for.
>
> Oh, wait, don't forget that there's a chance that you might find useful
> information in the metadata of the PDF: author, company etc., but I have no
> idea how reliable that would be.
-- 
Best regards,
Konstantin Gribov

RE: Questions about using the Tika

Posted by "Allison, Timothy B." <ta...@mitre.org>.
Bouncing to user@tika...

If the PDFs have fixed form fields (AcroForm), then those should be easy enough to parse out of the XHTML that Tika produces, or you could use PDFBox directly.
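
As a rough illustration of the AcroForm case, here is a minimal Python sketch that pulls "name: value" pairs out of XHTML. The thread does not show the markup Tika actually emits for form fields, so the sample below is an assumption: a div with class "acroform" containing list items of the form "name: value". Check it against the XHTML your Tika version really produces before relying on it. The function name form_fields and the sample values are hypothetical.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "{http://www.w3.org/1999/xhtml}"

# Hypothetical XHTML standing in for Tika output. The "acroform" div and the
# "name: value" list items are an ASSUMED shape, not confirmed by the thread.
SAMPLE_XHTML = """<html xmlns="http://www.w3.org/1999/xhtml">
<body>
<div class="page"><p>Application form</p></div>
<div class="acroform">
<ol>
<li>name: Jane Doe</li>
<li>location: Columbia, MO</li>
<li>skills: Java, Python</li>
</ol>
</div>
</body>
</html>"""


def form_fields(xhtml):
    """Collect 'name: value' list items found inside div.acroform elements."""
    root = ET.fromstring(xhtml)
    fields = {}
    for div in root.iter(XHTML_NS + "div"):
        if div.get("class") != "acroform":
            continue
        for item in div.iter(XHTML_NS + "li"):
            text = "".join(item.itertext()).strip()
            # Split on the first colon only, so values may contain colons.
            name, sep, value = text.partition(":")
            if sep:
                fields[name.strip()] = value.strip()
    return fields


if __name__ == "__main__":
    print(form_fields(SAMPLE_XHTML))
```

Splitting on the first colon keeps the sketch short, but a field name that itself contains a colon would be cut in the wrong place.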

If (as I suspect) these are free-text resumes, then Tika can help pull out the text, but after that you're on your own and off into the land of natural language processing (or some great regexes) to do the slot filling that you're looking for.
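
To make the "some great regexes" route concrete, here is a minimal Python sketch of slot filling over text that has already been extracted. Everything in it is an assumption for illustration: the sample text is invented, the patterns only cover US-style phone numbers and "City, ST 12345" locations, and taking the first non-empty line as the name is a naive heuristic that real resumes will often break.

```python
import re

# Hypothetical plain text, standing in for what Tika would extract from a PDF.
SAMPLE_TEXT = """Jane Doe
Columbia, MO 65211
Cell: 573-555-0100
Email: jane.doe@example.edu

Skills: Java, Python, machine learning
"""

EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
PHONE_RE = re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b")
# A line such as "Skills: a, b, c"; the label is an assumed convention.
SKILLS_RE = re.compile(r"^Skills?\s*:\s*(.+)$", re.IGNORECASE | re.MULTILINE)
# City, two-letter state, five-digit ZIP on a line of its own (US-style only).
LOCATION_RE = re.compile(r"^[A-Z][A-Za-z .'-]+,\s*[A-Z]{2}\s+\d{5}$", re.MULTILINE)


def fill_slots(text):
    """Return a dict of slot name -> value; unmatched slots stay None or []."""
    slots = {"name": None, "email": None, "phone": None,
             "location": None, "skills": []}

    # Naive assumption: the first non-empty line is the candidate's name.
    for line in text.splitlines():
        if line.strip():
            slots["name"] = line.strip()
            break

    for slot, pattern in (("email", EMAIL_RE), ("phone", PHONE_RE),
                          ("location", LOCATION_RE)):
        match = pattern.search(text)
        if match:
            slots[slot] = match.group(0)

    skills = SKILLS_RE.search(text)
    if skills:
        slots["skills"] = [s.strip() for s in skills.group(1).split(",") if s.strip()]
    return slots


if __name__ == "__main__":
    print(fill_slots(SAMPLE_TEXT))
```

Patterns like these are brittle across resume layouts, which is why the NLP route (or a trained tagger) tends to be needed once the input gets varied.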

Oh, wait, don't forget that there's a chance that you might find useful information in the metadata of the PDF: author, company etc., but I have no idea how reliable that would be.
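
For the metadata idea, a small Python sketch of looking up a slot under several candidate keys may help. It assumes the metadata has already been obtained as a plain dict (for example from Tika's JSON output), that a value may be either a string or a list of strings, and that the key names listed are plausible candidates only; actual key names vary by Tika version and by how the PDF was produced, so treat them as assumptions to verify.

```python
# Candidate key names are ASSUMPTIONS; inspect your own Tika output to confirm.
AUTHOR_KEYS = ("Author", "dc:creator", "meta:author")
COMPANY_KEYS = ("Company", "extended-properties:Company")


def first_value(metadata, candidate_keys):
    """Return the first non-blank value found under any candidate key."""
    for key in candidate_keys:
        value = metadata.get(key)
        # Multi-valued entries are assumed to arrive as lists.
        if isinstance(value, (list, tuple)):
            value = next((v for v in value if str(v).strip()), None)
        if value is not None and str(value).strip():
            return str(value).strip()
    return None


if __name__ == "__main__":
    # Hypothetical metadata for illustration only.
    sample = {"dc:creator": ["", "Jane Doe"], "Company": "  "}
    print(first_value(sample, AUTHOR_KEYS))
    print(first_value(sample, COMPANY_KEYS))
```

Since the reliability of these fields is unknown, it seems safer to treat a metadata hit as a hint to cross-check against the extracted text rather than as the answer.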
