
Posted to users@opennlp.apache.org by Damiano Porta <da...@gmail.com> on 2016/09/24 14:04:05 UTC

Document categorization

Hello,
we need to categorize our documents into 80 sectors. These documents are
resumes/CVs.

We have many documents (more than 30k), but there is a problem.
Should we try to extract the job positions from each resume and
categorize those, or can we just pass in the entire document and assign it
to one or more categories (max 3 categories)?

I think there is a lot of noisy data that can give us many false positives
if we use the entire document, for example the personal data, hobbies, etc.

BUT

I also know that extracting every job position from all the documents will
take years!

Can anyone suggest a workaround?

Thank you so much!

Damiano