Posted to commits@beam.apache.org by "Kenneth Knowles (JIRA)" <ji...@apache.org> on 2017/05/15 18:40:04 UTC

[jira] [Commented] (BEAM-1439) Beam Example(s) exploring public document datasets

    [ https://issues.apache.org/jira/browse/BEAM-1439?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16011085#comment-16011085 ] 

Kenneth Knowles commented on BEAM-1439:
---------------------------------------

This was not selected as a GSOC project, but would still make a superb contribution to Beam's examples.

> Beam Example(s) exploring public document datasets
> --------------------------------------------------
>
>                 Key: BEAM-1439
>                 URL: https://issues.apache.org/jira/browse/BEAM-1439
>             Project: Beam
>          Issue Type: Wish
>          Components: examples-java
>            Reporter: Kenneth Knowles
>            Assignee: Kenneth Knowles
>            Priority: Minor
>              Labels: gsoc2017, java, mentor, python
>
> In Beam, we have examples that count the occurrences of words and perform a basic TF-IDF analysis on the works of Shakespeare (or whatever you point them at). It would be even cooler to run these analyses, and more, on a much larger data set that is the subject of current research.
> In chatting with professors at the University of Washington, I've learned that scholars in many fields would really like to explore new and highly customized ways of processing the growing body of publicly available scholarly documents, such as PubMed Central. An example query: "show me documents where chemical compounds X and Y were both used in the 'method' section".
> So I propose a Google Summer of Code project wherein a student writes some large-scale Beam pipelines to perform analyses such as term frequency, bigram frequency, etc.
> Skills required:
>  - Java or Python
>  - (nice to have) Having worked through the Beam getting started materials
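
As a rough illustration of the analyses named in the description above (term frequency and bigram frequency), here is a minimal sketch in plain Python. It is not Beam code and not taken from the Beam examples; the function names (tokenize, term_counts, bigram_counts, relative_frequencies) and the simple regex tokenizer are assumptions made for illustration only. In a real pipeline, the per-document tokenizing and the counting would presumably be expressed as Beam transforms and run over a much larger corpus.

```python
import re
from collections import Counter
from itertools import chain

# Hypothetical tokenizer: lowercase runs of letters, digits, and apostrophes.
TOKEN_RE = re.compile(r"[a-z0-9']+")


def tokenize(text):
    """Split one document into lowercase tokens."""
    return TOKEN_RE.findall(text.lower())


def term_counts(documents):
    """Count how often each term occurs across all documents."""
    return Counter(chain.from_iterable(tokenize(doc) for doc in documents))


def bigram_counts(documents):
    """Count adjacent token pairs; pairs never span two documents."""
    counts = Counter()
    for doc in documents:
        tokens = tokenize(doc)
        counts.update(zip(tokens, tokens[1:]))
    return counts


def relative_frequencies(counts):
    """Turn raw counts into fractions of the total number of occurrences."""
    total = sum(counts.values())
    if total == 0:
        return {}
    return {key: value / total for key, value in counts.items()}


if __name__ == "__main__":
    docs = [
        "Compound X reacts with compound Y.",
        "Compound Y is stable.",
    ]
    print(term_counts(docs).most_common(3))
    print(bigram_counts(docs).most_common(3))
```

Counting per document and then merging the counters mirrors the map-then-combine shape that a distributed pipeline would use, which is why the bigram counter deliberately avoids pairing the last token of one document with the first token of the next.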


