
Posted to dev@joshua.apache.org by Lewis John Mcgibbney <le...@gmail.com> on 2016/05/20 15:45:29 UTC

Fwd: May 2016 Newsletter – LDC

Hi Folks,
I've ended up as the primary JPL organizational rep for the Linguistic Data
Consortium. They produce monthly newsletters (see below for most
recent) which I will be forwarding to dev@ Joshua from now on.
They are pretty cool, especially the new datasets they publish.
Lewis

---------- Forwarded message ----------
From: *Mcgibbney, Lewis J (398M)* <Le...@jpl.nasa.gov>
Date: Friday, May 20, 2016
Subject: Fwd: May 2016 Newsletter – LDC
To: "lewis.mcgibbney@gmail.com" <le...@gmail.com>

Sent from my iPhone

Begin forwarded message:

*From:* Linguistic Data Consortium <ldc@ldc.upenn.edu>
*Date:* May 16, 2016 at 8:20:33 AM PDT
*To:* Linguistic Data Consortium <ldc@ldc.upenn.edu>
*Subject:* *May 2016 Newsletter – LDC*

*In this newsletter:*

*LDC at LREC 2016*

*New publications:*

SDP 2014 & 2015: Broad Coverage Semantic Dependency Parsing

GALE Phase 4 Chinese Broadcast Conversation Speech

GALE Phase 4 Chinese Broadcast Conversation Transcripts

*LDC at LREC 2016*

LDC will attend the 10th Language Resources and Evaluation Conference
(LREC 2016), hosted by ELRA, the European Language Resources Association. The
conference will be held in Portorož, Slovenia, from May 23-28 and features a
broad range of sessions on language resources and human language
technologies research. Seven LDC staff members will be presenting current
work on topics including trends in HLT research, building language
resources for autism spectrum disorders, data management plans, rapid
development of morphological analyzers for typologically diverse languages,
selection criteria for low resource language programs, multi-language
speech collection for NIST LRE, novel incentives for collecting data and
annotation from people, and more.

Following the conference, LDC’s presented papers and posters will be
available on LDC’s Papers Page
<https://www.ldc.upenn.edu/language-resources/papers/ldc-papers>.

New Corpora

(1) SDP 2014 & 2015: Broad Coverage Semantic Dependency Parsing
<https://catalog.ldc.upenn.edu/LDC2016T10> consists of data, tools, system
results, and publications associated with the 2014 and 2015 tasks on
Broad-Coverage Semantic Dependency Parsing (SDP <http://sdp.delph-in.net/>)
conducted in conjunction with the International Workshop on Semantic
Evaluation (SemEval <http://alt.qcri.org/semeval2015/>) and was developed
by the SDP task organizers.

SemEval is an ongoing series of evaluations of computational semantic
analysis systems intended to explore the nature of meaning in language. It
evolved from the Senseval <http://www.senseval.org/> word sense
disambiguation series to include semantic analysis tasks outside of word
sense disambiguation.

This release is based on English, Chinese and Czech data from the following
resources: Treebank-2 LDC95T7 <https://catalog.ldc.upenn.edu/LDC95T7>,
Proposition Bank I LDC2004T14 <https://catalog.ldc.upenn.edu/LDC2004T14>,
NomBank v 1.0 LDC2008T23 <https://catalog.ldc.upenn.edu/LDC2008T23> and
CCGbank LDC2005T13 <https://catalog.ldc.upenn.edu/LDC2005T13> (English);
Chinese Treebank (e.g., Chinese Treebank 8.0 LDC2013T21
<https://catalog.ldc.upenn.edu/LDC2013T21>) (Chinese); and Prague
Dependency Treebank (e.g., Prague Dependency Treebank 2.0, LDC2006T01
<https://catalog.ldc.upenn.edu/LDC2006T01>) (Czech).

The results are presented as graphs in three target representations:
MRS-Derived Semantic Dependencies (DM), Enju Predicate–Argument Structures
(PAS), and Prague Semantic Dependencies (PSD). As a fourth, additional
target representation, CCGbank was converted to semantic dependency graphs
(in the subdirectory ‘ccd’).

SDP 2014 & 2015: Broad Coverage Semantic Dependency Parsing is distributed
via web download.

2016 Subscription Members will automatically receive two copies of this
corpus. 2016 Standard Members may request a copy as part of their 16 free
membership corpora. Non-members may license this data for US $400.

*

(2) GALE Phase 4 Chinese Broadcast Conversation Speech
<https://catalog.ldc.upenn.edu/LDC2016S03> was developed by LDC and
comprises approximately 172 hours of Mandarin Chinese broadcast
conversation speech collected in 2008 by LDC and Hong Kong University of
Science and Technology during Phase 4 of the DARPA GALE (Global Autonomous
Language Exploitation) Program.

Corresponding transcripts are released as GALE Phase 4 Chinese Broadcast
Conversation Transcripts (LDC2016T12
<http://catalog.ldc.upenn.edu/LDC2016T12>).

The broadcast conversation recordings in this release feature interviews,
call-in programs and roundtable discussions focusing principally on current
events and are contained in 236 audio files presented in FLAC
<http://flac.sourceforge.net/>-compressed Waveform Audio File format
(.flac), 16000 Hz single-channel 16-bit PCM. Each file was audited by a
native Chinese speaker following Audit Procedure Specification Version 2.0,
which is included in this release.
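
As a rough sanity check on the figures above, the following Python sketch
estimates the uncompressed size of the audio and the average file length.
It uses only the numbers quoted in the newsletter (172 hours, 236 files,
16000 Hz, single channel, 16-bit); the function names are made up for
illustration, and the actual FLAC download will be smaller by an amount
the newsletter does not state.

```python
# Back-of-the-envelope figures derived from the numbers quoted above.
# Function names are hypothetical; nothing here comes from LDC tooling.
HOURS = 172             # approximate total duration
FILES = 236             # number of audio files
SAMPLE_RATE_HZ = 16000  # samples per second
BYTES_PER_SAMPLE = 2    # 16-bit PCM
CHANNELS = 1            # single-channel


def uncompressed_bytes(hours, rate_hz=SAMPLE_RATE_HZ,
                       width=BYTES_PER_SAMPLE, channels=CHANNELS):
    """Size of the raw PCM payload in bytes, ignoring container headers."""
    return int(hours * 3600 * rate_hz * width * channels)


def mean_minutes_per_file(hours, files):
    """Average duration of one file in minutes."""
    return hours * 60 / files


if __name__ == "__main__":
    print(f"{uncompressed_bytes(HOURS) / 1e9:.1f} GB of raw PCM")
    print(f"{mean_minutes_per_file(HOURS, FILES):.1f} minutes per file on average")
```

Since the duration is only approximate, both results are estimates.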

GALE Phase 4 Chinese Broadcast Conversation Speech is distributed via web
download.

2016 Subscription Members will automatically receive two copies of this
corpus. 2016 Standard Members may request a copy as part of their 16 free
membership corpora. Non-members may license this data for US $2000.

*

(3) GALE Phase 4 Chinese Broadcast Conversation Transcripts
<https://catalog.ldc.upenn.edu/LDC2016T12> was developed by LDC and
contains transcriptions of approximately 172 hours of Chinese broadcast
conversation speech collected in 2008 by LDC and Hong Kong University of
Science and Technology during Phase 4 of the DARPA GALE (Global Autonomous
Language Exploitation) Program.

Corresponding audio data is released as GALE Phase 4 Chinese Broadcast
Conversation Speech (LDC2016S03 <https://catalog.ldc.upenn.edu/LDC2016S03>).

The transcript files are in plain-text, tab-delimited format (TDF) with
UTF-8 encoding, and the transcribed data totals 2,259,952 tokens.
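
A minimal Python sketch of reading such a file is shown below, assuming
only what is stated above: UTF-8 text with tab-separated fields. The
newsletter does not list the columns, so the reader returns each row as a
plain list of strings, and the sample row in the usage section is invented
purely for illustration. The function names are hypothetical.

```python
import csv
import io


def read_tdf(stream):
    """Yield each non-empty row of a tab-delimited text stream as a list of fields."""
    # QUOTE_NONE keeps quote characters in the transcript text untouched.
    reader = csv.reader(stream, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if row:  # csv.reader returns [] for blank lines
            yield row


def open_tdf(path):
    """Open a transcript file as UTF-8 text, as the csv module recommends."""
    return open(path, encoding="utf-8", newline="")


if __name__ == "__main__":
    # Invented sample data; the real column layout is not given in the newsletter.
    sample = io.StringIO("file_a\t0.0\t1.5\t你好\n\nfile_a\t1.5\t3.0\t再见\n")
    for fields in read_tdf(sample):
        print(len(fields), fields[-1])
```

For a real file, pass the object returned by open_tdf(path) to read_tdf
and consult the documentation shipped with the corpus for what each column
means.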

The files in this corpus were transcribed by LDC staff and/or by
transcription vendors under contract to LDC. Transcribers followed LDC’s
quick transcription guidelines (QTR) and quick rich transcription
specification (QRTR). QTR transcription consists of quick (near-) verbatim,
time-aligned transcripts plus speaker identification with minimal
additional mark-up. QRTR adds structural information such as
topic boundaries and manual sentence unit annotation.

GALE Phase 4 Chinese Broadcast Conversation Transcripts is distributed via
web download.

2016 Subscription Members will automatically receive two copies of this
corpus. 2016 Standard Members may request a copy as part of their 16 free
membership corpora. Non-members may license this data for US $1500.

-- 
Membership Office
Linguistic Data Consortium
University of Pennsylvania
3600 Market St. Suite 810
Philadelphia, PA 19104
Tel: 215-573-1275
Email: ldc@ldc.upenn.edu
Fax: 215-573-2175

-- 
*Lewis*

Re: May 2016 Newsletter – LDC

Posted by "Mattmann, Chris A (3980)" <ch...@jpl.nasa.gov>.

Thanks Lewis. I’m also an org rep for NASA at LDC, and also via my
USC hat. Good show.

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data Systems Section (398)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 168-519, Mailstop: 168-527
Email: chris.a.mattmann@nasa.gov
WWW:  http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Director, Information Retrieval and Data Science Group (IRDS)
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
WWW: http://irds.usc.edu/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++