Posted to dev@joshua.apache.org by lewis john mcgibbney <le...@apache.org> on 2019/08/18 22:30:57 UTC

Fwd: FW: [EXTERNAL] August 2019 Newsletter - LDC

---------- Forwarded message ---------
From: Mcgibbney, Lewis J (398M) <le...@jpl.nasa.gov>
Date: Sun, Aug 18, 2019 at 3:29 PM
Subject: FW: [EXTERNAL] August 2019 Newsletter - LDC
To: lewis john mcgibbney <le...@apache.org>






Dr. Lewis John McGibbney Ph.D., B.Sc.(Hons)

Data Scientist III

Computer Science for Data Intensive Applications Group (398M)

Instrument Software and Science Data Systems Section (398)

Jet Propulsion Laboratory

California Institute of Technology

4800 Oak Grove Drive

Pasadena, California 91109-8099

Mail Stop : 158-256C

Tel:  (+1) (818)-393-7402

Cell: (+1) (626)-487-3476

Fax:  (+1) (818)-393-1190

Email: lewis.j.mcgibbney@jpl.nasa.gov

ORCID: orcid.org/0000-0003-2185-928X






 Dare Mighty Things



*From:* Ldc-customers1 <ld...@ldc.upenn.edu> on behalf of
Penn LDC <ld...@ldc.upenn.edu>
*Date:* Thursday, August 15, 2019 at 8:30 AM
*To:* Penn LDC <ld...@ldc.upenn.edu>
*Subject:* [EXTERNAL] August 2019 Newsletter - LDC




*In this newsletter:* Fall 2019 LDC Data Scholarship Program


*New Publications:* Corpus of Conversational Persian Transcripts
<https://catalog.ldc.upenn.edu/LDC2019T11>
TAC KBP Evaluation Source Corpora 2016-2017
<https://catalog.ldc.upenn.edu/LDC2019T12>
Multi-Language Conversational Telephone Speech 2011 -- East Asian
<https://catalog.ldc.upenn.edu/LDC2019S15>
IARPA Babel Igbo Language Pack IARPA-babel306b-v2.0c
<https://catalog.ldc.upenn.edu/LDC2019S16>



*Fall 2019 LDC Data Scholarship Program*

Students can apply for the Fall 2019 LDC Data Scholarship program now
through September 15, 2019. This scholarship program provides eligible
students with access to LDC data at no cost. For application requirements
and program rules, please visit the LDC Data Scholarship page
<https://www.ldc.upenn.edu/language-resources/data/data-scholarships>.




*New publications:*

(1) Corpus of Conversational Persian Transcripts
<https://catalog.ldc.upenn.edu/LDC2019T11> contains transcripts from
approximately 20 hours of naturally occurring informal conversations in the
Tehrani dialect of Iranian Persian.

This data set is extracted from 1,201 minutes of conversations among 22
participants (12 male and 10 female) who recorded their daily phone calls
and face-to-face interactions in a variety of informal settings.
Conversations represent various interaction types (dialogue and group
conversation), settings (home, office, car, café and restaurant), types of
relationship (family, couple, friend, acquaintance), and various
communicative goals (joking, explaining, arguing, and complaining, among
others). The corresponding speech is not included in this release.

The transcripts were annotated for gender, age, recording method, and
setting.

Corpus of Conversational Persian Transcripts is distributed via web
download.

2019 Subscription Members will automatically receive copies of this corpus.
2019 Standard Members may request a copy as part of their 16 free
membership corpora. Non-members may license this data for $100.

*

(2) TAC KBP Evaluation Source Corpora 2016-2017
<https://catalog.ldc.upenn.edu/LDC2019T12> was developed by LDC and
contains the 180,003 Chinese, English, and Spanish source documents used in
support of all TAC KBP evaluation tracks conducted in 2016
<https://tac.nist.gov/2016/KBP/index.html> and 2017
<https://tac.nist.gov/2017/index.html>.

The source data consists of Chinese, English, and Spanish discussion forum
and newswire text collected by LDC. Also provided are a series of lists and
tables to aid in the recreation of specific test sets.

Text Analysis Conference (TAC <https://tac.nist.gov/>) is a series of
workshops organized by the National Institute of Standards and Technology
(NIST <https://www.nist.gov/>), developed to encourage research in natural
language processing and related applications. The Knowledge Base Population
(KBP) track of TAC encourages the development of systems that can match
entities mentioned in natural texts with those appearing in a knowledge
base, extract novel information about entities from a document collection,
and add that information to a new or existing knowledge base.

TAC KBP Evaluation Source Corpora 2016-2017 is distributed via web
download.

2019 Subscription Members will automatically receive copies of this corpus.
2019 Standard Members may request a copy as part of their 16 free
membership corpora. Non-members may license this data for $500.

*

(3) Multi-Language Conversational Telephone Speech 2011 -- East Asian
<https://catalog.ldc.upenn.edu/LDC2019S15> was developed by LDC and
comprises approximately 19 hours of telephone speech in two distinct
languages of East Asia: Thai and Lao.

The data were collected primarily to support research and technology
evaluation in automatic language identification, and portions of these
telephone calls were used in the NIST 2011 Language Recognition Evaluation
(LRE <https://www.nist.gov/itl/iad/mig/2011-language-recognition-evaluation>).
Participants were recruited by native speakers who contacted acquaintances
in their social network. Those native speakers made one call of up to 15
minutes to each acquaintance. Calls are labeled by human auditors for
callee gender, dialect type, and noise.

LDC has also released the following as part of the Multi-Language
Conversational Telephone Speech 2011 series:

·         Slavic Group (LDC2016S11
<https://catalog.ldc.upenn.edu/LDC2016S11>)

·         Turkish (LDC2017S09 <https://catalog.ldc.upenn.edu/LDC2017S09>)

·         South Asian (LDC2017S14 <https://catalog.ldc.upenn.edu/LDC2017S14>)

·         Central Asian (LDC2018S03
<https://catalog.ldc.upenn.edu/LDC2018S03>)

·         Central European (LDC2018S08
<https://catalog.ldc.upenn.edu/LDC2018S08>)

·         Spanish (LDC2018S12 <https://catalog.ldc.upenn.edu/LDC2018S12>)

·         Arabic (LDC2019S02 <https://catalog.ldc.upenn.edu/LDC2019S02>)

·         English (LDC2019S06 <https://catalog.ldc.upenn.edu/LDC2019S06>)



Multi-Language Conversational Telephone Speech 2011 -- East Asian is
distributed via web download.

2019 Subscription Members will automatically receive copies of this corpus.
2019 Standard Members may request a copy as part of their 16 free
membership corpora. Non-members may license this data for $1,500.

*

(4) IARPA Babel Igbo Language Pack IARPA-babel306b-v2.0c
<https://catalog.ldc.upenn.edu/LDC2019S16> was developed by Appen
<http://www.appen.com/> for the IARPA (Intelligence Advanced Research
Projects Activity) Babel
<http://www.iarpa.gov/index.php/research-programs/babel> program. It
contains approximately 207 hours of Igbo conversational and scripted
telephone speech collected in 2014 and 2015 along with corresponding
transcripts.

The Igbo speech in this release represents the Owerri, Onitsha, and Ngwa
dialects spoken in Nigeria. The gender distribution among speakers is
approximately equal; speakers' ages range from 16 years to 67 years. Calls
were made using different telephones (e.g., mobile, landline) from a
variety of environments including the street, a home or office, a public
place, and inside a vehicle.

IARPA Babel Igbo Language Pack IARPA-babel306b-v2.0c is distributed via web
download.

2019 Subscription Members will receive copies of this corpus provided they
have submitted a completed copy of the special license agreement. 2019
Standard Members may request a copy as part of their 16 free membership
corpora. Non-members may license this data for $25.



*



Membership Office

Linguistic Data Consortium <http://ldc.upenn.edu>

University of Pennsylvania

T: +1-215-573-1275

E: ldc@ldc.upenn.edu

M: 3600 Market St. Suite 810

      Philadelphia, PA 19104










-- 
http://home.apache.org/~lewismc/
http://people.apache.org/keys/committer/lewismc