Posted to dev@joshua.apache.org by lewis john mcgibbney <le...@apache.org> on 2018/09/17 20:14:50 UTC
Fwd: FW: September 2018 Newsletter - LDC

---------- Forwarded message ---------
From: Mcgibbney, Lewis J (398M) <Le...@jpl.nasa.gov>
Date: Mon, Sep 17, 2018 at 12:39 PM
Subject: FW: September 2018 Newsletter - LDC
To: lewis john mcgibbney <le...@apache.org>






Dr. Lewis John McGibbney Ph.D., B.Sc.

Data Scientist II

Computer Science for Data Intensive Applications Group (398M)

Instrument Software and Science Data Systems Section (398)

Jet Propulsion Laboratory

California Institute of Technology

4800 Oak Grove Drive

Pasadena, California 91109-8099

Mail Stop : 158-256C

Tel:  (+1) (818)-393-7402

Cell: (+1) (626)-487-3476

Fax:  (+1) (818)-393-1190

Email: lewis.j.mcgibbney@jpl.nasa.gov

ORCID: orcid.org/0000-0003-2185-928X






 Dare Mighty Things



*From: *Ldc-customers1 <ld...@ldc.upenn.edu> on behalf of
Penn LDC <ld...@ldc.upenn.edu>
*Date: *Monday, September 17, 2018 at 12:09 PM
*To: *Penn LDC <ld...@ldc.upenn.edu>
*Subject: *September 2018 Newsletter - LDC



In this newsletter:


New Publications:

BOLT Information Retrieval Comprehensive Training and Evaluation
<https://catalog.ldc.upenn.edu/LDC2018T18>

HAVIC MED Event E051-E060 -- Videos, Metadata and Annotation
<https://catalog.ldc.upenn.edu/LDC2018V01>

Multi-Language Conversational Telephone Speech 2011 -- Spanish
<https://catalog.ldc.upenn.edu/LDC2018S12>

IARPA Babel Kazakh Language Pack IARPA-babel302b-v1.0a
<https://catalog.ldc.upenn.edu/LDC2018S13>




New publications:



(1) BOLT Information Retrieval Comprehensive Training and Evaluation
<https://catalog.ldc.upenn.edu/LDC2018T18> was developed by LDC and
consists of all data produced in support of the Information Retrieval (IR
<https://www.ldc.upenn.edu/collaborations/current-projects/bolt/information-retrieval>)
task within the DARPA Broad Operational Language Translation (BOLT)
Program, including annotations, source documents and scoring software.



The BOLT IR task sought to support development of systems that could take
as input a natural language English query sentence, return relevant
responses to that query from a large corpus of informal documents in the
three BOLT languages (Arabic, Chinese, and English) and translate responses
from non-English documents into English. This release contains (1)
natural-language IR queries, system responses to queries, and
manually-generated assessment judgments for system responses; (2)
discussion forum source documents in Arabic, Chinese and English; (3)
scoring software for each evaluation phase; and (4) experimental data
developed in Phase 2.



BOLT Information Retrieval Comprehensive Training and Evaluation is
distributed via web download.



2018 Subscription Members will automatically receive copies of this corpus.
2018 Standard Members may request a copy as part of their 16 free
membership corpora. Non-members may license this data for $2,500.

*

(2) HAVIC MED Event E051-E060 -- Videos, Metadata and Annotation
<https://catalog.ldc.upenn.edu/LDC2018V01> was developed by LDC and
consists of approximately 53 hours of user-generated videos with
annotation and metadata. To advance multimodal event detection and related
technologies, LDC developed, in collaboration with NIST
<https://www.nist.gov/> (the National Institute of Standards and
Technology), a large, heterogeneous, annotated multimodal corpus for HAVIC
<https://www.ldc.upenn.edu/collaborations/past-projects/havic> (the
Heterogeneous Audio Visual Internet Collection) that was used in the
NIST-sponsored MED
<https://www.nist.gov/itl/iad/mig/trecvid-multimedia-event-detection-evaluation-track>
(Multimedia Event Detection) task for several years. HAVIC MED Event
E051-E060 is a subset of that corpus, specifically, a collection of event
videos for the HAVIC Project originally released to support the 2016
Multimedia Event Detection task
<https://www.nist.gov/itl/iad/mig/med-2016-evaluation>.



The data consist of videos of various events (event videos) and videos
completely unrelated to events (background videos) harvested by a large
team of human annotators. Each event video was manually annotated with a
set of judgments describing its event properties and other salient
features. Background videos were labeled with topic and genre categories.



HAVIC MED Event E051-E060 -- Videos, Metadata and Annotation is distributed
via web download.



2018 Subscription Members will automatically receive copies of this corpus.
2018 Standard Members may request a copy as part of their 16 free
membership corpora. Non-members may license this data for $2,000.

*

(3) Multi-Language Conversational Telephone Speech 2011 -- Spanish
<https://catalog.ldc.upenn.edu/LDC2018S12> was developed by LDC and
consists of approximately 23 hours of telephone speech in Spanish.



The data were collected primarily to support research and technology
evaluation in automatic language identification, and portions of these
telephone calls were used in the NIST 2011 Language Recognition Evaluation
(LRE <https://www.nist.gov/itl/iad/mig/2011-language-recognition-evaluation>).
Participants were recruited by native speakers who contacted acquaintances
in their social network. Those native speakers made one call of up to 15
minutes to each acquaintance. Human auditors labeled the calls for callee
gender, dialect type, and noise.



LDC has also released the following as part of the Multi-Language
Conversational Telephone Speech 2011 series:



· Slavic Group (LDC2016S11 <https://catalog.ldc.upenn.edu/LDC2016S11>)
· Turkish (LDC2017S09 <https://catalog.ldc.upenn.edu/LDC2017S09>)
· South Asian (LDC2017S14 <https://catalog.ldc.upenn.edu/LDC2017S14>)
· Central Asian (LDC2018S03 <https://catalog.ldc.upenn.edu/LDC2018S03>)
· Central European (LDC2018S08 <https://catalog.ldc.upenn.edu/LDC2018S08>)



Multi-Language Conversational Telephone Speech 2011 -- Spanish is
distributed via web download.



2018 Subscription Members will automatically receive copies of this corpus.
2018 Standard Members may request a copy as part of their 16 free
membership corpora. Non-members may license this data for $1,500.

*

(4) IARPA Babel Kazakh Language Pack IARPA-babel302b-v1.0a
<https://catalog.ldc.upenn.edu/LDC2018S13> was developed by Appen
<http://www.appen.com/> for the IARPA (Intelligence Advanced Research
Projects Activity) Babel
<http://www.iarpa.gov/index.php/research-programs/babel> program. It
contains approximately 203 hours of Kazakh conversational and scripted
telephone speech collected in 2013 and 2014 along with corresponding
transcripts.



The Kazakh speech in this release represents that spoken in the
Northeastern and Southern dialect regions of Kazakhstan. The gender
distribution among speakers is approximately equal; speakers' ages range
from 16 years to 64 years. Calls were made using different telephones
(e.g., mobile, landline) from a variety of environments including the
street, a home or office, a public place, and inside a vehicle.



IARPA Babel Kazakh Language Pack IARPA-babel302b-v1.0a is available via web
download.



2018 Subscription Members will receive copies of this corpus provided they
have submitted a completed copy of the special license agreement. 2018
Standard Members may request a copy as part of their 16 free membership
corpora. Non-members may license this data for $25.





Membership Office

Linguistic Data Consortium <http://ldc.upenn.edu>

University of Pennsylvania

T: +1-215-573-1275

E: ldc@ldc.upenn.edu

M: 3600 Market St. Suite 810

      Philadelphia, PA 19104






-- 
http://home.apache.org/~lewismc/
http://people.apache.org/keys/committer/lewismc