Posted to dev@joshua.apache.org by lewis john mcgibbney <le...@apache.org> on 2017/03/17 19:20:43 UTC
Fwd: FW: March 2017 Newsletter -- LDC

Hi Team,
Please see below for LDC March Newsletter.
Lewis

---------- Forwarded message ----------
From: Mcgibbney, Lewis J (398M) <Le...@jpl.nasa.gov>
Date: Fri, Mar 17, 2017 at 12:16 PM
Subject: FW: March 2017 Newsletter -- LDC
To: Lewis John McGibbney <le...@gmail.com>






Dr. Lewis John McGibbney Ph.D., B.Sc.

Data Scientist II

Computer Science for Data Intensive Applications Group 398M

Jet Propulsion Laboratory

California Institute of Technology

4800 Oak Grove Drive

Pasadena, California 91109-8099

Mail Stop : 158-256C

Tel:  (+1) (818)-393-7402

Cell: (+1) (626)-487-3476

Fax:  (+1) (818)-393-1190

Email: lewis.j.mcgibbney@jpl.nasa.gov







 Dare Mighty Things



*From: *Ldc-customers1 <ld...@ldc.upenn.edu> on behalf of
Penn LDC <ld...@ldc.upenn.edu>
*Date: *Friday, March 17, 2017 at 7:49 AM
*To: *Penn LDC <ld...@ldc.upenn.edu>
*Subject: *March 2017 Newsletter -- LDC



*In this newsletter*

BOLT Chinese Discussion Forum Parallel Training Data
<https://catalog.ldc.upenn.edu/LDC2017T05>

IARPA Babel Swahili Language Pack IARPA-babel202b-v1.0d
<https://catalog.ldc.upenn.edu/LDC2017S05>

Noisy TIMIT Speech <https://catalog.ldc.upenn.edu/LDC2017S04>

GALE English-Chinese Parallel Aligned Treebank -- Training
<https://catalog.ldc.upenn.edu/LDC2017T06>

*New Corpora*

(1) BOLT Chinese Discussion Forum Parallel Training Data
<https://catalog.ldc.upenn.edu/LDC2017T05> was developed by LDC and
consists of 1,876,799 tokens of Chinese discussion forum data collected for
the DARPA BOLT program along with their corresponding English translations.

The BOLT <https://www.ldc.upenn.edu/collaborations/current-projects/bolt>
(Broad Operational Language Translation) program developed machine
translation and information retrieval for less formal genres, focusing
particularly on user-generated content. LDC supported the BOLT program by
collecting informal data sources -- discussion forums, text messaging and
chat -- in Chinese, Egyptian Arabic and English. The collected data was
translated and annotated for various tasks including word alignment,
treebanking, propbanking and co-reference.

The source data in this release consists of discussion forum threads
harvested from the Internet by LDC using a combination of manual and
automatic processes. The full source data collection is released as BOLT
Chinese Discussion Forums (LDC2016T05
<https://catalog.ldc.upenn.edu/LDC2016T05>). Word-aligned and tagged data
is released as BOLT Chinese-English Word Alignment and Tagging - Discussion
Forum Training (LDC2016T19 <https://catalog.ldc.upenn.edu/LDC2016T19>).

BOLT Chinese Discussion Forum Parallel Training Data is distributed via web
download.



2017 Subscription Members will automatically receive copies of this corpus.
2017 Standard Members may request a copy as part of their 16 free
membership corpora. Non-members may license this data for US $1750.

*

(2) IARPA Babel Swahili Language Pack IARPA-babel202b-v1.0d
<https://catalog.ldc.upenn.edu/LDC2017S05> was developed by Appen for the
IARPA (Intelligence Advanced Research Projects Activity) Babel program. It
contains approximately 200 hours of Swahili conversational and scripted
telephone speech collected from 2012 to 2014 along with corresponding
transcripts.

The Babel program focuses on underserved languages and seeks to develop
speech recognition technology that can be rapidly applied to any human
language to support keyword search performance over large amounts of
recorded speech.



The Swahili speech in this release represents that spoken in the Nairobi
dialect region of Kenya. The gender distribution among speakers is
approximately equal; speakers' ages range from 16 years to 65 years. Calls
were made using different telephones (e.g., mobile, landline) from a
variety of environments including the street, a home or office, a public
place, and inside a vehicle.



Transcripts are encoded in UTF-8.



IARPA Babel Swahili Language Pack IARPA-babel202b-v1.0d is distributed via
web download.



2017 Subscription Members will receive copies of this corpus provided they
have submitted a completed copy of the special license agreement. 2017
Standard Members may request a copy as part of their 16 free membership
corpora. Non-members may license this data for US $25.

*

(3) Noisy TIMIT Speech <https://catalog.ldc.upenn.edu/LDC2017S04> was
developed by the Florida Institute of Technology <http://www.fit.edu/> and
contains approximately 322 hours of speech from the TIMIT Acoustic-Phonetic
Continuous Speech Corpus (LDC93S1 <https://catalog.ldc.upenn.edu/LDC93S1>)
modified with different additive noise levels. Only the audio has been
modified; the original arrangement of the TIMIT corpus is still as
described by the TIMIT documentation.

The additive noise types are white, pink, blue, red, violet and babble, with
levels varying in 5 dB (decibel) steps, ranging from 5 to 50 dB. The color
noise types were generated artificially using MATLAB. The babble noise was
selected from a random segment of recorded babble speech scaled relative to
the power of the original TIMIT audio signal.
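
As a rough illustration of scaling noise relative to signal power, here is
a minimal Python sketch. It assumes the stated levels are signal-to-noise
ratios in dB, which the newsletter does not say explicitly. The function
names and the toy sine/white-noise inputs are hypothetical and are not
taken from the corpus documentation or its MATLAB tooling.

```python
import math
import random


def mean_power(samples):
    """Average power (mean of squared amplitudes) of a sample sequence."""
    return sum(x * x for x in samples) / len(samples)


def mix_at_snr(signal, noise, snr_db):
    """Return signal + scaled noise so that the mix has the requested SNR in dB."""
    # From SNR_dB = 10 * log10(P_signal / P_noise), solve for the noise power.
    target_noise_power = mean_power(signal) / (10 ** (snr_db / 10))
    gain = math.sqrt(target_noise_power / mean_power(noise))
    return [s + gain * n for s, n in zip(signal, noise)]


# Toy stand-ins for real audio: a 440 Hz tone and Gaussian white noise,
# one second each at an assumed 16 kHz sample rate.
rng = random.Random(0)
signal = [math.sin(2 * math.pi * 440 * t / 16000) for t in range(16000)]
noise = [rng.gauss(0.0, 1.0) for _ in range(16000)]

# One noisy copy per level, from 5 to 50 dB in 5 dB steps.
noisy_by_level = {snr: mix_at_snr(signal, noise, snr) for snr in range(5, 55, 5)}
```

Under this assumption a higher level means less noise: the 50 dB copy is
close to the clean signal, while the 5 dB copy is heavily degraded.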

Noisy TIMIT Speech is distributed via web download.



2017 Subscription Members will automatically receive copies of this corpus.
2017 Standard Members may request a copy as part of their 16 free
membership corpora. Non-members may license this data for US $400.

*

(4) GALE English-Chinese Parallel Aligned Treebank -- Training
<https://catalog.ldc.upenn.edu/LDC2017T06> was developed by LDC and
contains 196,123 tokens of word-aligned English and Chinese parallel text
with treebank annotations. This material was used as training data in the
DARPA GALE (Global Autonomous Language Exploitation) program.

Parallel aligned treebanks are treebanks annotated with morphological and
syntactic structures aligned at the sentence level and the sub-sentence
level. Such data sets are useful for natural language processing and
related fields, including automatic word alignment system training and
evaluation, transfer-rule extraction, word sense disambiguation,
translation lexicon extraction and cultural heritage and cross-linguistic
studies. With respect to machine translation system development, parallel
aligned treebanks may improve system performance with enhanced syntactic
parsers, better rules and knowledge about language pairs and reduced word
error rate.

The English source data was translated into Chinese. Chinese and English
treebank annotations were performed independently. The parallel texts were
then word aligned. The material in this release corresponds to portions of
the treebanked data in OntoNotes 3.0 (LDC2009T24
<https://catalog.ldc.upenn.edu/LDC2009T24>) and OntoNotes 4.0 (LDC2011T03
<https://catalog.ldc.upenn.edu/LDC2011T03>).

This release consists of English source broadcast programming (CNN,
NBC/MSNBC) and web data collected by LDC in 2005 and 2006.

GALE English-Chinese Parallel Aligned Treebank -- Training is distributed
via web download.



2017 Subscription Members will automatically receive copies of this corpus.
2017 Standard Members may request a copy as part of their 16 free
membership corpora. Non-members may license this data for US $1750.



Membership Office

Linguistic Data Consortium <http://ldc.upenn.edu>

University of Pennsylvania

T: +1-215-573-1275

E: ldc@ldc.upenn.edu

M: 3600 Market St. Suite 810

      Philadelphia, PA 19104









-- 
http://home.apache.org/~lewismc/
@hectorMcSpector
http://www.linkedin.com/in/lmcgibbney