Posted to dev@joshua.apache.org by Lewis John Mcgibbney <le...@gmail.com> on 2016/06/15 18:05:07 UTC
Fwd: June 2016 Newsletter – LDC

Hi Folks,
June 2016 Newsletter from Linguistic Data Consortium
Thanks

---------- Forwarded message ----------
From: Mcgibbney, Lewis J (398M) <Le...@jpl.nasa.gov>
Date: Wed, Jun 15, 2016 at 9:21 AM
Subject: Fwd: June 2016 Newsletter – LDC
To: "lewis.mcgibbney@gmail.com" <le...@gmail.com>




Sent from my iPhone

Begin forwarded message:

*From:* Linguistic Data Consortium <ld...@ldc.upenn.edu>
*Date:* June 15, 2016 at 8:46:25 AM PDT
*To:* <ld...@ldc.upenn.edu>
*Subject:* *June 2016 Newsletter – LDC*


*In this newsletter:*

*Commercial use and LDC data*

*New publications:*

Chinese Treebank 9.0

CHM150

GALE Phase 4 Arabic Weblog Parallel Sentences







*Commercial use and LDC data*

For-profit organizations are reminded that an LDC membership is a
prerequisite for obtaining a commercial license to almost all LDC
databases. Non-member organizations, including non-member for-profit
organizations, cannot use LDC data to develop or test products for
commercialization, nor can they use LDC data in any commercial product or
for any commercial purpose. LDC data users should consult corpus-specific
license agreements for limitations on the use of certain corpora. Visit our
Licensing <https://www.ldc.upenn.edu/data-management/using/licensing> page
for more information.





New Corpora



(1) Chinese Treebank 9.0 <https://catalog.ldc.upenn.edu/LDC2016T13>
consists of approximately two million words of annotated and parsed text
from Chinese newswire, government documents, magazine articles, various
broadcast news and broadcast conversation programs, web newsgroups,
weblogs, discussion forums, chat messages and transcribed conversational
telephone speech. This new data set in the Chinese Treebank series adds
more annotated web data and two new genres – chat messages and transcribed
telephone speech.



There are 3,726 text files in this release, containing 132,076 sentences,
2,084,387 words, and 3,247,331 characters (hanzi or foreign). The data is
provided in UTF-8 encoding, and the annotation uses Penn Treebank-style
labeled brackets. The data is provided in four formats: raw text,
word-segmented, POS-tagged, and syntactically bracketed. All files
were automatically verified and manually checked.
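
As a rough illustration of what Penn Treebank-style labeled brackets look
like and how they might be read, here is a minimal Python sketch. The sample
sentence is made up for illustration and is not taken from the corpus, the
function names parse_brackets and tagged_words are hypothetical, and real
Chinese Treebank files may wrap each tree in additional markup that this
sketch does not handle.

```python
import re


def parse_brackets(text):
    """Parse one bracketed tree into nested lists: [label, child, ...]."""
    # Tokens are "(", ")", or any run of characters that is neither
    # whitespace nor a parenthesis (labels and words, including hanzi).
    tokens = re.findall(r"\(|\)|[^\s()]+", text)
    pos = 0

    def read():
        nonlocal pos
        tok = tokens[pos]
        pos += 1
        if tok != "(":
            return tok  # a label or a word
        node = []
        while tokens[pos] != ")":
            node.append(read())
        pos += 1  # consume the closing ")"
        return node

    return read()


def tagged_words(tree):
    """Yield (word, POS tag) pairs from the leaves, left to right."""
    # A preterminal looks like [tag, word] with both entries strings.
    if len(tree) == 2 and isinstance(tree[0], str) and isinstance(tree[1], str):
        yield tree[1], tree[0]
        return
    for child in tree[1:]:
        if isinstance(child, list):
            yield from tagged_words(child)


# Hypothetical example tree (not from the corpus).
sample = "(IP (NP (PN 我)) (VP (VV 喜欢) (NP (NN 语言学))))"
tree = parse_brackets(sample)
pairs = list(tagged_words(tree))
```

The same nested structure yields the word-segmented and POS-tagged views by
reading off the leaves, which is why a single bracketed file can in principle
regenerate the other formats.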



Chinese Treebank 9.0 is distributed via web download.



2016 Subscription Members will automatically receive two copies of this
corpus. 2016 Standard Members may request a copy as part of their 16 free
membership corpora. Non-members may license this data for US $300.



*



(2) CHM150 <https://catalog.ldc.upenn.edu/LDC2016S04> (Corpus Hecho en
México 150) was developed by the Speech Processing Laboratory
<http://odin.fi-b.unam.mx/profesores/abelherrera/> of the Faculty of
Engineering at the National Autonomous University of Mexico
<http://www.unam.mx/> (UNAM) and consists of approximately 1.63 hours of
Mexican Spanish speech, associated transcripts, and speaker metadata. The
goal of this work was to support spoken term detection and forensic speaker
identification.



This corpus consists of Mexican Spanish microphone speech from 75 male
speakers and 75 female speakers recorded in a quiet office environment. Speakers
could answer pre-selected open questions or describe a particular painting
shown to them on a computer monitor. Speaker metadata in this release
includes age, gender, place of birth, place of residence and parents'
nationalities.



CHM150 is distributed via web download.



2016 Subscription Members will automatically receive two copies of this
corpus. 2016 Standard Members may request a copy as part of their 16 free
membership corpora. This data is being made available at no cost to
non-member organizations under a research license
<https://www.ldc.upenn.edu/sites/www.ldc.upenn.edu/files/chm150-user-agreement.pdf>.




*



(3) GALE Phase 4 Arabic Weblog Parallel Sentences
<https://catalog.ldc.upenn.edu/LDC2016T14> was developed by LDC. Along with
other corpora, the parallel text in this release comprised training data
for Phase 4 of the DARPA GALE (Global Autonomous Language Exploitation)
Program. This corpus contains Modern Standard Arabic source text and
corresponding English translations, selected from newsgroup and weblog data
collected by LDC and translated by LDC or under its direction.



The data includes 1,067 source-translation document pairs, comprising
68,346 words (Arabic source) of translated data.



Sentences were selected for translation in two steps. First, files were
chosen using sentence selection scripts provided by GALE program
participants SRI International <http://www.sri.com/> and IBM
<http://www.ibm.com/us/en/>. Second, the output was manually reviewed by LDC
staff to eliminate problematic sentences. Selected files were reformatted
into a human-readable translation format and assigned to translation
vendors. Translators followed LDC's Arabic to English translation
guidelines and were provided with the full source documents containing the
target sentences for their reference. Bilingual LDC staff performed quality
control procedures on the completed translations.



GALE Phase 4 Arabic Weblog Parallel Sentences is distributed via web
download.



2016 Subscription Members will automatically receive two copies of this
corpus. 2016 Standard Members may request a copy as part of their 16 free
membership corpora. Non-members may license this data for US $1,750.



-- 
Membership Office
Linguistic Data Consortium
University of Pennsylvania
3600 Market St. Suite 810
Philadelphia, PA 19104
Tel: 215-573-1275
Email: ldc@ldc.upenn.edu
Fax: 215-573-2175




-- 
*Lewis*