Posts by Linguistic Data Consortium

More LDC data in the LORELEI series: LORELEI Somali Representative Language Pack features monolingual and parallel text, annotations, software tools and more for human language technology development to address emergent situations bit.ly/4cAwmNc

8 hours ago 1 0 0 0

MATERIAL Tagalog-English Language Pack has 100 hours of Tagalog conversational telephone speech, transcripts, English translations, annotations and queries designed to support cross language information retrieval bit.ly/4tabfHI

1 day ago 0 0 0 0

DEFT Chinese and English Light and Rich ERE Parallel Annotation: 179 Chinese-English discussion forum documents annotated for entities, relations and events, including coreference (light) and event hoppers (rich), developed by LDC for the DARPA DEFT program bit.ly/4edPl1m

4 days ago 0 0 0 0

Check out our April newsletter for LDC’s latest publications – DEFT Chinese and English Light and Rich ERE Parallel Annotation, MATERIAL Tagalog-English Language Pack and LORELEI Somali Representative Language Pack ldc-upenn.blogspot.com

5 days ago 0 0 0 0

CALLHOME Spanish Lexicon Second Edition: morphological, phonological, stress & frequency info for 45,547 Spanish words from transcripts of telephone speech between Spanish speakers and Spanish news text, with a pronunciation dictionary & G2P tools bit.ly/4sEIrX0

3 weeks ago 0 1 0 0

CALLHOME Spanish Second Edition brings original speech and transcript datasets up to date with new transcripts and revised directories, file formats and documentation bit.ly/4dfSehI

4 weeks ago 0 0 0 0

Ancient Chinese WordNet contains 55,100 records of words from the Pre-Qin period (before 221 BCE) linked to a corresponding synset in Princeton WordNet 1.6, covering 22 noun categories, 15 verb categories, and additional adjective and adverb categories bit.ly/3NybVa2

4 weeks ago 0 0 0 0

LDC’s March newsletter features the release of three new publications – Ancient Chinese WordNet, CALLHOME Spanish Second Edition and CALLHOME Spanish Lexicon Second Edition ldc-upenn.blogspot.com

1 month ago 1 0 0 0
PLC 50 logo

The 50th Penn Linguistics Conference (PLC) is Feb 28–Mar 1. PLC brings together students, faculty & researchers interested in languages & linguistics to share new work and connect with peers. We wish everyone a great and productive conference. @pennlinguistics.bsky.social tinyurl.com/verswp3z

1 month ago 0 0 0 0

More LDC data in the LORELEI series: LORELEI Russian Representative Language Pack features monolingual and parallel text, annotations, software tools and more for human language technology development to address emergent situations bit.ly/3MHDr4v

1 month ago 1 0 0 0
International Mother Language Day 21 February

Happy International #MotherLanguageDay This year’s theme celebrates youth voices on multilingual education – emphasizing that language is central to identity, learning, well-being and participation in society. Let’s celebrate every language, every voice www.unesco.org/en/days/moth...

2 months ago 0 0 0 0

KAIROS Schema Learning Background Source Data: 14K English & Spanish multimodal resources collected by LDC for a Schema Learning Corpus; schemas were used with event extraction to characterize & make predictions about real-world events in the corpus bit.ly/4tPVeYa

2 months ago 0 0 0 0

2022 NIST Language Recognition Evaluation Test and Development Sets: 222 hours of telephone speech and broadcast narrowband speech in 14 languages, plus turnkey evaluation documentation, emphasizing African languages and related English and French dialects bit.ly/4rIEJLs

2 months ago 0 0 0 0

Catch up on 2026 membership discounts, spring data scholarship awards and the release of three new publications in LDC’s February newsletter ldc-upenn.blogspot.com

2 months ago 1 0 0 0

MATERIAL Swahili-English Language Pack has 112 hours of Swahili conversational telephone speech, transcripts, English translations, annotations and queries designed to support cross language information retrieval bit.ly/49SWG3R

2 months ago 0 0 0 0

CALLHOME Japanese Lexicon Second Edition: morphological, phonological and stress information for 80,688 Japanese words from transcripts of telephone conversations between native Japanese speakers, along with a pronunciation dictionary and G2P tools bit.ly/3NlxvhC

2 months ago 0 0 0 0

CALLHOME Japanese Second Edition brings original speech and transcript datasets up to date with new transcripts and revised directories, file formats and documentation bit.ly/49kSdqz

3 months ago 0 0 0 0

LDC welcomes 2026 with its January newsletter featuring three publications and membership renewal information ldc-upenn.blogspot.com

3 months ago 0 0 0 0

LORELEI Sinhala Incident Language Pack: monolingual and parallel text, annotations, software tools and more for human language technology development in this under-resourced language bit.ly/4iVnJP1

4 months ago 0 0 0 0

2021 NIST SRE Test Set: 447 hours of Cantonese, Mandarin, and English conversational telephone speech, audio from video, and selfie image data for development and test, along with answer keys, enrollment, trial files and documentation bit.ly/4q35JV4

4 months ago 0 0 0 0

Check out LDC’s December newsletter for the latest news and publications and join us in celebrating the release of our 1000th corpus! ldc-upenn.blogspot.com

4 months ago 0 0 0 0
#18.9 Interspeech 2025 Impressions - Denise DiPersio. Meet Denise DiPersio, Associate Director at the Linguistic Data Consortium, sharing her experience with us. Host: Pascal Hecker. Post-production: Wei Xue

Check out #18.9 of ISCA-SAC’s Speech Pitch podcast to hear from LDC’s Denise DiPersio. This session was recorded during Interspeech 2025. Listen to Denise talk about LDC’s past, present and future and LDC’s involvement in Interspeech since the 2009 conference in Brighton. tinyurl.com/488rske4

4 months ago 1 0 0 0

LORELEI Ilocano Incident Language Pack: monolingual and parallel text, annotations, software tools and more for human language technology development in this under-resourced language bit.ly/43moVEw

5 months ago 1 0 0 0

AnnoDIFP CTS Audio and Transcripts: 242.52 hours of English telephone audio and transcripts from 1179 calls involving 327 participants, paired with scores from two self-reported personality assessments bit.ly/47J6JHX

5 months ago 0 0 0 0

LDC’s November newsletter has details on 2026 membership renewal, the spring data scholarship deadline and two new publications ldc-upenn.blogspot.com

5 months ago 0 0 0 0

BOLT CTS CALLFRIEND CALLHOME Egyptian Arabic Transcripts and Translations: transcripts and English translations for 116 hours of BOLT CTS telephone recordings; all speech was transcribed; 99% of the transcripts were translated bit.ly/4ockuEo

6 months ago 0 0 0 0

BOLT CTS CALLFRIEND CALLHOME Egyptian Arabic Audio: 116 hours of telephone speech from 274 conversations between native speakers; developed by LDC for the DARPA BOLT program; contains previously unexposed calls from the CF/CH collections bit.ly/42rsg4S

6 months ago 0 0 0 0

KAIROS Phase 2 Quizlet contains English and Spanish web data annotated for events, relations and arguments, a reference knowledge graph and a knowledge base; quizlets were defined tasks to explore evaluation objectives before the full program evaluation bit.ly/3WqvYYR

6 months ago 0 0 0 0

See LDC’s October newsletter for a preview of 2026 publications, fall data scholarship recipients and three new publications ldc-upenn.blogspot.com

6 months ago 0 0 0 0

More LDC data in the LORELEI series: LORELEI Hindi Representative Language Pack features monolingual and parallel text, annotations, software tools and more for human language technology development to address emergent situations bit.ly/4nCp3ar

6 months ago 1 0 0 0