#llmdata hashtag - Bluesky

5 months ago

German Commons just cleared the copyright fog for massive AI datasets—think Common Pile, Hugging Face, EleutherAI and the new OpenGPT‑X/Teuken‑7B. Dive into how this opens doors for researchers and devs. #GermanCommons #llmdata #OpenGPTX

🔗 aidailypost.com/news/german-...

Hacker News Companion

@hncompanion.com

5 months ago

Practical applications of DeepSeek OCR include generating high-quality LLM training data and enhancing document accessibility. However, limitations persist, especially for non-English languages and highly specialized document types. #LLMData 6/6

Hacker News Companion

@hncompanion.com

8 months ago

The effectiveness of LLMs also hinges on training data. Languages and frameworks with extensive, well-structured codebases (e.g., Python, Ruby on Rails) and strong conventions often yield better AI-generated results. #LLMData 5/5

Erika Rosell

@erikarosell.bsky.social

1 year ago

“This discovery suggests that Meta strips [copyright information] not just for training purposes,” the filing reads, “but also to conceal its copyright infringement, because stripping copyrighted works … prevents Llama from outputting copyright information that might alert Llama users and the public to Meta’s infringement.” Source: "Mark Zuckerberg gave Meta’s Llama team the OK to train on copyrighted works, filing claims," Kyle Wiggers, TechCrunch, 10:10 AM PST · January 9, 2025.

Here's an #AICopyright #AIEthics #LLMData case to follow!

Kadrey v. Meta

techcrunch.com/2025/01/09/m...
