#CommonCorpus hashtag - Bluesky

1 month ago

#CommonCorpus goes global: @glynmoody.bsky.social champions this transparent, open #AI training set as an antidote to proprietary black boxes. Governments & publishers should fund it, as it solves copyright headaches while democratising AI development. #PublicDomain walledculture.org/common-corpu...


Carlos Solís

@csolisr.hub.azkware.net.ap.brid.gy

10 months ago

Original post on hub.azkware.net

The very first order of work would be to rely on free cultural works, consensually released, as the source of training - projects such as #CommonCorpus being a step in the right direction. Anything else is a copyright nightmare in the making, not even considering the ethical implications on […]


Carlos Solís

@csolisr.azkware.net

1 year ago

Ah, and I was about to download #PleIAs myself to test it. I don't mind the AGPL share-alike restriction; the problem is that the non-commercial-licensed data would taint the license of the output. Any plans to filter the #CommonCorpus even further to prevent these issues? @dorialexander.bsky.social


ethicalabs.ai

@ethicalabs.bsky.social

1 year ago

They Said It Couldn’t Be Done: a blog post by Pierre-Carl Langlais on Hugging Face

Pleias is a large language model trained exclusively on open data. It was developed using the Common Corpus, a dataset that addresses the need for high-quality compliant training data in AI development. huggingface.co/blog/Pclangl...

#opensourcellm #opendata #commoncorpus #llm #ai #ml


Nate Angell

@xolotl.org

2 years ago

Common Corpus - a PleIAs Collection: the largest public domain dataset for training LLMs.

Happy to see that the #CommonCorpus shared today as a "public domain" dataset for training #AI is built from only PD materials, and not also from openly licensed works (e.g., those shared with @creativecommons.bsky.social licenses, which are open but still copyrighted and not PD) huggingface.co/collections/...
