#CommonCorpus goes global: @glynmoody.bsky.social champions this transparent, open #AI training set as an antidote to proprietary black boxes. Governments & publishers should fund it, as it solves copyright headaches while democratising AI development. #PublicDomain walledculture.org/common-corpu...
The very first order of work would be to rely on free cultural works, consensually released, as the source of training data, with projects such as #CommonCorpus being a step in the right direction. Anything else is a copyright nightmare in the making, not even considering the ethical implications on […]
Ah, and I was about to download #PleIAs myself to test it. I don't mind the AGPL share-alike restriction; the problem is that the non-commercial-licensed data would taint the license of the output. Any plans to filter the #CommonCorpus even further to prevent these issues? @dorialexander.bsky.social
Pleias is a large language model trained exclusively on open data. It was developed using the Common Corpus, a dataset that addresses the need for high-quality, compliant training data in AI development. huggingface.co/blog/Pclangl...
#opensourcellm #opendata #commoncorpus #llm #ai #ml
Happy to see that the #CommonCorpus shared today as a "public domain" dataset for training #AI is built only from PD materials, & not also from openly licensed works (e.g., those shared with @creativecommons.bsky.social licenses, which are open but still copyrighted and not PD) huggingface.co/collections/...