Google’s C4 dataset contains 15.7 million sites.

Does Google’s C4 dataset include your website and content? You can find out with a new search tool from the Washington Post.

Why we care. This dataset contains the kinds of websites, content creators, and news outlets that generative AI may harm or even eliminate, including blogs, marketing sites, and news publishers.

Search. You can find the new search tool in the Post article “Inside the secret list of websites that make AI like ChatGPT sound smart.” The list ranks websites by the number of “tokens” from each site that appear in the data set. The story explained that tokens are small pieces of text used for processing disorganized data, usually a word or phrase.

Examples. Search Engine Land is one site that appears in the dataset.

So does Marketing Land Events (a brand that existed in 2019 but no longer does), which hosted the SMX and MarTech conferences.

Third Door Media, the parent company of Search Engine Land, is listed as well.

Barry Schwartz’s Search Engine Roundtable is there, too.

This is only part of the data. C4 (which stands for Colossal Clean Crawled Corpus) is used by Google Bard and other large language models. These models also use Wikipedia and Reddit, among other sources.

Speaking of Reddit. Reddit wants to be paid by companies that use its data to train AI models, according to the New York Times. Reddit has updated its API terms. Some companies, such as Google and OpenAI, will be charged for access. Steve Huffman, CEO and co-founder of Reddit, said:

Reddit itself didn’t create any of this value. Its users did.

The post Search for 15.7 million sites in Google’s C4 dataset first appeared on Search Engine Land.
