Creative Brains Marketing

Google’s C4 dataset contains 15.7 million sites.

Is your website’s content included in Google’s C4 dataset? You can find out with a new search tool from the Washington Post.

Why do we care? This dataset contains the types of websites, content creators, and news outlets that generative AI may negatively impact, or even eliminate, including blogs, marketing, and news publishers.

Search. You can find the new search tool in the Post article “Inside the secret list of websites that make AI like ChatGPT sound smart.” The list ranks sites by the number of “tokens” from each that appear in the dataset. The story explains that tokens are small pieces of text, usually a word or phrase, used for processing disorganized information.

Search Engine Land is one example of a site that appears in the dataset.

So does Marketing Land Events (a brand that no longer exists but was active in 2019), which hosted the SMX and MarTech conferences.

Third Door Media, the parent company of Search Engine Land, is also included.

Barry Schwartz’s Search Engine Roundtable appears as well.

@kevinschaul and @dataviz_szuyu did all the hard work and created this fantastic search tool for websites. Some of us have already discovered our old blogs. Hope you’ll find the rankings as fascinating as I did https://t.co/xckLl15ZaS pic.twitter.com/7Q7zmzDC6w

- Nitasha Tiku (@nitashatiku)

April 19, 2023

This is only part of the data. C4 (the Colossal Clean Crawled Corpus) is a dataset used by Google Bard and other large language models. These models also use Wikipedia and Reddit, among other sources.

Speaking of Reddit. Reddit wants to be paid by companies that use its data to train AI models, according to the New York Times. Reddit has updated its API terms. Some companies, such as Google and OpenAI, will be charged for access. Steve Huffman, CEO and co-founder of Reddit, said:

“The Reddit data corpus is very valuable. We don’t have to give away all that value for free to some of the biggest companies in the world. We have a problem with companies crawling Reddit and generating value without returning any of that value to our users. Now is the time to tighten things up.”

Reddit itself didn’t create any of this value. Its users did.

The post Search for 15.7 million sites in Google’s C4 dataset first appeared on Search Engine Land.
