Google’s Deal With Stack Overflow Is the Latest Proof That AI Giants Will Pay for Data

Last year Stack Overflow became one of the first websites to announce it would charge AI giants for access to content used to train chatbots. Now the popular Q&A service for coders has signed up its first customer—Google—in what CEO Prashanth Chandrasekar says is the start of a “meaningful” new stream of revenue.

The deal is significant because it remains unclear how broadly Google and other AI developers will pay for the content their AI projects need. Millions of books and websites have fueled the development of AI systems, but most publishers have not been compensated, and some are suing over what they allege is misuse. Many publishers, including Stack Overflow, appear threatened by ChatGPT and other generative AI products, which can answer queries that would have previously sent coders their way.

The deal will see Google’s cloud division use questions and answers from Stack Overflow about Google Cloud services to provide coding assistance and technical support through a version of Google’s Gemini chatbot. Google’s cloud computing customers will also be able to ask questions through Google Cloud’s command-line interface. “Their AI may not have all the answers, and so we have a huge ability to help complete that loop,” Chandrasekar says. “We are the biggest place where community knowledge is curated and validated.”

Gemini will summarize answers drawn from Stack Overflow in its own words but include the company’s logo, a link back to the original material, and the username of the site contributor who supplied it. The companies plan to demonstrate the system at Google Cloud Next, the search company’s annual cloud conference in April, and launch it soon after.

Chandrasekar says there are no significant restrictions on how Google Cloud can use Stack Overflow data, meaning it can be used to train large language models and other AI systems. “Where we want to stand firm on is—nonnegotiable things for us— trust, accuracy, quality, and attribution back to the sources of these AI outputs,” he says.

He declined to say how much Stack Overflow is being paid by Google for the data. “This will be a meaningful commercial offering for us in the near term, medium term, and long term,” Chandrasekar says.

Covert Scraping

Google and other AI developers have previously gathered data from Stack Overflow and other websites without much notice. As demand for generative AI technologies has surged, and the valuations of the companies developing them have rocketed, the websites supplying the foundational text have begun demanding what they view as their fair share. Fortunately for Stack Overflow, prospective customers have heeded the message, Chandrasekar says. “We’re not having to chase people,” he says.

Stack Overflow data is particularly beneficial to AI systems that generate computer code, which have proven to be popular with software engineers and a significant source of revenue for Microsoft and OpenAI.

The new Stack Overflow deal comes just a week after Google reached a licensing agreement to hoover up data from Reddit, the discussion forums operator, whose content has improved chatbots’ ability to converse. Reddit unveiled plans to start charging for data access last year, shortly before Stack Overflow did.

Stack Overflow’s fees for what it’s calling OverFlowAPI vary based on the type of data provided. Beyond its basic repository of 59 million questions and answers, the site charges more for layers of metadata such as post categories and voting history of user-submitted answers, trends about the types of questions being asked, and bespoke cuts of information, perhaps questions about a specific coding language, to help with fine-tuning. “It’s more about what level of the data they have access to,” Chandrasekar says. “It’s less about the number of times they request data.”

He says internal testing shows the value Stack Overflow data can have. When they tuned open-source language models from Meta and AI startup Mistral with Stack Overflow data, the accuracy of responses to technical questions increased by 20 percentage points, he says.

The Google deal will also test how users of the Gemini integration for Google Cloud can create new data for Stack Overflow. People who don’t get a satisfactory response from the chatbot will be able to submit their query to Stack Overflow, where, once approved by moderators, it will be available for the website’s community of users to answer. As the companies prepare for the demo in April, they are also discussing letting users submit improved answers back to Stack Overflow.

Paresh Dave