Ready or not, gen AI is here, and it’s in your hands. ChatGPT took the world by storm and remains popular despite competition from heavy hitters such as Google, Samsung and Meta. AI tools are being built into web browsers including Microsoft Bing, phones such as the Galaxy S24 and even cars, including the VW Golf. If there’s a task you want done, chances are there’s an AI assistant to help.
And now there are CNET reviews to help you decide which AI to use and what to expect. Our editors are testing AI chatbots, image generators and other AI hands-on to figure out their strengths and weaknesses. Our goal: to help guide you as you decide which will work best for you.
To perform the testing, we use the generative AI chatbots, photo generators and other AI tools we’re reviewing, just as we use a phone to review it. But the reviews themselves, like CNET’s other hands-on reviews, are written by our human team of in-house experts. For more, check out CNET’s AI policy.
Current reviews of AI products and services on CNET are broken down into the following categories. As our reviews evolve, we plan to add more.
No matter the tool or service, our reviews try to answer the same basic question: How good is it relative to the competition and which purposes does it serve best? In any CNET review, we’ll report key information that you’ll need to know, including:
We score each AI we review on a scale of 1 to 10, with 10 being the best. We consider factors such as accuracy, creativity of responses, number of hallucinations, and response speed. This rating is based on our reviewer’s first-hand experience using the test methodology outlined below.
As “everything engines,” gen AI tools like ChatGPT don’t lend themselves to many quantitative, labs-based tests, like battery life for phones or brightness for TVs. Instead, our evaluations are largely based on hands-on experience: during testing, our reviewers pose questions and assign tasks to the AI, then judge both the responses and the process.
Our evaluations aim to answer the following questions:
Beyond getting a general sense of what it’s like to use the AI, we also test specific tasks and use cases. To account for accuracy or hallucinations, we spot-check facts and report any erroneous information we find. Our reviewers make sure to test the chatbots on topics that they personally know well. For example, one reviewer asked ChatGPT to suggest a recipe for chicken tikka masala — a dish he knows well from cooking and eating it over many years.
Test prompts may include, but aren’t limited to:
In the reviews, we report on specific prompts (what we input) and responses (what the AI outputs), but we also want to keep our tests relatively open-ended, evolve our methodology over time and prevent the AI from “learning” how we test it. For that reason we’re not listing specific prompts here.
Generative AI services can also take your written descriptions and use them to create images. As with chatbots, our reviews of these services are largely subjective and based on the reviewer’s hands-on experience. Our evaluations of AI text-to-image generators aim to answer the following questions:
As with our testing of chatbots, test prompts will be varied but might include things like:
For AI tools that are neither chatbots nor text-to-image generators, our testing will be tailored to suit the tool. We’ll strive to determine how good the AI is at performing the task it promises to assist with, and to call out how beneficial, or not, the AI is in helping complete that task.
A review of Otter AI, an audio transcription and note-taking service, focuses on how well features like gen AI chat and automatic meeting summaries work compared with conventional methods. Our review of Grammarly, a service designed to assist writers, evaluates how well it responds to prompts and whether its AI-suggested revisions, such as “shorten it” and “improve it,” actually help the writing process.
We can’t test everything, and we don’t try to. There are plenty of areas that lie outside the scope of our current AI tests. They include:
Resistance to abuse: We don’t perform tests designed to cause AIs to deliver illegal, harmful, abusive, discriminatory, biased or copyrighted information.
Current events: Because AIs are trained on large sets of data that aren’t necessarily recent, we don’t quiz all chatbots and other assistants on recent “in the news” events.
Outcomes for AI recommendations: As part of our review process, we don’t commit to evaluating all of an AI’s responses and suggestions in depth. We don’t cook and taste-test recipes, for example, nor do we take the trips suggested in an itinerary.
Multiple answers: In general, we rely on the first reply provided by an AI because that’s how most people use these tools. In some instances, we might run the same query multiple times to compare the results, but that’s not the norm.
Generative AI is still a new consumer product, so think of these reviews as version 1.0. In the last year, AI chatbots and other tools have evolved significantly, more options have entered the market and numerous models, sets of training data and AI-driven devices have debuted. We expect that evolution to continue and our AI reviews to grow and expand as well. As AI becomes more familiar and ingrained in our lives, humans at CNET will explain, review and rate these tools for other humans’ benefit.