
Category: Benchmarks in general

Check out the new XPRTs around the world infographic!

If you’ve followed the XPRT blog for a while, you know that we occasionally update the community on some of the reach metrics we track by publishing a new version of the “XPRTs around the world” infographic. The metrics we track include completed test runs, benchmark downloads, and mentions of the XPRTs in advertisements, articles, and tech reviews. Gathering this information gives us insight into how many people are using the XPRT tools, and updating the infographic helps readers and community members see the impact the XPRTs are having around the world.

This week, we published a new infographic, which includes the following highlights:

  • The XPRTs have been mentioned more than 13,900 times on over 4,000 unique sites.
  • Those mentions include more than 10,300 articles and reviews.
  • Those mentions originated in over 629 cities located in 67 countries on six continents. New cities of note include Bangalore, India; Donetsk, Ukraine; Lima, Peru; and Santiago, Chile.
  • The BenchmarkXPRT Development Community now includes 230 members from 76 companies and organizations around the world.

In addition to the growth in web mentions and community members, the XPRTs have now delivered more than 520,000 real-world results! We’re grateful for everyone who’s helped us get this far. Your participation is vital to our achieving our goal: to provide benchmark tools that are reliable, relevant, and easy to use.


Apples and pears vs. oranges and bananas

When people talk about comparing disparate things, they often say that you’re comparing apples and oranges. However, sometimes that expression doesn’t begin to describe the situation.

Recently, Justin wrote about using CrXPRT on systems running Neverware CloudReady OS. In that post, he noted that we couldn’t guarantee that using CrXPRT on CloudReady and Chrome OS systems would be a fair comparison. Not surprisingly, that prompted the question “Why not?”

Here’s the thing: It’s a fair comparison of those software stacks running on those hardware configurations. If everyone accepted that and stopped there, all would be good. However, almost inevitably, people will read more into the scores than is appropriate.

In such a comparison, we’re changing multiple variables at once. We’ve written before about the effect of the software stack on performance. CloudReady and Chrome OS are two different implementations of the Chromium OS, and it’s possible that one is more efficient than the other. If so, that would affect CrXPRT scores. At the same time, the raw performance of the two hardware configurations under test could also differ to a certain degree, which would also affect CrXPRT scores.

Here’s a metaphor: If you measure the effective force at the end of two levers and find a difference, to what do you attribute that difference? If you know the levers are the same length, you can attribute the difference to the amount of applied force. If you know the applied force is identical, you can attribute the difference to the length of the levers. If you lack both of those data points, you can’t know whether the difference is due to the length, the force, or a combination of the two.
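To make the metaphor concrete, here is a minimal sketch. It assumes a toy model in which the measured effect is simply the applied force multiplied by the lever length. The function name and all of the numbers are hypothetical and chosen only for illustration.

```python
def effect_at_end(applied_force, lever_length):
    # Assumed toy model: the measured effect grows with both inputs.
    return applied_force * lever_length

lever_a = effect_at_end(applied_force=10, lever_length=3)

# Three hypothetical explanations for a second lever that measures twice as much.
explanations = {
    "more force, same length": (20, 3),
    "same force, longer lever": (10, 6),
    "a bit of both": (15, 4),
}

for label, (force, length) in explanations.items():
    lever_b = effect_at_end(force, length)
    print(f"{label}: lever B measures {lever_b / lever_a:.1f}x lever A")
```

All three explanations produce the same measured difference, so the measurement alone can’t tell you which one is true.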

With a benchmark, you can run multiple experiments designed to isolate variables and use the results from those experiments to look for trends. For example, we could install both CloudReady OS and Chrome OS on the same Intel-based Chromebook and compare the CrXPRT results. Because that removes hardware differences as a variable, such an experiment would offer some insight into how the two implementations compare. However, because differences in hardware can affect the performance of a given piece of software, this single data point would be of limited value. We could repeat the experiment on a variety of other Intel-based Chromebooks, and other patterns might emerge. If one of the implementations consistently scored higher, that would suggest that it was more efficient than the other, but the result would still not be conclusive.
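The kind of same-hardware comparison described above could be tallied with a sketch like the following. The device names and scores are made-up placeholders, not real CrXPRT results, and the helper compare_stacks is hypothetical rather than part of any XPRT tool.

```python
from statistics import geometric_mean

def compare_stacks(results):
    """results maps a device name to (score_stack_a, score_stack_b),
    where both scores were measured on that same device."""
    ratios = {device: a / b for device, (a, b) in results.items()}
    a_wins = sum(1 for r in ratios.values() if r > 1)
    b_wins = sum(1 for r in ratios.values() if r < 1)
    if a_wins == len(ratios):
        trend = "A"      # stack A scored higher on every device
    elif b_wins == len(ratios):
        trend = "B"      # stack B scored higher on every device
    else:
        trend = None     # mixed results: no consistent pattern
    return ratios, geometric_mean(list(ratios.values())), trend

# Placeholder data for illustration only.
placeholder_results = {
    "device_1": (110, 100),
    "device_2": (95, 90),
    "device_3": (130, 125),
}

ratios, overall, trend = compare_stacks(placeholder_results)
print(ratios, overall, trend)
```

Even when the trend is consistent across every device tested, it only suggests that one stack is more efficient. A different set of devices could still show a different pattern.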

I hope this gives you some idea about why we are cautious about drawing conclusions when comparing results from different sets of hardware running different software stacks.


Digging deeper

From time to time, we like to revisit the fundamentals of the XPRT approach to benchmark development. Today, we’re discussing the need for testers and benchmark developers to consider the multiple factors that influence benchmark results. For every device we test, all of its hardware and software components have the potential to affect performance, and changing the configuration of those components can significantly change results.

For example, we frequently see significant performance differences between different browsers on the same system. In our recent recap of the XPRT Weekly Tech Spotlight’s first year, we highlighted an example of how testing the same device with the same benchmark can produce different results, depending on the software stack under test. In that instance, the Alienware Steam Machine entry included a WebXPRT 2015 score for each of the two browsers that consumers were likely to use. The first score (356) represented the SteamOS browser app in the SteamOS environment, and the second (441) represented the Iceweasel browser (a Firefox variant) in the Linux-based desktop environment. Including only the first score would have given readers an incomplete picture of the Steam Machine’s web-browsing capabilities, so we thought it was important to include both.
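To put the size of that gap in perspective, here is a quick calculation using the two scores quoted above. The variable names are ours, chosen for illustration.

```python
steamos_browser_score = 356
iceweasel_score = 441

# Relative difference, using the SteamOS browser score as the baseline.
relative_difference = (iceweasel_score - steamos_browser_score) / steamos_browser_score
print(f"{relative_difference:.1%}")
```

The two scores differ by roughly 24 percent on the same device.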

We also see performance differences between different versions of the same browser, a fact especially relevant to those who use frequently updated browsers, such as Chrome. Even benchmarks that measure the same general area of performance, for example, web browsing, are usually testing very different things.

OS updates can also have an impact on performance. Consumers might base a purchase on performance or battery life scores and end up with a device that behaves much differently when updated to a new version of Android or iOS, for example.

Other important factors in the software stack include pre-installed software, commonly referred to as bloatware, and the proliferation of apps that sap performance and battery life.

This is a much larger topic than we can cover in the blog. Let the examples we’ve mentioned remind you to think critically about, and dig deeper into, benchmark results. If we see published XPRT scores that differ significantly from our own results, our first question is always “What’s different between the two devices?” Most of the time, the answer becomes clear as we compare hardware and software from top to bottom.


Another great year

A lot of great stuff happened this year! In addition to releasing new versions of the benchmarks, videos, infographics, and white papers, we released our first-ever German UI and sponsored our first student partnership at North Carolina State University. We visited three continents to promote the XPRTs and saw XPRT results published in six of them (we’re still working on Antarctica).

Perhaps most exciting, we reached our fifth anniversary. Users have downloaded or run the XPRTs over 100,000 times.

As great as the year has been, we are sprinting into 2016. Though I can’t talk about them yet, there are some big pieces of news coming soon. Even sooner, I will be at CES next week. If you would like to talk about the XPRTs or the future of benchmarking, let me know and we’ll find a time to meet.

Whatever your holiday traditions are, I hope you are having a great holiday season. Here’s wishing you all the best in 2016!


What makes a good benchmark?

As we discussed recently, we’re working on the design document for the next version of MobileXPRT, and we’re really interested in any ideas you may have. However, we haven’t talked much about what makes for a good benchmark test.

The things we measure need to be quantifiable. A reviewer can talk about the realism of game play, or the innovative look of a game, and those are valid observations. However, it is difficult to convert those kinds of subjective impressions to numbers.

The things we measure must also be repeatable. For example, the response time for an online service may depend on the time of day, number of people using the service at the time, network load, and other factors that change over time. You can measure the responsiveness of such services, but doing so requires repeating the test enough times under enough different circumstances to get a representative sample.
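One simple way to check repeatability is to run the same test several times and look at how much the results vary. The sketch below assumes you already have a list of results from repeated runs. The numbers are placeholders, and summarize_runs is a hypothetical helper, not part of any XPRT tool.

```python
from statistics import mean, stdev

def summarize_runs(samples):
    """Return the mean and the coefficient of variation
    (sample standard deviation divided by the mean)."""
    avg = mean(samples)
    variation = stdev(samples) / avg
    return avg, variation

# Placeholder results from five hypothetical runs of the same test.
runs = [102.0, 98.0, 100.0, 101.0, 99.0]

average, variation = summarize_runs(runs)
print(f"mean={average:.1f}, variation={variation:.1%}")
```

A small variation suggests that the measurement is repeatable under those conditions. A large one is a sign that you need more runs, more controlled conditions, or both.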

The things we can measure go beyond the speed of the device to include battery life, compatibility with standards, and even fidelity or quality, as with photos or video. BatteryXPRT and CrXPRT test battery life, while the HTML5 tests in WebXPRT are among those that test compatibility. We are currently looking into quality metrics for possible future tools.

I hope this has given you some ideas. If so, let us know!


Staying out in the open

Back in July, Anandtech publicized some research about possible benchmark optimizations in the Galaxy S4. Yesterday, Anandtech published a much more comprehensive article, “The State of Cheating in Android Benchmarks.” It’s well worth the read.

Anandtech doesn’t accuse any of the benchmarks of being biased; it’s the OEMs who are supposedly doing the optimizations. I will note that none of the XPRT benchmarks are among the whitelisted CPU tests. That being said, I imagine that everyone in the benchmark game is concerned about any implication that their benchmark could be biased.

When I was a kid, my parents taught me that it’s a lot harder to cheat in the open. This is one of the reasons we believe so strongly in the community model for software development. The source code is available to anyone who joins the community. It’s impossible to hide any biases. At the same time, it allows us to control derivative works. That’s necessary to avoid biased versions of the benchmarks being published. We think the community model strikes the right balance.

However, any time there is a system, someone will try to game it. We’ll always be on the lookout for optimizations that happen outside the benchmarks.

