
Category: Benchmark metrics

It’s all in the presentation

The comment period for BatteryXPRT CP2 ended on Monday. Now we are in the final sprint to release the benchmark.

The extensive testing we’ve been doing has meant that we’ve been staring at a lot of numbers. This has led us to make a change in how we present the results. As you would expect, the battery life when you’re running the test over Wi-Fi is different from the battery life when you’re running it over your cellular network. Although individual devices vary, the difference is in the vicinity of 10 percent, about the same as the difference between Airplane mode and Wi-Fi.

BatteryXPRT has always captured a device’s Wi-Fi setting as part of its disclosure information, but it did not present that setting alongside the results. Because we found it so helpful to see the Wi-Fi setting next to the numbers, we have changed the presentation of the results to recognize three modes: Airplane, Wi-Fi, and Cellular. We hope this will avoid confusion for people using BatteryXPRT.

Note that we have not changed the way the results are calculated. Results you generated during the preview are still valid. However, results from one mode should not be compared to results from another mode.

We’ve been talking a lot about BatteryXPRT, but TouchXPRT is also looking great! We’re looking forward to releasing both of them soon!

Eric

Comment on this post in the forums

Staying out in the open

Back in July, AnandTech publicized some research about possible benchmark optimizations in the Galaxy S4. Yesterday, AnandTech published a much more comprehensive article, “The State of Cheating in Android Benchmarks.” It’s well worth the read.

AnandTech doesn’t accuse any of the benchmarks of being biased; it’s the OEMs who are supposedly doing the optimizations. I will note that none of the XPRT benchmarks are among the whitelisted CPU tests. That being said, I imagine that everyone in the benchmark game is concerned about any implication that their benchmark could be biased.

When I was a kid, my parents taught me that it’s a lot harder to cheat in the open. This is one of the reasons we believe so strongly in the community model for software development. The source code is available to anyone who joins the community. It’s impossible to hide any biases. At the same time, it allows us to control derivative works. That’s necessary to avoid biased versions of the benchmarks being published. We think the community model strikes the right balance.

However, any time there is a system, someone will try to game it. We’ll always be on the lookout for optimizations that happen outside the benchmarks.

Eric

Comment on this post in the forums

Lies, damned lies, and statistics

No one knows who first said “lies, damned lies, and statistics,” but it’s easy to understand why they said it. It’s no surprise that the bestselling statistics book in history is titled How to Lie with Statistics. While the title is facetious, it is certainly true that statistics can be confusing—consider the word “average,” which can refer to the mean, median, or mode. “Mean average,” in turn, can refer to the arithmetic mean, the geometric mean, or the harmonic mean. It’s enough to make a non-statistician’s head spin.
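As a quick illustration of how much the choice of “average” matters, here is a small Python sketch (the numbers are invented purely for illustration) that computes several of those averages for the same set of values:

    import statistics

    # Invented values, chosen only to show how far the different "averages" can diverge.
    scores = [2, 2, 4, 8, 64]

    arithmetic = statistics.mean(scores)           # sum divided by count: 16.0
    geometric = statistics.geometric_mean(scores)  # nth root of the product: about 6.06
    harmonic = statistics.harmonic_mean(scores)    # reciprocal of the mean of reciprocals: about 3.60
    median = statistics.median(scores)             # middle value: 4
    mode = statistics.mode(scores)                 # most frequent value: 2

    print(arithmetic, geometric, harmonic, median, mode)

Five different “averages,” five different answers, all from the same data.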

In fact, a number of people have been confused by the confidence interval WebXPRT reports. We believe that the best way to stand behind your results is to be completely open about how you crunch the numbers. To this end, we released the white paper WebXPRT 2013 results calculation and confidence interval this past Monday.

This white paper, which does not require a background in mathematics, explains what the WebXPRT confidence interval is and how it differs from the benchmark variability we sometimes talk about. The paper also gives an overview of the statistical and mathematical techniques WebXPRT uses to translate the raw timing numbers into results.

Because sometimes the devil is in the details, we wanted to augment our overview by showing exactly how WebXPRT calculates results. The white paper is accompanied by a spreadsheet that reproduces the calculations WebXPRT uses. If you are mathematically inclined and would like to suggest improvements to the process, by all means let us know!
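For readers who want a feel for what such a calculation can look like, here is a minimal Python sketch. To be clear, this is a generic, textbook-style interval for the mean of repeated timings, not WebXPRT’s actual method; the white paper and its accompanying spreadsheet describe the real calculation.

    import statistics
    from math import sqrt

    # Hypothetical repeated timings (in milliseconds) for a single workload.
    timings_ms = [412.0, 398.5, 405.2, 417.8, 401.1, 409.6, 395.4]

    n = len(timings_ms)
    mean = statistics.mean(timings_ms)
    stdev = statistics.stdev(timings_ms)  # sample standard deviation

    # 95 percent interval using a normal approximation (z = 1.96).
    # With so few samples, a t-distribution would be more appropriate;
    # again, this is only a sketch.
    half_width = 1.96 * stdev / sqrt(n)

    print(f"mean: {mean:.1f} ms, 95% confidence interval: +/- {half_width:.1f} ms")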

Eric

Comment on this post in the forums

Keep them coming!

Questions and comments have continued to come in since last week’s webinar. Here are a few of them:

  • How long are results valid? Reviewers like us need to know that we can reuse results for a reasonable length of time. There is a tension between keeping results stable and keeping the benchmark current enough for the results to stay relevant. Historically, HDXPRT allowed at least a year between releases. Based on the feedback we’ve received, a year seems like a reasonable length of time.
  • Is HDXPRT command line operable? (asked by a community member with a scripted suite of tests) HDXPRT 2012 is not, but we will consider adding a command line interface for HDXPRT 2013. While most casual users don’t need a command line interface, it could be very valuable to those of us using HDXPRT in labs.
  • I would be hesitant to overemphasize the running time of HDXPRT. The more applications it runs, the better it can differentiate systems and the more interesting it is to those of us who run it at a professional level. If I could say “This gives a complete overview of the performance of this system,” that would actually save time. This comment was a surprise, given the amount of feedback we received saying that HDXPRT was too large. However, it gets to the heart of why we all need to be careful as we consider which applications to include in HDXPRT 2013.

If you missed the webinar, it’s available on the BenchmarkXPRT 2013 Webinars page.

We’re planning to release the HDXPRT 2013 RFC next week. We’re looking forward to your comments.

Eric

Comment on this post in the forums

TouchXPRT in the fast lane

I titled last week’s blog “Putting the TouchXPRT pedal to the metal.” The metaphor still applies. On Monday, we released TouchXPRT 2013 Community Preview 1 (CP1).  Members can download it here.

CP1 contains five scenarios based on our research and community feedback. The scenarios are Beautify Photo Album, Prepare Photos for Sharing, Convert Videos for Sharing, Export Podcast to MP3, and Create Slideshow from Photos.

Each scenario gives two types of results. There’s a rate, which allows for simple “bigger is better” comparisons. CP1 also gives the elapsed time for each scenario, which is easier to grasp intuitively. Each approach has its advantages. We’d like to get your feedback on whether you’d like us to pick one of those metrics for the final version of TouchXPRT 2013 or whether it makes more sense to include both. You’ll find a fuller description of the scenarios and the results in the TouchXPRT 2013 Community Preview 1 Design overview.
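To make the relationship between the two kinds of results concrete, here is a tiny Python sketch with invented numbers and an arbitrary unit of work; it is not TouchXPRT’s actual formula, just an illustration of how an elapsed time and a rate express the same measurement:

    # Invented example: the device finishes a scenario's work in 45.2 seconds.
    elapsed_seconds = 45.2
    work_units = 100          # e.g., 100 photos processed; purely illustrative

    rate = work_units / elapsed_seconds   # about 2.21 units per second

    print(f"elapsed time: {elapsed_seconds} s (smaller is better)")
    print(f"rate: {rate:.2f} units/s (bigger is better)")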

While you’re looking at CP1, we’re getting the source ready to release. To check out the source, you’ll need a system running Windows 8 with Visual Studio 2012 installed. We hope to release the source on Friday. Keep your eye on the TouchXPRT forums for more details.

Post your feedback to the TouchXPRT forum, or e-mail it to TouchXPRTSupport@principledtechnologies.com.  Do you want more scenarios? Different metrics? A new UI feature? Let us know! Make TouchXPRT the benchmark you want it to be.

As I explained last week, we released CP1 without any restrictions on publishing results. It seems that AnandTech was the first to take advantage of that. Read AnandTech’s Microsoft Surface Review to see TouchXPRT in action.

We are hoping that other folks take advantage of CP1’s capability to act as a cross-platform benchmark on the new class of Windows 8 devices. Come join us in the fast lane!

Bill

Comment on this post in the forums

Keeping score

One question I received as a result of the last two blog entries on benchmark anatomy was whether I was going to talk about the results or scores.  That topic seemed like a natural follow-up.

All benchmarks need to provide some sort of metric to let you know how well the system under test (SUT) did.  I think the best metrics are the easily understood ones.  These metrics have units like time or watts.  The problem with some of these units is that sometimes smaller can be better.  For example, less time to complete a task is better.  (Of course, more time before the battery runs down is better!)  People generally see bigger bars in a chart as better.

Some tests, however, give units that are not so understandable.  Units like instructions per second, requests per second, or frames per second are tougher to relate to.  Sure, more of any of these per second is better, but it is not as easy to understand what that means in the real world.

There is a solution to both the problem of smaller is better and non-intuitive units—normalization.  With normalization, you take the result of the SUT and divide it by that of a defined base or calibration system.  The result is a unit-less number.  So, if the base system can do 100 blips a second and the SUT can do 143 blips a second, the SUT would get 143 / 100 or a score of 1.43.  The units cancel out in the math and what is left is a score.  For appearance or convenience, the score may be multiplied by some number like 10 or 100 to make the SUT’s score 14.3 or 143.
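Here is the blip example above written out as a small Python sketch; the numbers are the ones from the text, and the multiplier is the cosmetic scaling factor just mentioned:

    def normalized_score(sut_result, base_result, scale=1):
        # Dividing the SUT's result by the base system's result cancels the
        # units, leaving a unit-less score; the optional scale factor only
        # makes the number easier to read.
        return (sut_result / base_result) * scale

    # Base (calibration) system: 100 blips/s.  SUT: 143 blips/s.
    print(normalized_score(143, 100))             # 1.43
    print(normalized_score(143, 100, scale=100))  # 143.0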

The nice thing about such scores is that it is easy to see how much faster one system is than another.  If you are measuring normalized execution time, a score of 286 means a system is twice as fast as one of 143.  As a bonus, bigger numbers are better.  An added benefit is that it is much easier to combine multiple normalized results into a single score.  These benefits are the reason that many modern benchmarks use normalized scores.
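As an aside, one common way to roll several normalized results into a single score is the geometric mean. That is not necessarily how any particular XPRT combines its results, but it shows why normalized, unit-less numbers are easy to work with:

    import statistics

    # Hypothetical normalized scores for four workloads on one SUT.
    workload_scores = [143, 120, 95, 160]

    # The geometric mean is a popular choice for combining normalized results
    # because a 2x improvement on any one workload moves the overall score
    # by the same proportion.
    overall = statistics.geometric_mean(workload_scores)
    print(f"overall score: {overall:.1f}")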

There is another kind of score, which is more of a rating.  These scores, such as a number of stars or thumbs up, are good for relative ratings.  However, they are not necessarily linear.  Four thumbs up is better than two, but is not necessarily twice as good.

Next week, we’ll look closer at the results HDXPRT 2011 provides and maybe even venture into the difference between arithmetic, geometric, and harmonic means!  (I know I can’t wait.)

Bill

Comment on this post in the forums
