BenchmarkXPRT Blog

Category: Benchmarking

The real art of benchmarking

In my last blog entry, I noted the challenge of balancing real-world and real-science considerations when benchmarking Web page loads. That issue, however, is inherent in all benchmarking. Real world argues for benchmarks that emphasize what users and computers actually do. For servers, that might mean something like executing real database transactions against a real database from real client computers. For tablets, that might mean real fingers selecting and displaying real photos. There are obvious issues with both—setting up such a real database environment is difficult, and who wants to be the owner of the real fingers driving the tablet? It is also difficult to understand what causes performance differences—is it the network, the processors, or the disks in the server? There are also more subtle challenges, such as how to make the tests work on servers or tablets other than the original ones. Worse, such real-world environments are subject to all sorts of repeatability and reproducibility issues.

Real science, on the other hand, argues for benchmarks that emphasize repeatable and reproducible results. Further, real science wants benchmarks that isolate the causes of performance differences. For servers, that might mean a suite of tests targeting processor speed, network bandwidth, and disk transfer rate. For tablets, that might mean tests targeting processor speed, touch responsiveness, and graphics-rendering rate. The problem is that it is not always obvious what combination of such factors actually delivers better database server performance or a better tablet experience. Worse, it is possible that testing different databases and transactions would reveal very different characteristics that these tests don’t measure at all.

The good news is that real world and real science are not always in opposition. The bad news is that a third factor exacerbates the situation—benchmarks take real time (and of course real money) to develop. That means benchmark developers need to make compromises if they want to bring tests to market before the real world they are attempting to measure has changed. And, they need to avoid some of the most difficult technical hurdles. Like most things, that means trying to find the right balance between real world and real science.

Unfortunately, there is no formula for determining that balance. Instead, it really is something of an art. I’d love to hear from you about benchmarks (current or from the past) that you think strike this balance well and show the real art of benchmarking.

Bill

Web benchmarking challenges

I think that an important part of any touch benchmark will be a Web component. After all, the always (or almost always) connected nature of these devices is a critical part of their identities. I think such a Web benchmark needs to include a measurement of page load speed (how long it takes to download and render a page).

Creating such a test seems straightforward. Pick a set of sites, such as the five or ten most popular, and then time how long the home page of each takes to load. The problem, however, is that those pages are constantly changing. Every few months, most popular sites do a major redesign. That would obviously affect the results for a test and make it difficult to compare the results of a current test to one from a few months back. An even bigger problem is that the page will be different for one user than for another, because sites typically know things like where you are and what your computer is and adjust the page to match those characteristics. And the ads and the content of the site are constantly changing and updating. Even hitting Refresh on a page can give you a different page.

Given all of those problems, how is it possible to test page loads? One way is to create pages that are similar to those of leading Web sites in terms of things like size, amount of graphics, and dynamic elements. This allows the tests to be consistent over time and across different devices and locations. (Or, at least, as consistent as the variability of the Internet from moment to moment allows.) The problem with this approach, however, is that the pages will age out as the real Web sites update themselves, and they will never be the real sites.
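
To make that second approach concrete, here is a minimal sketch, in Python, of what a page-load timing loop against a fixed, locally hosted test page might look like. This is purely illustrative, not XPRT code: the URL points to a hypothetical mirror page, and the sketch measures only download time, not the rendering a full browser-based test would also have to capture.

import time
import urllib.request

# Hypothetical local mirror built to resemble a popular site's home page.
TEST_PAGE = "http://localhost:8000/mock_news_site/index.html"
RUNS = 5

def fetch_once(url):
    # Return the seconds taken to download the page body (rendering is not measured).
    start = time.perf_counter()
    with urllib.request.urlopen(url) as response:
        response.read()
    return time.perf_counter() - start

times = sorted(fetch_once(TEST_PAGE) for _ in range(RUNS))
print("Median load time over %d runs: %.3f seconds" % (RUNS, times[RUNS // 2]))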

Such are the tradeoffs in benchmarking. The key is how to balance real science with real world considerations. What do you think? Which approach is the better balance of real science and real world?

Bill

An open, top-down process

We’ve been hard at work putting together the RFC for HDXPRT 2012. As a group of us sat around a table discussing what we’d like to see in the benchmark, it became clear to me how different this development process is from those of other benchmarks I’ve had a hand in creating (3D WinBench, Winstone, WebBench, NetBench, and many others). The big difference is not in the design or the coding or even the final product.

The difference is the process.

A sentiment that came up frequently in our meeting was “Sure, but we need to see what the community thinks.” That indicates a very different process than I am used to. Different from what companies developing benchmarks do and different from what benchmark committees do. What it represents, in a word, is openness. We want to include the Development Community in every step of the process, and we want to figure out how to make the process even more open over time. For example, we discussed ideas as radical as videoing our brainstorming sessions.

Another part of the process I think is important is that we are trying to do things top-down. Rather than deciding which applications should be in the benchmark, we want to start by asking how people really use high-definition media. What do people typically do with video? What do they do to create it and how do they watch it? Similarly, what do people do with images and audio?

At least as importantly, we don’t want to include only our opinions and research on these questions; we want to pick your brains and get your input. From there, we will work on the workflows, the applications, and the RFC. Ultimately, that will lead to the scripts themselves. With your input and help, of course!

Please let us know any ideas you have for how to make the process even more open. And tell us what you think about this top-down approach. We’re excited and hope you are, too!

Bill

Getting to the source

Many of the earliest benchmarks came in source code form. Dhrystone and many others relied on the compiler for optimization. In fact, some compilers even recognized the code and basically optimized it to a few lines of code that did nothing but return the result! Even some modern benchmarks, such as SPEC CPU and LINPACK, come in source code form.

The source code to application benchmarks, however, has not typically been available. Two of the leading benchmarks of the last twenty years, Winstone and SYSmark, were never available in source code form. The makers of those tools had good reasons for keeping the code private; we know, because we led the creation of Winstone. Keeping code private protects your intellectual investment, can make it easier to hit development schedules, and provides many other advantages.

It can also, however, lead some people to charge that the reason you’re not showing the source code is that it is in some way biased. In benchmarks, as in so many areas, transparency is the best way to allay such concerns.

Which leads us to today’s big announcement.

We want HDXPRT to be as open as possible, so we’re bucking the normal practice for application-based benchmarks and planning to make the HDXPRT 2011 source code available to the HDXPRT Development Community.

The code will include both the benchmark harness and the scripts that drive the applications. You’ll be able to study everything about the benchmark. You’ll also be able to contribute new code more easily, which is exactly what we hope you’ll do. We want you not only to be completely comfortable with the benchmark but also to contribute to future versions of it.

There will, of course, be some ground rules. We are making the code available only to the HDXPRT Development Community. (If you’re not already a member, joining is cheap and easy: just go here.) Because we want to limit the code to the community, members who want access will have to agree to a license agreement that prevents them from releasing the code to the public.

We don’t have an exact schedule in place yet, but over the next week or two, we should have all the necessary things in place to make the source code available.

When you’ve had a chance to look at it, please let us know what improvements you would like to see in HDXPRT 2012. We’ll discuss that version, and how you can help, in the coming weeks.

Bill

Keeping score

One question I received as a result of the last two blog entries on benchmark anatomy was whether I was going to talk about the results or scores.  That topic seemed like a natural follow up.

All benchmarks need to provide some sort of metric to let you know how well the system under test (SUT) did.  I think the best metrics are the easily understood ones.  These metrics have units like time or watts.  The problem with some of these units is that sometimes smaller can be better.  For example, less time to complete a task is better.  (Of course, more time before the battery runs down is better!)  People generally see bigger bars in a chart as better.

Some tests, however, report results in units that are not so understandable. Units like instructions per second, requests per second, or frames per second are tougher to relate to. Sure, more of any of them is better, but it is not as easy to understand what that means in the real world.

There is a solution to both the problem of smaller is better and non-intuitive units—normalization.  With normalization, you take the result of the SUT and divide it by that of a defined base or calibration system.  The result is a unit-less number.  So, if the base system can do 100 blips a second and the SUT can do 143 blips a second, the SUT would get 143 / 100 or a score of 1.43.  The units cancel out in the math and what is left is a score.  For appearance or convenience, the score may be multiplied by some number like 10 or 100 to make the SUT’s score 14.3 or 143.
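
As a quick illustration of that arithmetic (the function names and numbers here are examples only, not anything from a shipping benchmark), normalization works the same way whether the raw result is a rate or a time:

def normalized_rate_score(sut_rate, base_rate, scale=100):
    # For "bigger is better" results, such as blips per second.
    return (sut_rate / base_rate) * scale

def normalized_time_score(sut_seconds, base_seconds, scale=100):
    # For "smaller is better" results: invert so that a bigger score is better.
    return (base_seconds / sut_seconds) * scale

print(normalized_rate_score(143, 100))  # 143.0, the example from the text
print(normalized_time_score(50, 100))   # 200.0: half the time earns twice the score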

The nice thing about such scores is that it is easy to see how much faster one system is than another.  If you are measuring normalized execution time, a score of 286 means a system is twice as fast as one of 143.  As a bonus, bigger numbers are better.  An added benefit is that it is much easier to combine multiple normalized results into a single score.  These benefits are the reason that many modern benchmarks use normalized scores.
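
One common way to roll several normalized results into a single score is the geometric mean (more on the different kinds of means below). Treat this sketch as a general illustration rather than a description of how HDXPRT computes its overall score:

import math

def geometric_mean(scores):
    # Multiply the scores together and take the nth root, computed here in log space.
    return math.exp(sum(math.log(s) for s in scores) / len(scores))

print(round(geometric_mean([143.0, 200.0, 95.0]), 1))  # 139.5, a single combined score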

There is another kind of score, which is more of a rating.  These scores, such as a number of stars or thumbs up, are good for relative ratings.  However, they are not necessarily linear.  Four thumbs up is better than two, but is not necessarily twice as good.

Next week, we’ll look closer at the results HDXPRT 2011 provides and maybe even venture into the difference between arithmetic, geometric, and harmonic means!  (I know I can’t wait.)

Bill

Anatomy of a benchmark, part II

As we discussed last week, benchmarks (including HDXPRT 2011) are made up of a set of common major components. Last week’s components included the Installer, User Interface (UI), and Results Viewer.  This week, we’ll look more at the guts of a benchmark—the parts that actually do the performance testing.

Once the UI gets the necessary commands and parameters from the user, the Test Harness takes over.  This part is the logic that runs the individual Tests or Workloads using the parameters you specified.  For application-based benchmarks, the harness is particularly critical, because it has to deal with running real applications.  (Simpler benchmarks may mix the harness and test code in a single program.)

The next component consists of the Tests or Workloads themselves.  Some folks use those terms interchangeably, but I try to avoid that practice.  I tend to think of tests as specially crafted code designed to gauge some aspect of a system’s performance, while workloads consist of a set of actions that an application must take as well as the necessary data for those actions.  In HDXPRT 2011, each workload is a set of data (such as photos) and actions (e.g., manipulations of those photos) that an application (e.g., Photoshop Elements) performs.  Application-based benchmarks, such as HDXPRT 2011, typically use some other program or technology to pass commands to the applications.  HDXPRT uses a combination of AutoIT and C code to drive the applications.

When the Harness finishes running the tests or workloads, it collects the results. It then either passes those results to the Results Viewer or writes them to a file for viewing in Excel or some other program.
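
To make that flow concrete, here is a deliberately oversimplified harness sketch in Python. It is not the actual HDXPRT harness, which drives real applications through AutoIT and C; the workload name, data, and action below are hypothetical stand-ins.

import csv
import time

def resize_photos(photos):
    # Stand-in for driving a real application against the workload's data.
    return [name.lower() for name in photos]

# Each workload pairs its data with the actions to perform on that data.
workloads = {
    "photo_edit": (["IMG_001.JPG", "IMG_002.JPG"], resize_photos),
}

results = []
for name, (data, action) in workloads.items():
    start = time.perf_counter()
    action(data)                                   # run the workload
    results.append((name, time.perf_counter() - start))

# Hand the results off for viewing, here as a simple CSV file.
with open("results.csv", "w", newline="") as out:
    writer = csv.writer(out)
    writer.writerow(("workload", "seconds"))
    writer.writerows(results)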

As we look to improve HDXPRT for next year, what improvements would you like to see in each of those areas?

Bill
