BenchmarkXPRT Blog

Category: Benchmarking

The real art of benchmarking

In my last blog entry, I noted the challenge of balancing real-world and real-science considerations when benchmarking Web page loads. That issue, however, is inherent in all benchmarking. Real world argues for benchmarks that emphasize what users and computers actually do. For servers, that might mean something like executing real database transactions against a real database from real client computers. For tablets, that might mean real fingers selecting and displaying real photos. There are obvious issues with both—setting up such a real database environment is difficult, and who wants to be the owner of the real fingers driving the tablet? It is also difficult to understand what causes performance differences—is it the network, the processors, or the disks in the server? There are also more subtle challenges, such as how to make the tests work on servers or tablets other than the original ones. Worse, such real-world environments are subject to all sorts of repeatability and reproducibility issues.

Real science, on the other hand, argues for benchmarks that emphasize repeatable and reproducible results. Further, real science wants benchmarks that isolate the causes of performance differences. For servers, that might mean a suite of tests targeting processor speed, network bandwidth, and disk transfer rate. For tablets, that might mean tests targeting processor speed, touch responsiveness, and graphics-rendering rate. The problem is that it is not always obvious what combination of such factors actually delivers better database server performance or a better tablet experience. Worse, it is possible that testing different databases and transactions would reveal very different characteristics that these tests don’t measure at all.

The good news is that real world and real science are not always in opposition. The bad news is that a third factor exacerbates the situation—benchmarks take real time (and of course real money) to develop. That means benchmark developers need to make compromises if they want to bring tests to market before the real world they are attempting to measure has changed. And, they need to avoid some of the most difficult technical hurdles. Like most things, that means trying to find the right balance between real world and real science.

Unfortunately, there is no formula for determining that balance. Instead, it really is something of an art. I’d love to hear from you about benchmarks (current or from the past) that you think strike this balance well and show the real art of benchmarking.

Bill

Web benchmarking challenges

I think that an important part of any touch benchmark will be a Web component. After all, the always (or almost always) connected nature of these devices is a critical part of their identities. I think such a Web benchmark needs to include a measurement of page load speed (how long it takes to download and render a page).

Creating such a test seems straightforward. Pick a set of sites, such as the five or ten most popular, and then time how long the home page of each takes to load. The problem, however, is that those pages are constantly changing. Every few months, most popular sites do a major redesign. That would obviously affect the results for a test and make it difficult to compare the results of a current test to one from a few months back. An even bigger problem is that the page will be different for one user than for another, because sites typically know things like where you are and what your computer is and adjust the page to match those characteristics. And the ads and the content of the site are constantly changing and updating. Even hitting Refresh on a page can give you a different page.

Given all of those problems, how is it possible to test page loads? One way is to create pages that are similar to those of leading Web sites in terms of things like size, amount of graphics, and dynamic elements. This allows the tests to be consistent over time and across different devices and locations. (Or, at least, as consistent as the variability of the Internet from moment to moment allows.) The problem with this approach, however, is that the pages will age out as the real Web sites update themselves, and they will never be the real sites.
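
To make that second approach concrete, here is a minimal sketch, in Python, of what a page-load timing loop against a fixed, locally hosted test page might look like. This is purely illustrative, not XPRT code: the URL points to a hypothetical mirror page, and the sketch measures only download time, not the rendering a full browser-based test would also have to capture.

import time
import urllib.request

# Hypothetical local mirror built to resemble a popular site's home page.
TEST_PAGE = "http://localhost:8000/mock_news_site/index.html"
RUNS = 5

def fetch_once(url):
    # Return the seconds taken to download the page body (rendering is not measured).
    start = time.perf_counter()
    with urllib.request.urlopen(url) as response:
        response.read()
    return time.perf_counter() - start

times = sorted(fetch_once(TEST_PAGE) for _ in range(RUNS))
print("Median load time over %d runs: %.3f seconds" % (RUNS, times[RUNS // 2]))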

Such are the tradeoffs in benchmarking. The key is how to balance real science with real world considerations. What do you think? Which approach is the better balance of real science and real world?

Bill

An open, top-down process

We’ve been hard at work putting together the RFC for HDXPRT 2012. As a group of us sat around a table discussing what we’d like to see in the benchmark, it became clear to me how different this development process is from those of other benchmarks I’ve had a hand in creating (3D WinBench, Winstone, WebBench, NetBench, and many others). The big difference is not in the design or the coding or even the final product.

The difference is the process.

A sentiment that came up frequently in our meeting was “Sure, but we need to see what the community thinks.” That indicates a very different process than I am used to. Different from what companies developing benchmarks do and different from what benchmark committees do. What it represents, in a word, is openness. We want to include the Development Community in every step of the process, and we want to figure out how to make the process even more open over time. For example, we discussed ideas as radical as videoing our brainstorming sessions.

Another part of the process I think is important is that we are trying to do things top-down. Rather than deciding which applications should be in the benchmark, we want to start by asking how people really use high-definition media. What do people typically do with video? What do they do to create it and how do they watch it? Similarly, what do people do with images and audio?

At least as importantly, we don’t want to include only our opinions and research on these questions; we want to pick your brains and get your input. From there, we will work on the workflows, the applications, and the RFC. Ultimately, that will lead to the scripts themselves. With your input and help, of course!

Please let us know any ideas you have for how to make the process even more open. And tell us what you think about this top-down approach. We’re excited and hope you are, too!

Bill

Getting to the source

Many of the earliest benchmarks came in source code form. Dhrystone and many others relied on the compiler for optimization. In fact, some compilers even recognized the code and basically optimized it to a few lines of code that did nothing but return the result! Even some modern benchmarks, such as SPEC CPU and LINPACK, come in source code form.

The source code to application benchmarks, however, has not typically been available. Two of the leading benchmarks of the last twenty years, Winstone and SYSmark, were never available in source code form. The makers of those tools had good reasons for keeping the code private; we know, because we led the creation of Winstone. Keeping code private protects your intellectual investment, can make it easier to hit development schedules, and provides many other advantages.

It can also, however, lead some people to charge that the reason you’re not showing the source code is that it is in some way biased. In benchmarks, as in so many areas, transparency is the best way to allay such concerns.

Which leads us to today’s big announcement.

We want HDXPRT to be as open as possible, so we’re bucking the normal practice for application-based benchmarks and planning to make the HDXPRT 2011 source code available to the HDXPRT Development Community.

The code will include both the benchmark harness and the scripts that drive the applications. You’ll be able to study everything about the benchmark. You’ll also be able to contribute new code more easily, which is exactly what we hope you’ll do. We want you not only to be completely comfortable with the benchmark but also to contribute to future versions of it.

There will, of course, be some ground rules. We are making the code available only to the HDXPRT Development Community. (If you’re not already a member, joining is cheap and easy: just go here.) Because we want to limit the code to the community, members who want access will have to agree to a license agreement that prevents them from releasing the code to the public.

We don’t have an exact schedule in place yet, but over the next week or two, we should have all the necessary things in place to make the source code available.

When you’ve had a chance to look at it, please let us know what improvements you would like to see in HDXPRT 2012. We’ll discuss that version, and how you can help, in the coming weeks.

Bill

Keeping score

One question I received as a result of the last two blog entries on benchmark anatomy was whether I was going to talk about the results or scores.  That topic seemed like a natural follow up.

All benchmarks need to provide some sort of metric to let you know how well the system under test (SUT) did.  I think the best metrics are the easily understood ones.  These metrics have units like time or watts.  The problem with some of these units is that sometimes smaller can be better.  For example, less time to complete a task is better.  (Of course, more time before the battery runs down is better!)  People generally see bigger bars in a chart as better.

Some tests, however, report results in units that are not so understandable. Units like instructions per second, requests per second, or frames per second are tougher to relate to. Sure, more of any of them is better, but it is not as easy to understand what that means in the real world.

There is a solution to both the problem of smaller is better and non-intuitive units—normalization.  With normalization, you take the result of the SUT and divide it by that of a defined base or calibration system.  The result is a unit-less number.  So, if the base system can do 100 blips a second and the SUT can do 143 blips a second, the SUT would get 143 / 100 or a score of 1.43.  The units cancel out in the math and what is left is a score.  For appearance or convenience, the score may be multiplied by some number like 10 or 100 to make the SUT’s score 14.3 or 143.
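
As a quick illustration of that arithmetic (the function names and numbers here are examples only, not anything from a shipping benchmark), normalization works the same way whether the raw result is a rate or a time:

def normalized_rate_score(sut_rate, base_rate, scale=100):
    # For "bigger is better" results, such as blips per second.
    return (sut_rate / base_rate) * scale

def normalized_time_score(sut_seconds, base_seconds, scale=100):
    # For "smaller is better" results: invert so that a bigger score is better.
    return (base_seconds / sut_seconds) * scale

print(normalized_rate_score(143, 100))  # 143.0, the example from the text
print(normalized_time_score(50, 100))   # 200.0: half the time earns twice the score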

The nice thing about such scores is that it is easy to see how much faster one system is than another.  If you are measuring normalized execution time, a score of 286 means a system is twice as fast as one of 143.  As a bonus, bigger numbers are better.  An added benefit is that it is much easier to combine multiple normalized results into a single score.  These benefits are the reason that many modern benchmarks use normalized scores.
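
One common way to roll several normalized results into a single score is the geometric mean (more on the different kinds of means below). Treat this sketch as a general illustration rather than a description of how HDXPRT computes its overall score:

import math

def geometric_mean(scores):
    # Multiply the scores together and take the nth root, computed here in log space.
    return math.exp(sum(math.log(s) for s in scores) / len(scores))

print(round(geometric_mean([143.0, 200.0, 95.0]), 1))  # 139.5, a single combined score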

There is another kind of score, which is more of a rating.  These scores, such as a number of stars or thumbs up, are good for relative ratings.  However, they are not necessarily linear.  Four thumbs up is better than two, but is not necessarily twice as good.

Next week, we’ll look closer at the results HDXPRT 2011 provides and maybe even venture into the difference between arithmetic, geometric, and harmonic means!  (I know I can’t wait.)

Bill

Anatomy of a benchmark, part II

As we discussed last week, benchmarks (including HDXPRT 2011) are made up of a set of common major components. Last week’s components included the Installer, User Interface (UI), and Results Viewer.  This week, we’ll look more at the guts of a benchmark—the parts that actually do the performance testing.

Once the UI gets the necessary commands and parameters from the user, the Test Harness takes over.  This part is the logic that runs the individual Tests or Workloads using the parameters you specified.  For application-based benchmarks, the harness is particularly critical, because it has to deal with running real applications.  (Simpler benchmarks may mix the harness and test code in a single program.)

The next component consists of the Tests or Workloads themselves.  Some folks use those terms interchangeably, but I try to avoid that practice.  I tend to think of tests as specially crafted code designed to gauge some aspect of a system’s performance, while workloads consist of a set of actions that an application must take as well as the necessary data for those actions.  In HDXPRT 2011, each workload is a set of data (such as photos) and actions (e.g., manipulations of those photos) that an application (e.g., Photoshop Elements) performs.  Application-based benchmarks, such as HDXPRT 2011, typically use some other program or technology to pass commands to the applications.  HDXPRT uses a combination of AutoIT and C code to drive the applications.

When the Harness finishes running the tests or workloads, it collects the results. It then either passes those results to the Results Viewer or writes them to a file for viewing in Excel or some other program.
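
To make that flow concrete, here is a deliberately oversimplified harness sketch in Python. It is not the actual HDXPRT harness, which drives real applications through AutoIT and C; the workload name, data, and action below are hypothetical stand-ins.

import csv
import time

def resize_photos(photos):
    # Stand-in for driving a real application against the workload's data.
    return [name.lower() for name in photos]

# Each workload pairs its data with the actions to perform on that data.
workloads = {
    "photo_edit": (["IMG_001.JPG", "IMG_002.JPG"], resize_photos),
}

results = []
for name, (data, action) in workloads.items():
    start = time.perf_counter()
    action(data)                                   # run the workload
    results.append((name, time.perf_counter() - start))

# Hand the results off for viewing, here as a simple CSV file.
with open("results.csv", "w", newline="") as out:
    writer = csv.writer(out)
    writer.writerow(("workload", "seconds"))
    writer.writerows(results)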

As we look to improve HDXPRT for next year, what improvements would you like to see in each of those areas?

Bill
