Category: HDXPRT workloads

HDXPRT 2012 characterization study

on May 4, 2012

For HDXPRT 2011, we did quite a bit of testing to characterize the benchmark. Those results appeared in an initial white paper and a follow-up paper. In the first, we ran tests on different processors, various amounts of RAM, internal vs. external graphics, hard disk vs. SSD, and the effects of Intel Turbo Boost Technology. In the follow-up white paper, we looked in more depth at the effect of graphics cards with different processors.

I mentioned a couple weeks ago that we are starting to put together a testbed to help us characterize HDXPRT 2012. What is in that testbed will define what characteristics of the upcoming benchmark we measure. We would like to get your help defining that testbed.

Our current thinking is to do a similar set of tests this year with updated hardware. However, plenty of additional things would be interesting to look at. First, I would like to increase the range of processors we test, including AMD processors. I would also like to do some testing varying different processor characteristics such as threads, cores, and frequency. It might also be good to look at the effect of new technologies like hybrid drives (which combine a small SSD with a hard disk to try and have the best of both).

We face two challenges in doing these characterization tests. One is to change only one variable at a time. That is very difficult in some cases, such as comparing Intel and AMD processors, because you can’t just swap them in the same motherboard. Fortunately, it is usually possible to find very similar motherboards and keep other components (like disks, graphics, and RAM) constant. The other challenge is getting all of the necessary hardware in house.

So, we have two requests for you. First, let us know what you would like to see us test. Second, help us by supplying some of that equipment. If you supply the equipment we will do our best to include results from it in the characterization study and in the new HDXPRT 2012 results database. As always, thanks for your help!

Bill


Posted in HDXPRT 2012, HDXPRT development process, HDXPRT workloads, Let us know your thoughts |

What to do with all the times

By Bill Catchings

on November 17, 2011

HDXPRT, like most other application-based benchmarks, works by timing lots of individual operations. Some other benchmarks just time the entire script. The downside of that approach is that the time includes things that are constant regardless of the speed of the underlying hardware. Some things, like how fast a menu drops down or text scrolls, are tied to the user experience and should not go faster on a faster system. Including those items in the overall time dilutes the importance of the operations that we wait on and are frustrated by, the operations we need to time.
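A quick way to see the dilution effect is with made-up numbers. The sketch below assumes two hypothetical PCs and a hypothetical 40 seconds of fixed-speed activity (menus dropping, text scrolling); none of these values come from HDXPRT.

```python
def speedup(slow_seconds, fast_seconds):
    """How many times faster the fast system is than the slow one."""
    return slow_seconds / fast_seconds

# Hypothetical values, for illustration only.
fixed_ui_seconds = 40.0   # time that does not shrink on faster hardware
slow_pc_work = 60.0       # seconds the slow PC spends on operations we wait on
fast_pc_work = 30.0       # seconds the fast PC spends on the same operations

# Timing only the operations we wait on shows the full difference.
timed_operations_speedup = speedup(slow_pc_work, fast_pc_work)

# Timing the entire script mixes in the constant time and shrinks the gap.
whole_script_speedup = speedup(
    slow_pc_work + fixed_ui_seconds,
    fast_pc_work + fixed_ui_seconds,
)

print(f"timed operations: {timed_operations_speedup:.2f}x")
print(f"whole script:     {whole_script_speedup:.2f}x")
```

With these numbers, the fast PC is twice as fast on the operations that matter, but the whole-script time makes it look less than one and a half times as fast.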

In the case of HDXPRT 2011, we time between 20 and 30 operations. We then roll these up into the times we report as well as the overall score. We do not, however, report the individual times. We expect to include even more timed operations in HDXPRT 2012. As we have been thinking about what the right metrics are, we have started to wonder what to do with all of those times. We could total up the times of similar operations and create additional results. For example, we could total up all the application load times and produce an application-load result. Or, we could total up all the times for an individual application and produce an application result. I can definitely see value in results like those.
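As a rough sketch of what those rollups might look like, the snippet below totals a handful of timed operations by application and by kind of operation. The application names, operation names, and times are all hypothetical, and this is not how HDXPRT stores its timings.

```python
from collections import defaultdict

# Hypothetical timed operations: (application, kind of operation, seconds).
timings = [
    ("PhotoEditor", "load", 4.0),
    ("PhotoEditor", "apply_filter", 6.5),
    ("VideoConverter", "load", 3.0),
    ("VideoConverter", "convert", 42.0),
]

def total_by(records, key_index):
    """Sum the seconds of all records that share the field at key_index."""
    totals = defaultdict(float)
    for record in records:
        totals[record[key_index]] += record[2]
    return dict(totals)

per_application = total_by(timings, 0)   # one result per application
per_operation = total_by(timings, 1)     # one result per kind of operation
application_load_time = per_operation["load"]

print(per_application)
print("application-load result:", application_load_time)
```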

Another possibility is to try and look at the general pattern of the results to understand responsiveness. One way would be to collect the times in a histogram, where buckets correspond to ranges of response times for the operations. Such a histogram might give a sense of how responsive a target system feels to an end user. There are certainly other possibilities as well.
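Here is one minimal way such a histogram could be built. The bucket boundaries (1, 5, and 30 seconds) and the sample times are assumptions chosen for illustration; the right boundaries for judging responsiveness are an open question.

```python
from bisect import bisect_right

def response_histogram(times, edges):
    """Count how many response times fall into each bucket.

    edges holds ascending boundaries in seconds. Bucket 0 holds times below
    edges[0], bucket i holds times t with edges[i-1] <= t < edges[i], and the
    last bucket holds every time at or above edges[-1].
    """
    counts = [0] * (len(edges) + 1)
    for t in times:
        counts[bisect_right(edges, t)] += 1
    return counts

# Hypothetical response times in seconds.
sample_times = [0.3, 0.8, 1.0, 2.5, 12.0, 45.0]
bucket_edges = [1.0, 5.0, 30.0]

histogram = response_histogram(sample_times, bucket_edges)
labels = ["under 1 s", "1 to 5 s", "5 to 30 s", "30 s or more"]
for label, count in zip(labels, histogram):
    print(f"{label:>12}: {'#' * count}")
```

A system whose counts pile up in the first bucket would likely feel snappier than one with the same total time spread across the slower buckets.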

If nothing else, I think it makes sense to expose these times in some way. If we make them available, I’m confident that people will find ways to use them. My concern is the danger of burdening a benchmark with too many results. The engineer in me loves all the data possible. The product designer knows that focus is critical. Successful benchmarks have one or maybe two results. How to balance the two?

One wonder of this benchmark development community is the ability to ask you what you think. What would you prefer, simple and clean or lots of numbers? Maybe a combination where we just have the high-level results we have now, but also make other results or times available in an “expert” or an “advanced” mode? What do you think?

Bill


Posted in HDXPRT Development Community benefits, HDXPRT metrics, HDXPRT workloads, Let us know your thoughts |

Scoring with HDXPRT

By Bill Catchings

on August 31, 2011

Two weeks ago, I began explaining how benchmarks keep score (http://www.hdxprt.com/blog/2011/08/17/keeping-score/). HDXPRT 2011 fundamentally measures the time a PC requires to complete a series of tasks, such as editing photos and converting videos from one format to another. It uses the times of three sets of tasks to come up with three use case times (Edit videos from your camcorder, Create memories from your digital camera, and Prepare media for on-the-go). Because an early version of the benchmark took too long to run, we trimmed the size of the workloads (such as the number of photos) to make it complete more quickly. Because we believed the size of the original workloads was realistic, we extrapolated what the time would have been for the full-size workloads by multiplying the measured times by the ratio of the original size to the trimmed size. That process results in times in minutes.
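The extrapolation step can be sketched in a few lines. This assumes the scaling is a simple ratio of the original workload size to the trimmed size, and the photo counts and measured time below are invented for the example.

```python
def extrapolate_minutes(measured_seconds, full_size, trimmed_size):
    """Scale a time measured on a trimmed workload up to the full workload.

    Assumes time grows in proportion to workload size, which is a
    simplification, and reports the result in minutes.
    """
    return measured_seconds * (full_size / trimmed_size) / 60

# Hypothetical: 90 seconds to process 50 photos, original workload of 200.
estimated_minutes = extrapolate_minutes(90.0, full_size=200, trimmed_size=50)
print(f"{estimated_minutes:.1f} minutes")
```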

We could have simply combined the three times into one total time, but doing so would have created a score where smaller is better, which can be confusing. To avoid this, HDXPRT 2011 normalizes the three times to the times a calibration, or base, system required to complete the same work. The benchmark then calculates a geometric mean of those three normalized scores and multiplies that number by 100 to create the overall Create HD Score. This scoring method sets the calibration system’s score to 100 and makes it easy for you to compare multiple systems. For example, if PC A gets a score of 200, and PC B gets a 400, PC B is twice the speed of PC A (and four times the speed of the calibration system) at creating HD content.
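To make the arithmetic concrete, here is a small sketch of that style of scoring. It assumes each normalized score is the calibration system's time divided by the tested system's time (so that faster systems score higher), and the times themselves are made up. The function name is hypothetical, not part of HDXPRT.

```python
import math

def overall_score(system_minutes, calibration_minutes):
    """Geometric mean of normalized times, scaled so calibration scores 100."""
    # Assumption: normalize as calibration / system so that bigger is better.
    normalized = [c / s for s, c in zip(system_minutes, calibration_minutes)]
    geometric_mean = math.prod(normalized) ** (1 / len(normalized))
    return 100 * geometric_mean

# Hypothetical use case times in minutes for the three use cases.
calibration = [30.0, 20.0, 10.0]
pc_a = [15.0, 10.0, 5.0]    # half the calibration times
pc_b = [7.5, 5.0, 2.5]      # a quarter of the calibration times

print(round(overall_score(calibration, calibration)))  # calibration system
print(round(overall_score(pc_a, calibration)))
print(round(overall_score(pc_b, calibration)))
```

Because the calibration system is normalized against itself, every ratio is 1 and its score comes out at 100, which matches the description above.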

The term “geometric mean” might be unfamiliar. One way to get benchmark geeks arguing is to ask about the correct mean for combining results. (Yes, there really are enough of us for an argument.) At the risk of inflaming my fellow benchmark geeks, I will give a quick summary of the main ways people combine results.

An arithmetic mean is a simple average, where you add all the numbers and divide by the number of numbers. It is good for combining amounts, such as gigabytes of RAM, across multiple computers.

A geometric mean is more mathematically complex. You compute it by multiplying all the numbers and then taking the nth root, where n is the number of numbers. This kind of mean is appropriate for combining normalized numbers. Its advantage over the arithmetic mean is that it keeps one really good number from drowning out all the others.

The final mean is the harmonic. You calculate it by dividing the number of numbers by the sum of the reciprocals of the numbers (that is, 1 divided by each number). (If that makes little sense to you, don’t worry about it!) The harmonic mean is appropriate for combining rates, such as megabytes per second.
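For anyone who would rather see the three means side by side, here is a short sketch using the standard definitions (the harmonic mean is the count divided by the sum of the reciprocals). The sample numbers are invented, with one value much larger than the others to show how each mean reacts.

```python
import math

def arithmetic_mean(values):
    """Add all the numbers and divide by how many there are."""
    return sum(values) / len(values)

def geometric_mean(values):
    """Multiply all the numbers and take the nth root."""
    return math.prod(values) ** (1 / len(values))

def harmonic_mean(values):
    """Divide the count by the sum of the reciprocals."""
    return len(values) / sum(1 / v for v in values)

# Hypothetical normalized results: two ordinary, one really good.
results = [1.0, 1.0, 4.0]

print(f"arithmetic: {arithmetic_mean(results):.3f}")
print(f"geometric:  {geometric_mean(results):.3f}")
print(f"harmonic:   {harmonic_mean(results):.3f}")
```

The one large value pulls the arithmetic mean up the most, the geometric mean less, and the harmonic mean least of all.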

I should also mention one other result from HDXPRT 2011, the Overall Play HD Experience score. This is a very different kind of score that uses one to five stars to indicate the quality of three HD video playbacks. HDXPRT uses mean opinion scores (MOS) based on smoothness of playback to compute these results. (I’ll discuss MOS in more detail in a future blog.) With this kind of score, a four-star rating is better than a two-star rating, but it is hard to say how much better. The MOS research indicates that people would rate the four-star playback as good and the two-star playback as poor, but you can’t say that one is twice as good as the other because the relationship is not linear.

What do you think of the metrics that HDXPRT 2011 provides? Are there others you would find more useful or meaningful? Your input is vital to improving the benchmark and making sure it does what you want it to do.

Bill


Posted in HDXPRT, HDXPRT 2011 results, HDXPRT capabilities, HDXPRT development process, HDXPRT metrics, HDXPRT workloads |

Anatomy of a benchmark, part II

By Bill Catchings

on August 10, 2011

As we discussed last week, benchmarks (including HDXPRT 2011) are made up of a set of common major components. Last week’s components included the Installer, User Interface (UI), and Results Viewer. This week, we’ll look more at the guts of a benchmark—the parts that actually do the performance testing.

Once the UI gets the necessary commands and parameters from the user, the Test Harness takes over. This part is the logic that runs the individual Tests or Workloads using the parameters you specified. For application-based benchmarks, the harness is particularly critical, because it has to deal with running real applications. (Simpler benchmarks may mix the harness and test code in a single program.)

The next component consists of the Tests or Workloads themselves. Some folks use those terms interchangeably, but I try to avoid that practice. I tend to think of tests as specially crafted code designed to gauge some aspect of a system’s performance, while workloads consist of a set of actions that an application must take as well as the necessary data for those actions. In HDXPRT 2011, each workload is a set of data (such as photos) and actions (e.g., manipulations of those photos) that an application (e.g., Photoshop Elements) performs. Application-based benchmarks, such as HDXPRT 2011, typically use some other program or technology to pass commands to the applications. HDXPRT uses a combination of AutoIT and C code to drive the applications.

When the Harness finishes running the tests or workloads, it collects the results. It then either passes those results to the Results Viewer or writes them to a file for viewing in Excel or some other program.
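The overall flow of a harness can be sketched in a few lines. To be clear, this is not HDXPRT's harness, which drives real applications with AutoIT and C code. It is a toy Python stand-in with hypothetical function names, where simple computations play the role of workloads and the results go to CSV text that Excel could open.

```python
import csv
import io
import time

def run_workloads(workloads, iterations=1):
    """Run each (name, action) pair and return (name, run, seconds) rows."""
    results = []
    for name, action in workloads:
        for run in range(1, iterations + 1):
            start = time.perf_counter()
            action()
            results.append((name, run, time.perf_counter() - start))
    return results

def write_results(results, stream):
    """Write the collected rows as CSV to any file-like object."""
    writer = csv.writer(stream)
    writer.writerow(["workload", "run", "seconds"])
    writer.writerows(results)

# Stand-in workloads; a real harness would drive applications instead.
workloads = [
    ("sum_squares", lambda: sum(i * i for i in range(100_000))),
    ("sort_numbers", lambda: sorted(range(100_000, 0, -1))),
]

results = run_workloads(workloads, iterations=2)
buffer = io.StringIO()
write_results(results, buffer)
print(buffer.getvalue())
```

Keeping the run loop separate from the code that writes results mirrors the split described above, where the harness hands its results to a viewer or a file.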

As we look to improve HDXPRT for next year, what improvements would you like to see in each of those areas?

Bill


Posted in Benchmarks in general, HDXPRT workloads, Let us know your thoughts |

Waiting sucks

By Mark Van Name

on May 25, 2011

You know it does. Time is the most precious commodity, the one thing you can never get back. So when someone or something makes you wait, it sucks.

It particularly sucks when you have to wait on your PC. It’s your computer, after all, and it should do the work and be quick about it. For many tasks, it is quick, almost instantaneous. Some, though, require so much work that the computer can spend a lot of time doing them, leaving you waiting. Tasks that involve working with different types of media often fall into that category.

Which is exactly why we have HDXPRT.

It gives you a way to compare how long different PCs require to perform some common media-manipulation tasks. Because those times can be significant—sometimes many seconds, but also sometimes many minutes—HDXPRT can give you valuable information that you can factor into your PC buying plans.

After all, the faster a PC is at this sort of work, the less time you’ll spend waiting on it—and that’s a good thing.

Mark Van Name


Posted in HDXPRT metrics, HDXPRT workloads |
