

Best practices in benchmarking

From time to time, a tester writes to ask for help determining why they see different WebXPRT scores on two systems that have the same hardware configuration. The scores sometimes differ by a significant percentage. This can happen for many reasons, including different software stacks, but score variability can also result from different testing behavior and environments. While a small amount of variability is normal, these types of questions provide an opportunity to talk about the basic benchmarking practices we follow in the XPRT lab to produce the most consistent and reliable scores.

Below, we list a few basic best practices you might find useful in your testing. Most of them relate to evaluating browser performance with WebXPRT, but several of these practices apply to other benchmarks as well.

  • Test with clean images: We typically use an out-of-box (OOB) method for testing new devices in the XPRT lab. OOB testing means that, other than running the initial OS and browser updates that users are likely to run after first turning on the device, we change as little as possible before testing. We want to assess the performance buyers are likely to see when they first purchase the device, before they install additional apps and utilities. While the OOB approach isn’t appropriate for every type of testing, the key is to avoid testing a device that’s bogged down with programs that will influence the results.
  • Turn off automatic updates: We do our best to eliminate or minimize app and system updates after initial setup. Some vendors are making it more difficult to turn off updates completely, but you should always double-check update settings before testing.
  • Get a baseline for system processes: Depending on the system and the OS, a significant amount of system-level activity can be going on in the background after you turn it on. As much as possible, we like to wait for a stable (idle) baseline of system activity before kicking off a test. If we start testing immediately after booting the system, we often see higher variance in the first run before the scores start to tighten up.
  • Hardware is not the only important factor: Most people know that different browsers produce different performance scores on the same system. However, testers aren’t always aware of shifts in performance between different versions of the same browser. While most updates don’t have a large impact on performance, a few have changed browser performance by a significant amount, boosting it in some cases and reducing it in others. For this reason, it’s always worthwhile to record and disclose the extended browser version number for each test run. The same principle applies to any other relevant software.
  • Use more than one data point: Because of natural variance, our standard practice in the XPRT lab is to publish a score that represents the median of three to five runs, if not more. If you run a benchmark only once and the score differs significantly from other published scores, your result could be an outlier that you would not see again under stable testing conditions. The short sketch after this list shows one way to put this tip, and the baseline tip above, into practice.
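
For testers who like to script parts of their process, here is a minimal sketch in Python of those two habits: waiting for background CPU activity to settle before the first run, and reporting the median of several manually recorded scores. It is not an official XPRT tool; the idle threshold, sample counts, and example scores are illustrative assumptions, and it relies on the third-party psutil package.

    # Minimal sketch, not an official XPRT tool: wait for background CPU
    # activity to settle, then report the median of manually recorded scores.
    # Requires the third-party psutil package (pip install psutil).
    import statistics
    import psutil

    IDLE_THRESHOLD = 5.0    # percent CPU we treat as "idle" (assumption; tune per system)
    STABLE_SAMPLES = 6      # consecutive calm samples required before testing
    SAMPLE_SECONDS = 10     # seconds per sample

    def wait_for_idle_baseline():
        """Block until CPU utilization stays below IDLE_THRESHOLD long enough."""
        calm = 0
        while calm < STABLE_SAMPLES:
            usage = psutil.cpu_percent(interval=SAMPLE_SECONDS)
            calm = calm + 1 if usage < IDLE_THRESHOLD else 0
            print(f"CPU at {usage:.1f}% ({calm}/{STABLE_SAMPLES} calm samples)")
        print("System looks idle. Start your first benchmark run.")

    if __name__ == "__main__":
        wait_for_idle_baseline()
        # Replace these example values with the scores you record by hand.
        recorded_scores = [231, 228, 233]
        print(f"Median of {len(recorded_scores)} runs: {statistics.median(recorded_scores)}")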

We hope these tips will help make your testing more accurate. If you have any questions about the XPRTs, or about benchmarking in general, feel free to ask!

Justin

Understanding concurrent instances in AIXPRT

Over the past few weeks, we’ve discussed several of the key configuration variables in AIXPRT, such as batch size and level of precision. Today, we’re discussing another key variable: number of concurrent instances. In the context of machine learning inference, this refers to how many instances of the network model (ResNet-50, SSD-MobileNet, etc.) the benchmark runs simultaneously.

By default, the toolkits in AIXPRT run one instance at a time and distribute the compute load according to the characteristics of the CPU or GPU under test, as well as any relevant optimizations or accelerators in the toolkit’s reference library. By setting the number of concurrent instances to a number greater than one, a tester can use multiple CPUs or GPUs to run multiple instances of a model at the same time, usually to increase throughput.

With multiple concurrent instances, a tester can leverage additional compute resources to potentially achieve higher throughput without sacrificing latency goals.
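
As a rough illustration of that tradeoff, here is a back-of-the-envelope sketch in Python. The numbers are hypothetical placeholders, not AIXPRT results, and real scaling depends on whether the instances contend for the same compute resources.

    # Back-of-the-envelope illustration with hypothetical numbers (not AIXPRT results):
    # when instances don't contend for hardware, aggregate throughput scales with the
    # instance count while per-request latency stays near the single-instance value.
    instances = 4                       # hypothetical number of concurrent instances
    per_instance_throughput = 100.0     # hypothetical inferences per second, per instance
    per_instance_latency_ms = 10.0      # hypothetical latency per inference, in milliseconds

    aggregate_throughput = instances * per_instance_throughput
    print(f"Aggregate throughput: ~{aggregate_throughput:.0f} inferences/s")
    print(f"Per-request latency:  ~{per_instance_latency_ms:.0f} ms (roughly unchanged)")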

In the current version of AIXPRT, testers can run multiple concurrent instances in the OpenVINO, TensorFlow, and TensorRT toolkits. When AIXPRT Community Preview 3 becomes available, this option will extend to the MXNet toolkit. OpenVINO and TensorRT automatically allocate hardware for each instance and don’t let users make manual adjustments. TensorFlow and MXNet require users to manually bind instances to specific hardware. (Manual hardware allocation for multiple instances is more complicated than we can cover today, so we may devote a future blog entry to that topic.)

Setting the number of concurrent instances in AIXPRT

The screenshot below shows part of a sample config file (the same one we used when we discussed batch size and precision). The value in the “concurrent instances” row indicates how many concurrent instances will be operating during the test. In this example, the number is one. To change that value, a tester simply replaces it with the desired number and saves the changes.

[Screenshot: sample AIXPRT config file with the “concurrent instances” value set to 1]
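
If you would rather script the change than edit the file by hand, here is a minimal Python sketch. It assumes the config file is JSON and that the key is spelled exactly “concurrent instances,” as in the screenshot; the file path is a placeholder, so substitute the path to your own config.

    # Minimal sketch (see assumptions above): set every "concurrent instances"
    # entry in an AIXPRT-style JSON config file to a new value.
    import json

    CONFIG_PATH = "path/to/your_config.json"   # placeholder: point this at your config file
    NEW_INSTANCES = 2                          # desired number of concurrent instances

    def set_concurrent_instances(node, value):
        """Recursively update any 'concurrent instances' keys in the parsed JSON."""
        if isinstance(node, dict):
            for key, child in node.items():
                if key == "concurrent instances":
                    node[key] = value
                else:
                    set_concurrent_instances(child, value)
        elif isinstance(node, list):
            for child in node:
                set_concurrent_instances(child, value)

    with open(CONFIG_PATH) as f:
        config = json.load(f)

    set_concurrent_instances(config, NEW_INSTANCES)

    with open(CONFIG_PATH, "w") as f:
        json.dump(config, f, indent=4)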

If you have any questions or comments (about concurrent instances or anything else), please feel free to contact us.

Justin

Transparent goals

Recently, Forbes published an article discussing a new report on phone battery life from Which?, a UK consumer advocacy group. In the report, Which? states that they tested the talk-time battery life of 50 phones from five brands. During the tests, phones from three of the brands lasted longer than their manufacturers claimed, while phones from a fourth brand fell short of their claims by about five percent. The fifth brand’s published battery life numbers were 18 to 51 percent higher than the times Which? recorded in their tests.

Folks can read the article for more details about the tests and the brands. While the report raises some interesting questions, and the article provides readers with brief test methodology descriptions from Which? and one manufacturer, we don’t know enough about the tests to say which set of claims is correct. Any number of variables related to test workloads or device configuration settings could significantly affect the results. Both parties may be using sound benchmarking principles in good faith, but their test methodologies may not be comparable. As it is, we simply don’t have enough information to evaluate the study.

Whether the issue is battery life or any other important device spec, information conflicts, such as the one that the Forbes article highlights, can leave consumers scratching their heads, trying to decide which sources are worth listening to. At the XPRTs, we believe that the best remedy for this type of problem is to provide complete transparency into our testing methodologies and development process. That’s why our lab techs verify all the hardware specs for each XPRT Weekly Tech Spotlight entry. It’s why we publish white papers explaining the structure of our benchmarks in detail, as well as how the XPRTs calculate performance results. It’s also why we employ an open development community model and make each XPRT’s source code available to community members. When we’re open about how we do things, it encourages the kind of honest dialogue between vendors, journalists, consumers, and community members that serves everyone’s best interests.

If you love tech and share that same commitment to transparency, we’d love for you to join our community, where you can access XPRT source code and previews of upcoming benchmarks. Membership is free for anyone with a verifiable corporate affiliation. If you have any questions about membership or the registration process, please feel free to ask.

Justin

The value of speed

I was reading an interesting article on how high-end smartphones like the iPhone X, Pixel 2 XL, and Galaxy S8 generate more in-game revenue than cheaper phones do.

One line stood out to me: “With smartphones becoming faster, larger and more capable of delivering an engaging gaming experience, these monetization key performance indicators (KPIs) have begun to increase significantly.”

It turns out the game companies totally agree with the rest of us that faster devices are better!

Regardless of who is seeking better performance, whether consumers or game companies, the obvious question is how to determine which models are fastest. Many folks rely on device vendors’ claims about how much faster the new model is. Unfortunately, vendors don’t always specify what they base those claims on. Even when they do, it’s hard to know whether the numbers are accurate and applicable to the way you use your device.

The key part of any answer is performance tools that are representative, dependable, and open.

  • Representative – Performance tools need to have realistic workloads that do things that you care about.
  • Dependable – Good performance tools run reliably and produce repeatable results, both of which require that significant work go into their development and testing.
  • Open – Performance tools that allow people to access the source code, and even contribute to it, keep things above the table and reassure you that you can rely on the results.

Our goal with the XPRTs is to provide performance tools that meet all these criteria. WebXPRT 3 and all our other XPRTs exist to help accurately reveal how devices perform. You can run them yourself or rely on the wealth of results that we and others have collected on a wide array of devices.

The best thing about good performance tools is that everyone, even vendors, can use them. I sincerely hope that you find the XPRTs helpful when you make your next technology purchase.

Bill

Notes from the lab

This week’s XPRT Weekly Tech Spotlight featured the Alcatel A30 Android phone. We chose the A30, an Amazon exclusive, because it’s a budget phone running Android 7.0 (Nougat) right out of the box. That may be an appealing combination for consumers, but running a newer OS on inexpensive hardware such as what’s found in the A30 can cause issues for app developers, and the XPRTs are no exception.

Spotlight fans may have noticed that we didn’t post a MobileXPRT 2015 or BatteryXPRT 2014 score for the A30. In both cases, the benchmark did not produce an overall score because of a problem that occurs during the Create Slideshow workload. The issue deals with text relocation and significant changes in the Android development environment.

As of Android 5.0, on 64-bit devices, the OS doesn’t allow native code executables to perform text relocation. Instead, it is necessary to compile the executables using position-independent code (PIC) flags. This is how we compiled the current version of MobileXPRT, and it’s why we updated BatteryXPRT earlier this year to maintain compatibility with the most recent versions of Android.

However, the same approach doesn’t work for SoCs built with older 32-bit ARMv7-A architectures, such as the A30’s Qualcomm Snapdragon 210, so testers may encounter this issue on other devices with low-end hardware.

Testers who run into this problem can still use MobileXPRT 2015 to generate individual workload scores for the Apply Photo Effects, Create Photo Collages, Encrypt Personal Content, and Detect Faces workloads. Also, BatteryXPRT will produce an estimated battery life for the device, but since it won’t produce a performance score, we ask that testers use those numbers for informational purposes and not publication.

If you have any questions or have encountered additional issues, please let us know!

Justin

Celebrating one year of the XPRT Weekly Tech Spotlight

It’s been just over a year since we launched the XPRT Weekly Tech Spotlight by featuring our first device, the Google Pixel C. Spotlight has since become one of the most popular items at BenchmarkXPRT.com, and we thought now would be a good time to recap the past year, offer more insight into the choices we make behind the scenes, and look at what’s ahead for Spotlight.

The goal of Spotlight is to provide PT-verified specs and test results that can help consumers make smart buying decisions. We try to include a wide variety of device types, vendors, software platforms, and price points in our inventory. The devices also tend to fall into one of two main groups: popular new devices generating a lot of interest and devices that have unique form factors or unusual features.

To date, we’ve featured 56 devices: 16 phones, 11 laptops, 10 two-in-ones, 9 tablets, 4 consoles, 3 all-in-ones, and 3 small-form-factor PCs. The operating systems these devices run include Android, Chrome OS, iOS, macOS, OS X, Windows, and an array of vendor-specific OS variants and skins.

As much as possible, we test using out-of-the-box (OOB) configurations. We want to present test results that reflect what everyday users will experience on day one. Depending on the vendor, the OOB approach can mean that some devices arrive bogged down with bloatware while others are relatively clean. We don’t attempt to “fix” anything in those situations; we simply test each device “as is” when it arrives.

If devices arrive with outdated OS versions (as is often the case with Chromebooks), we update to the current versions before testing, because that’s the best reflection of what everyday users will experience. In the past, that approach would’ve been more complicated with Windows systems, but Microsoft’s shift to “Windows as a service” means that most users now receive significant OS updates automatically by default.

The OOB approach also means that the WebXPRT scores we publish reflect the performance of each device’s default browser, even if it’s possible to install a faster browser. Our goal isn’t to perform a browser shootout on each device, but to give an accurate snapshot of OOB performance. For instance, last week’s Alienware Steam Machine entry included two WebXPRT scores: a 356 on the SteamOS browser app and a 441 on Iceweasel 38.8.0 (a Firefox variant used in the device’s Linux-based desktop mode). That’s a significant difference, roughly 24 percent, but the main question for us was which browser was more likely to be used in an OOB scenario. With the Steam Machine, the answer was truly “either one.” Many users will use the browser app in the SteamOS environment, and many will take the few steps needed to access the desktop environment. In that case, even though one browser was significantly faster than the other, choosing to omit one score in favor of the other would have excluded results from an equally likely OOB environment.

We’re always looking for ways to improve Spotlight. We recently began including more photos for each device, including ones that highlight important form-factor elements and unusual features. Moving forward, we plan to expand Spotlight’s offerings to include automatic score comparisons, additional system information, and improved graphical elements. Most importantly, we’d like to hear your thoughts about Spotlight. What devices and device types would you like to see? Are there specs that would be helpful to you? What can we do to improve Spotlight? Let us know!

Justin
