
Category: Benchmarking computing devices

Browser-based AI tests in WebXPRT 4: face detection and image classification

I recently revisited an XPRT blog entry that we posted from CES Las Vegas back in 2020. In that post, I reflected on the show’s expanded AI emphasis, and I wondered if we were reaching a tipping point where AI-enhanced and AI-driven tools and applications would become a significant presence in people’s daily lives. It felt like we were approaching that point back then with the prevalence of AI-powered features such as image enhancement and text recommendation, among many others. Now, seamless AI integration with common online tasks has become so widespread that many people unknowingly benefit from AI interactions several times a day.

As AI’s role in areas like everyday browser activity continues to grow—along with our expectations for what our consumer devices should be able to handle—reliable AI-oriented benchmarking is more vital than ever. We need objective performance data that can help us understand how well a new desktop, laptop, tablet, or phone will handle AI tasks.

WebXPRT 4 already includes timed AI tasks in two of its workloads: the “Organize Album using AI” workload and the “Encrypt Notes and OCR Scan” workload. These two workloads reflect the types of light browser-side inference tasks that are now fairly common in consumer-oriented web apps and extensions. In today’s post, we’ll provide some technical information about the Organize Album workload. In a future post, we’ll do the same for the Encrypt Notes workload.

The Organize Album workload includes two different timed tasks that reflect a common scenario of organizing online photo albums. The workload utilizes the AI inference and JavaScript capabilities of the WebAssembly (Wasm) version of OpenCV.js—an open-source computer vision and machine learning library. In WebXPRT 4, we used OpenCV.js version 4.5.2.

Here are the details for each task:

  • The first task measures the time it takes to complete a face detection job with a set of five 720 x 480 photos that we sourced from commercial photo sites. The workload loads a Caffe deep learning framework model (res10_300x300_ssd_iter_140000_fp16.caffemodel) using the commands found here.
  • The second task measures the time it takes to complete an image classification job (labeling based on object detection) with a different set of five 718 x 480 photos that we sourced from the ImageNet computer vision dataset. The workload loads an ONNX-based SqueezeNet machine learning model (squeezenet.onnx v 1.0) using the commands found here. (For a general sense of what this pattern looks like in code, see the sketch after this list.)
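
For readers who want a feel for what this kind of browser-side inference looks like, here is a minimal JavaScript sketch of the overall pattern: load a model with OpenCV.js’s dnn module, convert an image to a blob, and time a forward pass. This is not WebXPRT source code; the file names, input-element ID, and timing logic are illustrative, and a real page would first need to load the Wasm build of OpenCV.js and write the model files into its virtual filesystem.

```js
// Minimal sketch (not WebXPRT source): timing one face-detection pass with OpenCV.js.
// Assumes opencv.js (the Wasm build) has finished loading and the model files have
// already been written into the Emscripten virtual filesystem (e.g., via cv.FS_createDataFile).
function timeFaceDetection(imageElementId) {
  // Load the Caffe SSD face-detection model mentioned above
  // ('deploy.prototxt' is a hypothetical companion file name).
  const net = cv.readNetFromCaffe(
    'deploy.prototxt',
    'res10_300x300_ssd_iter_140000_fp16.caffemodel'
  );

  const src = cv.imread(imageElementId);           // read pixels from an <img> or <canvas>
  const blob = cv.blobFromImage(
    src, 1.0, new cv.Size(300, 300),               // resize to the network's 300x300 input
    new cv.Scalar(104, 177, 123), false, false     // mean values commonly used with this model
  );

  const start = performance.now();                 // simple wall-clock timing
  net.setInput(blob);
  const detections = net.forward();                // run inference in Wasm
  const elapsedMs = performance.now() - start;
  console.log(`Face detection took ${elapsedMs.toFixed(1)} ms`);

  // Free OpenCV.js allocations explicitly; Wasm heap memory is not garbage collected.
  src.delete(); blob.delete(); detections.delete(); net.delete();
  return elapsedMs;
}
```

The image classification task follows the same shape, except that it loads the SqueezeNet ONNX model (for example, with cv.readNetFromONNX) and maps the network’s output to class labels.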

To produce a score for each iteration of the workload, WebXPRT calculates the total time that it takes for a system to organize both albums. In a standard test, WebXPRT runs seven iterations of the entire six-workload performance suite before calculating an overall test score. You can find out more about the WebXPRT results calculation process here.
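
For readers who like to see the mechanics, the short JavaScript sketch below shows one common way benchmarks combine per-workload results: normalize each workload’s time against a reference system and take a geometric mean so that no single workload dominates. This is an illustration of the general technique only, with invented numbers; WebXPRT’s actual calculation is described in the results-calculation material linked above.

```js
// Illustrative only (not WebXPRT's actual formula): combine workload times into a score.
function exampleScore(workloadTimesMs, referenceTimesMs) {
  // Ratios > 1 mean the test system finished a workload faster than the reference system.
  const ratios = workloadTimesMs.map((t, i) => referenceTimesMs[i] / t);
  // Geometric mean of the ratios, scaled to a convenient integer.
  const geomean = Math.exp(ratios.reduce((sum, r) => sum + Math.log(r), 0) / ratios.length);
  return Math.round(geomean * 100);
}

// Hypothetical per-workload times (ms) for a test system and a reference system.
const testTimes = [1200, 950, 2100, 800, 1500, 600];
const refTimes  = [1500, 1000, 2400, 900, 1800, 700];
console.log(exampleScore(testTimes, refTimes));  // ≈ 115 with these made-up numbers
```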

We hope this post will give you a better sense of how WebXPRT 4 measures one kind of AI performance. As a reminder, if you want to dig into the details at a more granular level, you can access the WebXPRT 4 source code for free. In previous blog posts, you can find information about how to access and use the code. You can also read more about WebXPRT’s overall structure and other workloads in the Exploring WebXPRT 4 white paper.

If you have any questions about this workload or any other aspect of WebXPRT 4, please let us know!

Justin

WebXPRT in PT reports

We don’t just make WebXPRT—we use it, too. If you normally come straight to BenchmarkXPRT.com or WebXPRT.com, you may not even realize that Principled Technologies (PT) does a lot more than manage and administer the BenchmarkXPRT Development Community. We’re also the tech world’s leading provider of hands-on testing and related fact-based marketing services. As part of that work, we’re frequent WebXPRT users.

We use the benchmark when we test devices such as Chromebooks, desktops, mobile workstations, and consumer laptops for our clients. (You can see a lot of that work and many of our clients on our public marketing portfolio page.) We run the benchmark for the same reasons that others do—it’s a reliable and easy-to-use tool for measuring how well devices handle web browsing and other web work.

We also sometimes use WebXPRT simply because our clients request it. They request it for the same reason the rest of us like and use it: it’s a great tool. Regardless of job titles and descriptions, most laptop and tablet users surf the web and access web-based applications every day. Because WebXPRT is a browser benchmark, higher scores indicate that a device is likely to deliver a better everyday online experience.

Here are just a few of the recent PT reports that used WebXPRT:

  • In a project for Dell, we compared the performance of a Dell Latitude 7340 Ultralight to that of a 13-inch Apple MacBook Air (2022).
  • In this study for HP, we compared the performance of an HP ZBook Firefly G10, an HP ZBook Power G10, and an HP ZBook Fury G10.
  • Finally, in a set of comparisons for Lenovo, we evaluated the system performance and end-user experience of eight Lenovo ThinkBook, ThinkCentre, and ThinkPad systems along with their Apple counterparts.

All these projects, and many more, show how a variety of companies rely on PT—and on WebXPRT—to help buyers make informed decisions.

P.S. If we publish scores from a client-commissioned study in the WebXPRT 4 results viewer, we will list the source as “PT”, because we did the testing.

By Mark L. Van Name and Justin Greene

WebXPRT benchmarking tips from the XPRT lab

Occasionally, we receive inquiries from XPRT users asking for help determining why two systems with the same hardware configuration are producing significantly different WebXPRT scores. This can happen for many reasons, including different software stacks, but score variability can also result from different testing behaviors and environments. While some degree of variability is normal, these types of questions provide us with an opportunity to talk about some of the basic benchmarking practices we follow in the XPRT lab to produce the most consistent and reliable scores.

Below, we list a few basic best practices you might find useful in your testing. Most of them relate to evaluating browser performance with WebXPRT, but several of these practices apply to other benchmarks as well.

  • Hardware is not the only important factor: Most people know that different browsers produce different performance scores on the same system. Testers are not, however, always aware of shifts in performance between different versions of the same browser. While most updates don’t have a large impact on performance, a few updates have increased (or even decreased) browser performance by a significant amount. For this reason, it’s always important to record and disclose the extended browser version number for each test run. The same principle applies to any other relevant software.
  • Keep a thorough record of system information: We record detailed information about a test system’s key hardware and software components, including full model and version numbers. This information is important not only for disclosure if we later choose to publish a result; it can also help pinpoint system differences that explain why two seemingly identical devices produce very different scores. We also want people to be able to reproduce our results as closely as possible, so we record and disclose more detail than you’ll find in some tech articles and product reviews.
  • Test with clean images: We typically use an out-of-box (OOB) method for testing new devices in the XPRT lab. OOB testing means that, apart from the initial OS and browser updates that users are likely to run after first turning on the device, we change as little as possible before testing. This approach gives the most accurate picture of the performance retail buyers will see when they first purchase the device, before they install additional software. That said, the OOB method is not appropriate for certain types of testing, such as when you want to compare system images that are as close to identical as possible, or when you want to remove as much pre-loaded software as possible.
  • Turn off automatic updates: We do our best to eliminate or minimize app and system updates after initial setup. Some vendors are making it more difficult to turn off updates completely, but you should always double-check update settings before testing.
  • Get a baseline for system processes: Depending on the system and the OS, a significant amount of system-level activity can be going on in the background after you turn it on. As much as possible, we like to wait for a stable baseline (idle time) of system activity before kicking off a test. If we start testing immediately after booting the system, we often see higher variance in the first run before the scores start to tighten up.
  • Use more than one data point: Because of natural variance, our standard practice in the XPRT lab is to publish a score that represents the median of three to five runs, if not more. If you run a benchmark only once and the score differs significantly from other published scores, your result could be an outlier that you would not see again under stable testing conditions or over the course of multiple runs. (A brief sketch of the median calculation follows this list.)
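
To illustrate that last point, here is a small, generic JavaScript helper (not part of WebXPRT or any other XPRT tool) that reports the median of several run scores. The scores in the usage example are invented.

```js
// Generic helper: report the median of several benchmark runs.
// Using the median rather than a single run reduces the impact of outliers.
function medianScore(scores) {
  const sorted = [...scores].sort((a, b) => a - b);
  const mid = Math.floor(sorted.length / 2);
  return sorted.length % 2 !== 0
    ? sorted[mid]                            // odd count: middle value
    : (sorted[mid - 1] + sorted[mid]) / 2;   // even count: average the two middle values
}

// Example: five hypothetical WebXPRT 4 overall scores from the same system.
const runs = [231, 236, 228, 251, 233];
console.log(`Median of ${runs.length} runs: ${medianScore(runs)}`);  // Median of 5 runs: 233
```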


We hope these tips will help make your testing more accurate. If you have any questions about WebXPRT, the other XPRTs, or benchmarking in general, feel free to ask!

Justin

Best practices in benchmarking

From time to time, a tester writes to ask for help determining why they see different WebXPRT scores on two systems that have the same hardware configuration. The scores sometimes differ by a significant percentage. This can happen for many reasons, including different software stacks, but score variability can also result from different testing behavior and environments. While a small amount of variability is normal, these types of questions provide an opportunity to talk about the basic benchmarking practices we follow in the XPRT lab to produce the most consistent and reliable scores.

Below, we list a few basic best practices you might find useful in your testing. Most of them relate to evaluating browser performance with WebXPRT, but several of these practices apply to other benchmarks as well.

  • Test with clean images: We typically use an out-of-box (OOB) method for testing new devices in the XPRT lab. OOB testing means that, apart from the initial OS and browser updates that users are likely to run after first turning on the device, we change as little as possible before testing. This approach gives the most accurate picture of the performance retail buyers will see when they first purchase the device, before they install additional apps and utilities. While the OOB method is not appropriate for certain types of testing, the key is to avoid testing a device that’s bogged down with programs that will influence results.
  • Turn off automatic updates: We do our best to eliminate or minimize app and system updates after initial setup. Some vendors are making it more difficult to turn off updates completely, but you should always double-check update settings before testing.
  • Get a baseline for system processes: Depending on the system and the OS, a significant amount of system-level activity can be going on in the background after you turn it on. As much as possible, we like to wait for a stable (idle) baseline of system activity before kicking off a test. If we start testing immediately after booting the system, we often see higher variance in the first run before the scores start to tighten up.
  • Hardware is not the only important factor: Most people know that different browsers produce different performance scores on the same system. However, testers aren’t always aware of shifts in performance between different versions of the same browser. While most updates don’t have a large impact on performance, a few updates have increased (or even decreased) browser performance by a significant amount. For this reason, it’s always worthwhile to record and disclose the extended browser version number for each test run. The same principle applies to any other relevant software.
  • Use more than one data point: Because of natural variance, our standard practice in the XPRT lab is to publish a score that represents the median from three to five runs, if not more. If you run a benchmark only once, and the score differs significantly from other published scores, your result could be an outlier that you would not see again under stable testing conditions.

We hope these tips will help make your testing more accurate. If you have any questions about the XPRTs, or about benchmarking in general, feel free to ask!

Justin

Understanding concurrent instances in AIXPRT

Over the past few weeks, we’ve discussed several of the key configuration variables in AIXPRT, such as batch size and level of precision. Today, we’re discussing another key variable: number of concurrent instances. In the context of machine learning inference, this refers to how many instances of the network model (ResNet-50, SSD-MobileNet, etc.) the benchmark runs simultaneously.

By default, the toolkits in AIXPRT run one instance at a time and distribute the compute load according to the characteristics of the CPU or GPU under test, as well as any relevant optimizations or accelerators in the toolkit’s reference library. By setting the number of concurrent instances to a number greater than one, a tester can use multiple CPUs or GPUs to run multiple instances of a model at the same time, usually to increase throughput.

With multiple concurrent instances, a tester can leverage additional compute resources to potentially achieve higher throughput without sacrificing latency goals.

In the current version of AIXPRT, testers can run multiple concurrent instances in the OpenVINO, TensorFlow, and TensorRT toolkits. When AIXPRT Community Preview 3 becomes available, this option will extend to the MXNet toolkit. OpenVINO and TensorRT automatically allocate hardware for each instance and don’t let users make manual adjustments. TensorFlow and MXNet require users to manually bind instances to specific hardware. (Manual hardware allocation for multiple instances is more complicated than we can cover today, so we may devote a future blog entry to that topic.)

Setting the number of concurrent instances in AIXPRT

The screenshot below shows part of a sample config file (the same one we used when we discussed batch size and precision). The value in the “concurrent instances” row indicates how many concurrent instances will be operating during the test. In this example, the number is one. To change that value, a tester simply replaces it with the desired number and saves the changes.

[Screenshot: excerpt of a sample AIXPRT config file, with the “concurrent instances” value set to 1]
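
As a purely hypothetical illustration (the exact file format and field names vary by toolkit and AIXPRT version, so treat everything below as an assumption and consult the config files and documentation in your AIXPRT package), a configuration entry that requests two concurrent instances might look something like this:

```json
{
  "batch_sizes": [1, 2, 4, 8],
  "precision": "fp32",
  "concurrent instances": 2
}
```

Changing the value back to 1 would return the toolkit to its default single-instance behavior.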

If you have any questions or comments (about concurrent instances or anything else), please feel free to contact us.

Justin

Transparent goals

Recently, Forbes published an article discussing a new report on phone battery life from Which?, a UK consumer advocacy group. In the report, Which? states that they tested the talk time battery life of 50 phones from five brands. During the tests, phones from three of the brands lasted longer than the manufacturers’ claims, while phones from another brand underperformed by about five percent. The fifth brand’s published battery life numbers were 18 to 51 percent higher than Which? recorded in their tests.

Folks can read the article for more details about the tests and the brands. While the report raises some interesting questions, and the article provides readers with brief test methodology descriptions from Which? and one manufacturer, we don’t know enough about the tests to say which set of claims is correct. Any number of variables related to test workloads or device configuration settings could significantly affect the results. Both parties may be using sound benchmarking principles in good faith, but their test methodologies may not be comparable. As it is, we simply don’t have enough information to evaluate the study.

Whether the issue is battery life or any other important device spec, information conflicts, such as the one that the Forbes article highlights, can leave consumers scratching their heads, trying to decide which sources are worth listening to. At the XPRTs, we believe that the best remedy for this type of problem is to provide complete transparency into our testing methodologies and development process. That’s why our lab techs verify all the hardware specs for each XPRT Weekly Tech Spotlight entry. It’s why we publish white papers explaining the structure of our benchmarks in detail, as well as how the XPRTs calculate performance results. It’s also why we employ an open development community model and make each XPRT’s source code available to community members. When we’re open about how we do things, it encourages the kind of honest dialogue between vendors, journalists, consumers, and community members that serves everyone’s best interests.

If you love tech and share that same commitment to transparency, we’d love for you to join our community, where you can access XPRT source code and previews of upcoming benchmarks. Membership is free for anyone with a verifiable corporate affiliation. If you have any questions about membership or the registration process, please feel free to ask.

Justin
