
Category: Benchmarks in general

WebXPRT 5: Starting to assemble the pieces

In our last blog post, we shared the exciting news that we’re currently working on WebXPRT 5. In that post, we described some of the ways that WebXPRT may evolve with the release of WebXPRT 5. In today’s post, we’ll revisit some of the points of emphasis from the last post and focus on potential workload changes in a bit more detail.

With any benchmark development project, there are always technical challenges you need to iron out. That is especially true with a cross-platform, browser-based benchmark like WebXPRT. Because we’re in the middle of exploring the technical feasibility of a few of the options we’ll mention, we’re not yet ready to say for certain that all these features will be available in the initial WebXPRT 5 release. We can, however, now paint a clearer picture of the overall direction we’re headed.

In the section below, you’ll find updated info on where we stand with respect to some of the key development focal points we discussed in our last post. If there’s an item from that post or previous posts that we didn’t mention below—such as updating the test harness—it doesn’t mean that we’re dropping that goal. We’re just focusing on workloads today.

One of our key goals with WebXPRT 5 is providing more AI-related workloads. In past blog posts, we’ve discussed the growing importance of local, browser-side AI. With WebXPRT 5, we’re investigating two ways that we can expand WebXPRT’s AI portfolio: 1) updating existing WebXPRT 4 AI-oriented workloads, and 2) adding all-new AI workloads.

Here are some possible ways those AI-related changes may play out in both categories:

Updating existing WebXPRT 4 AI-oriented workloads

  • Splitting the existing Organize Album using AI workload’s timed tasks—face detection and image classification—into two independent workloads.
  • Updating the face detection and image classification tasks with the latest versions of the OpenCV.js computer vision and machine learning libraries.
  • Updating the Caffe deep learning framework for the face detection task.
  • Updating the ONNX-based SqueezeNet machine learning model for the image classification tasks.
  • Updating the version of the Tesseract.js OCR engine that WebXPRT uses in the Encrypt Notes and OCR Scan workload. 

Potentially adding all-new AI workloads (either core or experimental workloads)

  • We’re exploring the idea of including a workload that uses an AI-powered segmentation model to blur the background of a video call.
  • We’re exploring the feasibility of including a local LLM chat workload.
  • We would eventually like to include a WebGPU-based web AI framework for a computer vision workload.

In addition to the goal of adding more AI, we previously discussed the possibility of adding non-AI WebGPU workloads. As a web API, WebGPU enables web-based applications—such as image-based GenAI and inference workloads—to directly access the graphics rendering and computational capabilities of a system’s GPU. In the future, WebXPRT 5 could use that technology to execute complex 3D rendering workloads.

We hope today’s post gives you a better sense of where WebXPRT 5 may be headed. We want to reemphasize that while we are actively investigating the possible changes mentioned above, nothing is set in stone. As the pieces start to fall into place, we’ll provide more information here in the blog.

If you have any questions or comments about WebXPRT 5, please feel free to contact us!

Justin

Check out the new XPRTs around the world infographic!

If you’ve followed the XPRT blog for a while, you know that we occasionally update the community on some of the reach metrics we track by publishing a new version of the “XPRTs around the world” infographic. The metrics we track include completed test runs, benchmark downloads, and mentions of the XPRTs in advertisements, articles, and tech reviews. Gathering this information gives us insight into how many people are using the XPRT tools, and updating the infographic helps readers and community members see the impact the XPRTs are having around the world.

This week, we published a new infographic, which includes the following highlights:

  • The XPRTs have been mentioned more than 13,900 times on over 4,000 unique sites.
  • Those mentions include more than 10,300 articles and reviews.
  • Those mentions originated in over 629 cities located in 67 countries on six continents. New cities of note include Bangalore, India; Donetsk, Ukraine; Lima, Peru; and Santiago, Chile.
  • The BenchmarkXPRT Development Community now includes 230 members from 76 companies and organizations around the world.


In addition to the growth in web mentions and community members, the XPRTs have now delivered more than 520,000 real-world results! We’re grateful for everyone who’s helped us get this far. Your participation is vital to our achieving our goal: to provide benchmark tools that are reliable, relevant, and easy to use.

Justin

Apples and pears vs. oranges and bananas

When people talk about comparing disparate things, they often say that you’re comparing apples and oranges. However, sometimes that expression doesn’t begin to describe the situation.

Recently, Justin wrote about using CrXPRT on systems running Neverware CloudReady OS. In that post, he noted that we couldn’t guarantee that using CrXPRT on CloudReady and Chrome OS systems would be a fair comparison. Not surprisingly, that prompted the question “Why not?”

Here’s the thing: It’s a fair comparison of those software stacks running on those hardware configurations. If everyone accepted that and stopped there, all would be good. However, almost inevitably, people will read more into the scores than is appropriate.

In such a comparison, we’re changing multiple variables at once. We’ve written before about the effect of the software stack on performance. CloudReady and Chrome OS are two different implementations of the Chromium OS, and it’s possible that one is more efficient than the other. If so, that would affect CrXPRT scores. At the same time, the raw performance of the two hardware configurations under test could also differ to a certain degree, which would also affect CrXPRT scores.

Here’s a metaphor: If you measure the effective force at the end of two levers and find a difference, to what do you attribute that difference? If you know the levers are the same length, you can attribute the difference to the amount of applied force. If you know the applied force is identical, you can attribute the difference to the length of the levers. If you lack both of those data points, you can’t know whether the difference is due to the length, the force, or a combination of the two.

With a benchmark, you can run multiple experiments designed to isolate variables and use the results from those experiments to look for trends. For example, we could install both CloudReady OS and Chrome OS on the same Intel-based Chromebook and compare the CrXPRT results. Because that removes hardware differences as a variable, such an experiment would offer some insight into how the two implementations compare. However, because differences in hardware can affect the performance of a given piece of software, this single data point would be of limited value. We could repeat the experiment on a variety of other Intel-based Chromebooks, and other patterns might emerge. If one of the implementations consistently scored higher, that would suggest that it was more efficient than the other, but it still would not be conclusive.
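As a rough sketch of how the results of such an experiment might be tabulated, the snippet below computes a per-device score ratio so that hardware is held constant within each comparison. The device names and scores are placeholders invented purely for illustration; they are not measured CrXPRT results, and the helper name `per_device_ratios` is hypothetical.

```python
from statistics import geometric_mean

# Hypothetical placeholder scores (NOT real measurements):
# device -> {OS implementation: overall score}
scores = {
    "chromebook_a": {"chrome_os": 100.0, "cloudready": 95.0},
    "chromebook_b": {"chrome_os": 150.0, "cloudready": 147.0},
    "chromebook_c": {"chrome_os": 80.0, "cloudready": 84.0},
}


def per_device_ratios(scores, baseline="chrome_os", candidate="cloudready"):
    """Return the candidate/baseline score ratio for each device.

    Both scores in a ratio come from the same hardware, so the ratio
    reflects only the difference between the two software stacks.
    """
    return {
        device: by_os[candidate] / by_os[baseline]
        for device, by_os in scores.items()
    }


ratios = per_device_ratios(scores)

# Ratios are multiplicative, so a geometric mean is a reasonable summary.
overall = geometric_mean(list(ratios.values()))

# A trend is only suggestive if every device points the same way.
consistent = all(r > 1 for r in ratios.values()) or all(
    r < 1 for r in ratios.values()
)

for device, ratio in ratios.items():
    print(f"{device}: {ratio:.2f}")
print(f"geometric mean ratio: {overall:.2f}, consistent direction: {consistent}")
```

With these made-up numbers, two devices favor one implementation and one favors the other, which is exactly the kind of mixed result that should make a tester hesitate before declaring a winner.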

I hope this gives you some idea about why we are cautious about drawing conclusions when comparing results from different sets of hardware running different software stacks.

Eric

Digging deeper

From time to time, we like to revisit the fundamentals of the XPRT approach to benchmark development. Today, we’re discussing the need for testers and benchmark developers to consider the multiple factors that influence benchmark results. For every device we test, all of its hardware and software components have the potential to affect performance, and changing the configuration of those components can significantly change results.

For example, we frequently see significant performance differences between different browsers on the same system. In our recent recap of the XPRT Weekly Tech Spotlight’s first year, we highlighted an example of how testing the same device with the same benchmark can produce different results, depending on the software stack under test. In that instance, the Alienware Steam Machine entry included a WebXPRT 2015 score for each of the two browsers that consumers were likely to use. The first score (356) represented the SteamOS browser app in the SteamOS environment, and the second (441) represented the Iceweasel browser (a Firefox variant) in the Linux-based desktop environment. Including only the first score would have given readers an incomplete picture of the Steam Machine’s web-browsing capabilities, so we thought it was important to include both.
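To put those two scores in perspective, the relative difference can be worked out directly. This is a minimal sketch using the two WebXPRT 2015 scores quoted above; the function name `percent_difference` is our own, chosen for illustration.

```python
def percent_difference(baseline, other):
    """Return how much `other` differs from `baseline`, as a percentage."""
    return (other - baseline) / baseline * 100


steamos_browser_score = 356  # SteamOS browser app, from the post
iceweasel_score = 441        # Iceweasel in the desktop environment, from the post

diff = percent_difference(steamos_browser_score, iceweasel_score)
print(f"Iceweasel scored {diff:.1f}% higher on the same hardware")
```

A gap of roughly 24 percent on identical hardware shows how much of a score can come from the software stack alone.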

We also see performance differences between different versions of the same browser, a fact especially relevant to those who use frequently updated browsers, such as Chrome. Even benchmarks that measure the same general area of performance, for example, web browsing, are usually testing very different things.

OS updates can also have an impact on performance. Consumers might base a purchase on performance or battery life scores and end up with a device that behaves much differently when updated to a new version of Android or iOS, for example.

Other important factors in the software stack include pre-installed software, commonly referred to as bloatware, and the proliferation of apps that sap performance and battery life.

This is a much larger topic than we can cover in the blog. Let the examples we’ve mentioned remind you to think critically about, and dig deeper into, benchmark results. If we see published XPRT scores that differ significantly from our own results, our first question is always “What’s different between the two devices?” Most of the time, the answer becomes clear as we compare hardware and software from top to bottom.

Justin

Another great year

A lot of great stuff happened this year! In addition to releasing new versions of the benchmarks, videos, infographics, and white papers, we released our first-ever German UI and sponsored our first student partnership at North Carolina State University. We visited three continents to promote the XPRTs and saw XPRT results published in six of them (we’re still working on Antarctica).

Perhaps most exciting, we reached our fifth anniversary. Users have downloaded or run the XPRTs over 100,000 times.

As great as the year has been, we are sprinting into 2016. Though I can’t talk about them yet, there are some big pieces of news coming soon. Even sooner, I will be at CES next week. If you would like to talk about the XPRTs or the future of benchmarking, let me know and we’ll find a time to meet.

Whatever your holiday traditions are, I hope you are having a great holiday season. Here’s wishing you all the best in 2016!

Eric

What makes a good benchmark?

As we discussed recently, we’re working on the design document for the next version of MobileXPRT, and we’re really interested in any ideas you may have. However, we haven’t talked much about what makes for a good benchmark test.

The things we measure need to be quantifiable. A reviewer can talk about the realism of game play, or the innovative look of a game, and those are valid observations. However, it is difficult to convert those kinds of subjective impressions to numbers.

The things we measure must also be repeatable. For example, the response time for an online service may depend on the time of day, number of people using the service at the time, network load, and other factors that change over time. You can measure the responsiveness of such services, but doing so requires repeating the test enough times under enough different circumstances to get a representative sample.
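One simple way to judge whether a set of repeated measurements is stable enough to report is to summarize its spread along with its center. The sketch below assumes a list of response times in milliseconds. The sample values are invented for illustration and the `summarize` helper is hypothetical, not part of any XPRT tool.

```python
from statistics import mean, median, stdev


def summarize(samples):
    """Summarize repeated measurements of the same test."""
    avg = mean(samples)
    return {
        "runs": len(samples),
        "median": median(samples),
        "mean": avg,
        # Coefficient of variation: sample standard deviation relative to the mean.
        "cv_percent": stdev(samples) / avg * 100,
    }


# Hypothetical response times in milliseconds (NOT real measurements).
# The last run stands in for a test taken under heavier network load.
samples = [120.0, 118.0, 131.0, 125.0, 122.0, 190.0]

summary = summarize(samples)
print(summary)
```

A high coefficient of variation, as in this made-up sample, is a hint that more runs under more varied conditions are needed before the number means much. The median is less sensitive to the single slow run than the mean is.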

The things we can measure go beyond the speed of the device to include battery life, compatibility with standards, and even fidelity or quality, as with photos or video. BatteryXPRT and CrXPRT test battery life, while the HTML5 tests in WebXPRT are among those that test compatibility. We are currently looking into quality metrics for possible future tools.

I hope this has given you some ideas. If so, let us know!

Eric
