Category: Benchmark metrics

Experience is the best teacher

One of the core principles guiding the design of the XPRT tools is that they should reflect the way real-world users use their devices. The XPRTs aim to use applications and workloads that mirror what users actually do and how real applications function. How did we learn how important this is? The hard way: by making mistakes! Here’s one example.

In the 1990s, I was Director of Testing for the Ziff-Davis Benchmark Operation (ZDBOp). The benchmarks ZDBOp created for its technical magazines became the industry standards, because of both their quality and Ziff-Davis’ leadership in the technical trade press.

WebBench, one of the benchmarks ZDBOp developed, measured the performance of early web servers. We worked hard to create a tool that used physical clients and tested web server performance over an actual network. However, we didn’t pay enough attention to how clients actually interacted with the servers. In the first version of WebBench, the clients opened connections to the server, did a small amount of work, closed the connections, and then opened new ones.

When we met with vendors after the release of WebBench, they begged us to change the model. At that time, browsers opened relatively long-lived connections and did lots of work before closing them. Our model was almost the opposite of that. It put vendors in the position of having to choose between coding to give their users good performance and coding to get good WebBench results.
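
To make the contrast concrete, here’s a minimal sketch in Python of the two connection models (using the requests library against a placeholder URL; this is an illustration, not WebBench’s actual code):

    import requests

    URL = "http://server.example/page.html"  # placeholder test server

    # The first WebBench model: open a connection, do a small amount of
    # work, close the connection, and open a new one. Each bare
    # requests.get() call sets up and tears down its own connection.
    for _ in range(100):
        requests.get(URL)

    # What browsers of the era actually did: open a relatively long-lived
    # (keep-alive) connection and do lots of work before closing it.
    # A Session reuses the underlying connection across requests.
    with requests.Session() as session:
        for _ in range(100):
            session.get(URL)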

Of course, we were horrified by this dilemma, and we worked hard to make the next version of the benchmark more closely reflect the way real browsers interacted with web servers. Subsequent versions of WebBench were much better received.

This is one of the roots from which the XPRT philosophy grew. We have tried to learn and grow from the mistakes we’ve made. We’d love to hear about any of your experiences with performance tools so we can all learn together.

Eric

Tracking device evolution with WebXPRT ’15, part 2

Last week, we used the Apple iPhone as a test case to show how hardware advances are often reflected in benchmark scores over time. When we compared WebXPRT 2015 scores for various iPhone models, we saw a clear trend of progressively higher scores as we moved from phones with an A7 chip to phones with A8, A9, and A10 Fusion chips. Performance increases over time are not surprising, but WebXPRT ’15 scores also showed us that upgrading from an iPhone 6 to an iPhone 6s is likely to have a much greater impact on web-browsing performance than upgrading from an iPhone 6s to an iPhone 7.

This week, we’re revisiting our iPhone test case to see how software updates can boost device performance without any changes in hardware. The original WebXPRT ’15 tests for the iPhone 5s ran on iOS 8.3, and the original tests for the iPhone 6s, 6s Plus, and SE ran on variants of iOS 9. We updated each phone to iOS 10.0.2 and ran several iterations of WebXPRT ’15.

Upgrading from iOS 8.3 to iOS 10 on the iPhone 5s caused a 17% increase in web-browsing performance, as measured by WebXPRT. Upgrading from iOS 9 to iOS 10 on the iPhone 6s, 6s Plus, and SE produced web-browsing performance gains of 2.6%, 3.6%, and 3.1%, respectively.
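
For anyone who wants to check the math, here’s a minimal sketch of the percentage-gain calculation (the scores in the example are hypothetical placeholders, not measurements from our database):

    def pct_gain(before: float, after: float) -> float:
        # Percentage increase from a pre-upgrade score to a post-upgrade one.
        return (after - before) / before * 100.0

    # Hypothetical example: a phone scoring 100 before an OS upgrade and
    # 117 after shows a 17% gain in web-browsing performance.
    print(f"{pct_gain(100, 117):.1f}%")  # 17.0%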

The chart below shows the WebXPRT ’15 scores for a range of iPhones, with each iPhone’s iOS version upgrade noted in parentheses. The dark blue columns on the left represent the original scores, and the light blue columns on the right represent the upgrade scores.

[Chart: WebXPRT ’15 scores for each iPhone before and after its iOS upgrade]

As with last week’s hardware comparison, each score is the median of a range of scores for that device in our database. The scores come from our own testing and from device reviews by popular tech media outlets.
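
For clarity, here’s a minimal sketch of what taking the median of a device’s scores looks like (the run results below are hypothetical placeholders, not entries from our database):

    from statistics import median

    # Hypothetical WebXPRT '15 results for a single device, gathered from
    # our own runs and from published reviews.
    scores = [96, 98, 100, 101, 103]

    print(median(scores))  # 100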

These results reinforce a message we repeat often: many factors other than hardware influence performance. Designing benchmarks that deliver relevant and reliable scores requires taking all of those factors into account.

What insights have you gained recently from WebXPRT ’15 testing? Let us know!

Justin

Tracking device evolution with WebXPRT ’15

The XPRT Spotlight on the Apple iPhone 7 Plus gives us a great opportunity to look at the progression of WebXPRT 2015 scores for the iPhone line and see how hardware and software advances are often reflected in benchmark scores over time. This week, we’ll see how the evolution of Apple’s mobile CPU architecture has boosted web-browsing performance. In a future post, we’ll look at the impact of iOS software updates.

As we’ve discussed in the past, multiple factors can influence benchmark results. While we’re currently using the iPhone as a test case, the same principles apply to all types of devices. We should also note that WebXPRT is an excellent gauge of expected web-browsing performance during real-world tasks, which is different from pure CPU performance measured in isolation.

When we look at WebXPRT ’15 scores in our database, we see that iPhone web-browsing performance has more than doubled in the last three years. In 2013, an iPhone 5s with an Apple A7 chip earned an overall WebXPRT ’15 score of 100. Today, a new iPhone 7 Plus with an A10 Fusion chip earns a score somewhere close to 210. The chart below shows the WebXPRT ’15 scores for a range of iPhones, with each iPhone’s CPU noted in parentheses.

[Chart: WebXPRT ’15 scores for a range of iPhones, by CPU]

Moving forward from the A7 chip in the iPhone 5s to the A8 chip in the iPhone 6 and the A9 chip in the iPhone 6s and SE, we see consistent score increases. The biggest jump, at over 48%, appears in the transition from the A8 to the A9 chip, implying that folks upgrading from an iPhone 6 or 6 Plus to anything newer would notice a huge difference in web performance.

In general, folks upgrading from an A9-based phone (6s, 6s Plus, or SE) to an A10-based phone (7 or 7 Plus) could expect an increase in web performance of over 6.5%.

The scores we list represent the median of a range of scores for each device in our database. These scores come from our own testing, as well as from device reviews from media outlets such as AnandTech, Notebookcheck, and Tom’s Hardware. It’s worth noting that the highest A9 score in our database (AnandTech’s iPhone SE score of 205) overlaps with the lowest A10 Fusion score (Tom’s Hardware of Germany’s iPhone 7 score of 203), so while the improvement in median scores is clear, performance will vary according to individual phones and other factors.

Soon, we’ll revisit our iPhone test case to see how software updates can boost device performance without any changes in hardware. For more details on the newest iPhones, visit the Spotlight comparison page to see how iPhone 7 and 7 Plus specs and WebXPRT scores stack up.

Justin

Doing things a little differently

I enjoyed watching the Apple Event live yesterday. There were some very impressive announcements. (And a few that were not so impressive; the Breathe app would get on my nerves really fast!)

One thing that really struck me was the ability of the iPhone 7 Plus camera to create depth-of-field effects. Some of the photos demonstrated how the phone used machine learning to identify the people in a shot and keep them in focus while blurring the background, creating a shallow depth of field that makes the subjects really stand out. The way we take photos is not the only thing that’s changing: Apple also mentioned that machine learning is part of its QuickType keyboard, helping with “contextual prediction.”

This is only one product announcement, but it’s a reminder that we need to be constantly examining every part of the XPRTs. Recently, we talked a bit about how people will be using their devices in new ways in the coming months, and how we need to develop tests for those new applications. However, we must also stay focused on keeping existing tests fresh. People will keep taking photos, but today’s photo-editing tests may not be relevant a year or two from now.

Were there any announcements yesterday that got you excited? Let us know!

Eric

Apples to apples?

PCMag published a great review of the Opera browser this week. In addition to looking at the many features Opera offers, the review included performance data from multiple benchmarks, covering areas such as hardware graphics acceleration, WebGL performance, memory consumption, and battery life.

Three of the benchmarks have a significant, though not exclusive, focus on JavaScript performance: Google Octane 2.0, JetStream 1.1, and WebXPRT 2015. The three benchmarks did not rank the browsers the same way, and in the past, we’ve discussed some of the reasons why this happens. In addition to differences in the tests themselves, there are sometimes differences in approach that are worth considering.

For example, consider the test descriptions for JetStream 1.1. You’ll immediately notice that these tests are much lower-level than the ones in WebXPRT. However, consider these phrases from a few of the test descriptions:

  • code-first-load “…This test attempts to defeat the browser’s caching capabilities…”
  • splay-latency “Tests the worst-case performance…”
  • zlib “…modified to restrict code caching opportunities…”

While the XPRTs test typical performance for higher-level applications, the tests in JetStream are tweaked to stress devices in very specific ways, some of which are not typical. The information these tests provide can be very useful for engineers and developers, but it may not be as meaningful to the typical user.
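
To illustrate how a typical-performance view and a worst-case view of the same task can diverge, here’s a minimal sketch with a stand-in workload (not JetStream’s actual tests):

    import time
    from statistics import mean

    def workload():
        # Stand-in task; JetStream's real tests exercise JavaScript engines.
        sum(i * i for i in range(50_000))

    latencies = []
    for _ in range(200):
        start = time.perf_counter()
        workload()
        latencies.append(time.perf_counter() - start)

    # A typical-performance metric averages across iterations, while a
    # worst-case metric (like splay-latency's) reports the slowest one.
    print(f"mean:  {mean(latencies) * 1000:.2f} ms")
    print(f"worst: {max(latencies) * 1000:.2f} ms")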

I have to stress that both approaches are valid, but they are doing somewhat different things. There’s a cliché about comparing apples to apples, but not all apples are the same. If you’re making a pie, a Granny Smith would be a good choice, but for snacking, you might be better off with a Red Delicious. Knowing a benchmark’s purpose will help you find the results that are most meaningful to you.

Eric

Getting it right

Back in April, Bill announced that we are working on a cross-platform benchmark. We asked for your thoughts and comments, and the response has been great! We really appreciate all the ideas you’ve shared.

We’ve been using code from MobileXPRT and TouchXPRT as the basis for some experiments. In his post, Bill talked about the difficulty of porting applications, and even with our expertise in that area, the work is proving more difficult than we originally thought. Benchmarks are held to a higher standard than most applications: it’s not enough for the code to run reliably and efficiently; it must also compare the different platforms fairly.

One thing we know for sure: getting it right is going to take a while. However, we owe it to you to make sure that the benchmark is reliable and fair on all platforms it supports. We will, of course, keep you informed as things progress.

In the meantime, keep sending your ideas!

Eric
