Reliability is one of the key drivers of our product development at FIFTY2. Users of PreonLab can always be sure that new versions have been thoroughly tested before we release them. At the same time, our quality awareness should not slow us down when implementing new features. In this article we show how our development process is structured so that we don’t sacrifice one goal for the other.
We continuously validate the simulation results and the application behavior. This means that every new feature is carefully looked at: our engineering team analyzes the results and validates the physical aspects. New development should also not break existing workflows. There are a lot of different aspects of quality assurance, which in combination make PreonLab an enjoyable user experience.
At FIFTY2, different kinds of software testing are strictly incorporated into different stages of the development process. This way, we minimize the risk of regressions and side effects and ensure reliable simulation results. It is always our goal that the PreonLab version deployed to our users is the best version so far.
Let’s take a step back and look at how our code gets developed. A new release is put together over a timeframe of a few months. During that time our codebase evolves until the new, extended feature set is covered. Of course, all those changes need to be made in a structured way. It is not possible to develop all features independently and put them together at the end; the changes would not fit together well. Serializing the development and implementing one feature after another does not work either, because FIFTY2 is not a one-person show: many features are worked on by different people at the same time.
Figure 1: The main branch contains all peer reviewed changes that were developed in the feature branches.
In our development process, every feature or change is developed in a separate space, a so-called feature branch. Once a change is ready, stable and tested, it is merged back into the main branch, the so-called develop branch. This way, new features, fixes and other changes land in our main branch every day.
Working as a team on one shared codebase can easily result in a lot of problems. Even when people work on different parts, things can go wrong. Let’s take the Statistics system as an example, which will get some new functionality in version 5.1. One developer designed the new feature and made sure it works in all places. Now someone else writes some new code, e.g. for the solver, and wants to use this Statistics module. Even if both changes are well tested and work perfectly on their own, together they might result in code that does not even compile. Of course, such an issue would be very easy to fix for either of the two developers, but now a third developer who just got the latest code from our team repository has to find out why the heck PreonLab is not compiling. And we all agree that he or she should rather work on amazing new features, right?
To avoid the time-consuming manual work of keeping the code consistent and compatible, certain techniques have emerged in the software development industry. A first building block is called Continuous Integration, or CI. It means that as soon as a developer makes a change, be it as small as fixing a typo, and pushes it to our team repository, some basic properties are checked automatically. The important part here is that if everything goes well, no manual work is needed. Developers only need to take action if something broke and, this is also important, they get these notifications before the changes are merged into the main branch.
Figure 2: For each change that was pushed to our team repository, a so-called pipeline is triggered. Each pipeline checks certain properties. If something goes wrong the developer can react before the issue shows up in the main branch.
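To give an idea of what such a pipeline does, here is a small driver script in the style a CI system could invoke for every pushed change. It is only an illustrative sketch, not our actual pipeline configuration; the tooling (clang-format, CMake, CTest) and the file names are assumptions made for this example.

```python
# ci_checks.py -- illustrative sketch of a CI gate, not our actual pipeline.
# Each check returns silently on success; any failure aborts the run with a
# non-zero exit code, so the developer is notified before merging.
import subprocess
import sys

def run(description, command):
    print(f"[CI] {description}: {' '.join(command)}")
    result = subprocess.run(command)
    if result.returncode != 0:
        sys.exit(f"[CI] FAILED: {description}")

if __name__ == "__main__":
    # 1. Check that the sources are formatted correctly (hypothetical file list).
    run("format check", ["clang-format", "--dry-run", "--Werror", "src/solver.cpp"])
    # 2. Make sure the whole code base compiles.
    run("build", ["cmake", "--build", "build", "--config", "Release"])
    # 3. Run the unit test suite.
    run("unit tests", ["ctest", "--test-dir", "build", "--output-on-failure"])
    print("[CI] All checks passed.")
```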
Such CI jobs run roughly 15-25 times on a regular workday for our PreonLab codebase. With each run we check different aspects of our code. The most important one, in terms of the amount of time we save, is that the whole code compiles on all supported platforms. We also check that certain files have the correct format. This includes our C++ code, but also documentation. And finally we run our unit test suite, which is our first level of making sure our code does the right thing. Here we take different pieces of code, so-called units, and make sure they work correctly in isolation.
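As a minimal illustration of the idea behind a unit test, the snippet below checks one small, self-contained function in isolation. It is written in Python for brevity, and the helper and its values are made up for this article; it is not actual PreonLab code.

```python
# Illustrative unit test (pytest); the kinematic helper below is a made-up example.
import pytest

def fall_distance(time_s, gravity=9.81):
    """Distance an object falls from rest after time_s seconds."""
    if time_s < 0:
        raise ValueError("time must be non-negative")
    return 0.5 * gravity * time_s ** 2

def test_fall_distance_after_two_seconds():
    # The unit is checked in isolation against a known analytical value.
    assert fall_distance(2.0) == pytest.approx(19.62)

def test_fall_distance_rejects_negative_time():
    # Invalid input must be reported instead of producing a wrong result.
    with pytest.raises(ValueError):
        fall_distance(-1.0)
```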
Each change is tested by our CI, which ensures a base level of stability. Every night, this state is then compiled and packaged, ready to be used by our Application Engineering team and our partners at AVL. It is even installed on our MPI cluster, so that it can be used in the next simulation as easily as possible. This next level of Continuous Integration is called Continuous Deployment. One day we make a change to our code base; the next day Application Engineering can use it and provide feedback. With each iteration PreonLab gets better and better.
As explained above, we automatically run unit tests for each change in our CI system. It is important to test individual parts in isolation in order to maintain high-quality code. But even if all parts work on their own according to the test specification, they also have to work nicely together.
For our customers, PreonPy is a nice and powerful tool to automate repetitive or complex tasks in scene setup or in the evaluation of simulations. For us, it is also a very valuable tool for writing integration tests. While our unit tests verify single code units, our integration tests cover complete features or even combinations of features. Here we make sure that the result of a simulation stays correct, even if we change the underlying code. We check that old scenes still load correctly. And it allows developers to create features without the fear of breaking existing ones.
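A heavily simplified sketch of such an integration test is shown below. To avoid misquoting the PreonPy API in a blog post, the loading and simulation calls are hidden behind a hypothetical helper module called scene_runner; a real test would use preonpy directly. The scene names, the sensor and the reference value are likewise invented for illustration.

```python
# Illustrative integration test (pytest). `scene_runner` is a hypothetical
# wrapper around PreonPy; scene paths and reference values are made up.
import pytest
from scene_runner import load_scene, simulate, read_sensor  # hypothetical helpers

REFERENCE_VOLUME_L = 1.25   # value recorded when the test was created (made up)
TOLERANCE = 0.01            # allow small numerical deviations

def test_tank_filling_volume_stays_correct():
    # Load a small reference scene that exercises several features together.
    scene = load_scene("scenes/tank_filling.prscene")
    simulate(scene, start_s=0.0, end_s=1.0)
    # Compare an end-of-simulation sensor reading against the stored reference.
    volume = read_sensor(scene, "VolumeSensor_1")
    assert volume == pytest.approx(REFERENCE_VOLUME_L, rel=TOLERANCE)

def test_old_scene_still_loads():
    # Scenes saved with previous versions must keep loading without errors.
    scene = load_scene("scenes/legacy/tank_filling_v4.prscene")
    assert scene is not None
```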
In total, we have roughly 500 individual test cases and the number is steadily growing. We also run them on different platforms, including a variant that runs all simulations with MPI. Our default test set runs for about an hour and is triggered a few times per day. At the very least, it runs right before code changes are merged back into our develop branch (in addition to the regular CI checks). Often, especially for larger features, it runs several times. In order to get fast feedback, it is also possible to run only a subset of tests.
Figure 3: In addition to unit tests that are run automatically, our integration test suite can be triggered. One regular run takes approximately one hour, but it’s also possible to select single tests. Every night our full test set runs for 5 hours.
There are some tests that run even longer and are therefore not part of the default test set. Here we have conflicting goals: on the one hand we want fast feedback, on the other hand we want to run as many tests as possible to catch bugs as early as possible. We run those “slow” tests every night for the main branch. If such a test fails, we can still inform our application engineering team about it and take action on the following day. This strategy works extremely well for us.
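One common way to implement such a split between a fast default set and slow nightly tests is to tag tests and select them at run time. The pytest-style sketch below is only an illustration of the idea; the test names and placeholder bodies are made up and do not reflect our actual test suite.

```python
# Illustrative split between fast and slow tests (pytest); names are made up.
import time
import pytest

def test_small_dam_break():
    # Part of the default set: finishes within seconds (placeholder body).
    time.sleep(0.01)
    assert True

@pytest.mark.slow
def test_full_vehicle_wading():
    # Too long for the default set; selected only by the nightly run (placeholder body).
    time.sleep(0.01)
    assert True
```

With this pattern, a default run could use `pytest -m "not slow"` while the nightly job runs the full set; the `slow` marker would be registered in the pytest configuration. The same mechanism also makes it easy to run only a subset of tests for fast feedback, e.g. via `pytest -k <keyword>`.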
Another part of our test suite are performance tests. The main problem here is that there is no “wrong” result. We know that fast is good and faster is even better, but when exactly is slow “too slow”? Another issue is that, while we strive for deterministic simulation results, there is always some noise in the runtime measurements. This makes the automated assessment of benchmarking results even harder.
For a long time we had hardcoded target values included in our tests. These included some buffer, so that the expected noise does not constantly lead to “failed” tests. This technique already helped a lot: obvious performance regressions got caught early and did not even land in develop. Over time, however, we saw one major problem with this approach. We got notified only if our code got slower, but not when it got faster. Consequently, those target values never got lowered, and performance gains could silently be eaten up again. Figure 4 shows the problem. When we looked at those values in detail and compared them against how our current code performs, we saw speedups of up to a factor of two.
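In code, this first approach amounts to little more than an assertion against a fixed number with a safety margin, roughly like the following sketch; the target value and the buffer are invented for illustration.

```python
# Old-style performance check (illustrative): a fixed target plus a noise buffer.
HARDCODED_TARGET_S = 12.0   # runtime considered acceptable when the test was written (made up)
NOISE_BUFFER = 1.15         # 15% margin so normal run-to-run noise does not fail the test

def check_benchmark(measured_runtime_s):
    # Catches the code getting slower, but never tightens when the code gets faster.
    assert measured_runtime_s <= HARDCODED_TARGET_S * NOISE_BUFFER, (
        f"performance regression: {measured_runtime_s:.1f}s exceeds the allowed bound"
    )
```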
Figure 4: Performance regressions 1 and 2 could be caught by comparing against the hardcoded target value (shown as the blue line). After some time the code got so fast that regression 3 could not be detected. Our current approach compares against a dynamic, history-based target value. It adapts to performance improvements automatically, so that cases like regression 3 can be caught.
To overcome this problem we gave up on the hardcoded target values and implemented a system for tracking those performance metrics. Having a history over time gives us a much more detailed picture of how the performance of specific parts of our code has developed. We can also measure the noise we have in our system. For our goal of automatically finding performance regressions in new code, there are many possibilities. We decided to compare against the mean performance of a two-week window. The allowed deviation then also depends on the noise we measured over time. With this system our benchmarks recalibrate automatically, and achieved gains become the new baseline within a few days. Figure 5 shows an example from our benchmarking system.
Figure 5: Benchmark results gathered over time (lower is better). You can see that at the end of last year and in April we made changes that had an impact on this benchmark. The grey horizontal line shows the average result of the last two weeks; we treat it as our target. To be robust against noise, we incorporate the variation of the results and compare new runs against a value that is a bit higher (shown as the red line).
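A condensed sketch of this idea is shown below: instead of a hardcoded number, the target is derived from the measurement history, and the tolerance scales with the measured noise. The window length, the noise factor and the data layout are assumptions for this example; the real system stores results over a longer period and is more involved.

```python
# Illustrative history-based regression check; constants and data layout are
# assumptions for this sketch, not our actual benchmarking system.
from datetime import datetime, timedelta
from statistics import mean, stdev

WINDOW = timedelta(days=14)   # compare against the last two weeks
NOISE_FACTOR = 3.0            # tolerance scales with the measured run-to-run noise

def is_regression(history, new_runtime_s, now=None):
    """history: list of (timestamp, runtime_s) pairs from previous benchmark runs."""
    now = now or datetime.now()
    recent = [runtime for (timestamp, runtime) in history if now - timestamp <= WINDOW]
    if len(recent) < 2:
        return False  # not enough history yet to judge the new measurement
    baseline = mean(recent)                             # dynamic target (grey line in Figure 5)
    allowed = baseline + NOISE_FACTOR * stdev(recent)   # noise-adjusted bound (red line)
    return new_runtime_s > allowed
```

Because the baseline is recomputed from the recent history, a performance gain automatically tightens the bound within a few days, which is exactly what the hardcoded targets could not do.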
All the above measures already ensure high quality throughout the whole software development process. However, they cannot replace the testing of the software by human beings. Our application engineering team accompanies the planning and the development of PreonLab closely. New features can be tested at each and every development step using so-called Nightly Builds, which reflect the latest state of the software. Furthermore, we deploy dedicated beta versions to a wider range of beta testers ahead of every release. This allows us to get feedback from users with different hardware, knowledge, workflows and a large variety of application challenges.
Finally, every new PreonLab version has to pass a broad range of manual tests defined in a continuously growing protocol. Application engineers and developers join forces to go through the protocol, covering every feature supported by PreonLab. After every round, identified issues are fixed and the process is repeated if necessary.
The development of a specific feature is itself accompanied by creating tests that ensure its stability and functionality.
Even though we take all the above measures, software can never be completely free of bugs. When a bug is reported to us, we take it seriously and make sure that it is treated as quickly and thoroughly as possible. Furthermore, tests are created in order to prevent similar problems from being introduced again in the future.
Thus, the number of automatic and manual tests is continuously increasing in order to ensure that we meet the high expectations of our users worldwide.