Hello everyone! I’m @jye, a QA engineer at Mercari. This post is for Day 24 of Mercari Advent Calendar 2023.
At Mercari, QA engineers not only assist development teams with testing during the development cycle, but are also responsible for automated E2E tests on all platforms (iOS, Android, Web, and API).
Recently, we updated our automated end-to-end (E2E) test system for the Web platform. The old system had several issues, including problems with remote browser connections, a retry mechanism that did not work in certain situations, and test cases missing from the report. In the following sections, I will introduce the changes we made and explain the reasons behind them.
We made two significant changes to the Web E2E test system. First, we migrated our test framework from Jest-playwright to Playwright. Second, we changed the architecture of the remote browsers and the CI platform: instead of running the regression tests on CircleCI against remote browsers deployed by Moon, we now run them on a GitHub Actions self-hosted runner, which is deployed in our internal Kubernetes cluster and uses the browser binaries supported by Playwright.
About the old E2E test system
Architecture diagram for the old E2E system
Originally, we used Jest-playwright and ran it on CircleCI. To connect to our Web dev environment, we would have needed to allow access from CircleCI’s external IPs, but due to security concerns we couldn’t allowlist all of them. The solution we found was Moon, a service that helps deploy browsers in a Kubernetes cluster. CircleCI was only responsible for running the E2E code, which connected to remote browsers inside the internal cluster, so the browsers could access our Web dev environment.
The problems of the old E2E system
We have been using our old E2E system for three years, and it has been incredibly useful to run regression tests before releasing the new version of Mercari Web to production. Additionally, the report assists us in tracking and analyzing the flaky tests with every test run. However, as time passed and the number of test cases increased, we gradually discovered various problems.
1. Jest-playwright is out of date
Playwright has matured over the years. Jest-playwright, however, has slowed down its support for new features, and its maintainers now recommend using native Playwright as the test framework.
When we started building the old E2E system, we chose Jest-playwright because Playwright had limited feature support for writing test cases at that time. Moreover, our developers were already familiar with the popular test framework Jest, which made it quicker to build Jest-like UI tests using Jest-playwright. Since then, Playwright has incorporated more of the commonly used test functions and features for UI E2E testing, so we needed to change frameworks to get more flexibility and optimized features for our E2E tests.
2. Remote browser connection issues
Another issue we encountered was with the remote browsers provided by Moon. Since the browsers are controlled by another service within the cluster, they are not normally launched in large numbers. However, E2E test runs with a large number of cases often require parallel execution, which leads to a large number of connections, and optimizing the pod resources to handle this efficiently is not straightforward. Additionally, each test case has to wait for a browser connection before it can start executing, which slows down the overall E2E test run. Some cases even fail to execute because the browser connection cannot be established within the given timeout.
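To make the queueing effect concrete, here is a small back-of-the-envelope model in TypeScript. It is only a sketch under assumed numbers: the function name simulateWave, the number of browser slots, the session length, and the connection timeout are all hypothetical and are not measurements from our Moon setup.

```typescript
// Simplified model: all workers request a browser at the same moment, and
// only `browserSlots` sessions can run concurrently. Every value used below
// is hypothetical and only illustrates the effect described in the text.
interface WaveResult {
  connected: number;
  timedOut: number;
  waitSeconds: number[]; // wait time of each worker that did connect
}

function simulateWave(
  workers: number,
  browserSlots: number,
  sessionSeconds: number,
  connectTimeoutSeconds: number,
): WaveResult {
  const waitSeconds: number[] = [];
  let timedOut = 0;

  for (let i = 0; i < workers; i++) {
    // Worker i has to wait for floor(i / slots) earlier batches to finish.
    const wait = Math.floor(i / browserSlots) * sessionSeconds;
    if (wait <= connectTimeoutSeconds) {
      waitSeconds.push(wait);
    } else {
      timedOut++; // the connection was never established in time
    }
  }

  return { connected: waitSeconds.length, timedOut, waitSeconds };
}

// 12 parallel workers, 4 browser slots, 40 s per session, 60 s connect timeout.
const wave = simulateWave(12, 4, 40, 60);
console.log(wave);
```

With these assumed values, the first four workers connect immediately, the next four wait 40 seconds, and the last four would have to wait 80 seconds, so they hit the 60-second timeout and never run.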
3. Problematic retry mechanism in certain situations
In the old E2E system we wanted to use the jest.retryTimes option to retry failed tests, but the reporting library we were using, "Jest-allure", only worked with the "Jest-Jasmine2" test runner, which in turn did not support jest.retryTimes. As an alternative, Jest provides a command-line option called --onlyFailures, which runs only the cases that failed in the previous run, based on the status cache.
For example:
npm run test ||
npm run test -- --onlyFailures ||
npm run test -- --onlyFailures
This option seems like a viable alternative for retrying. However, it has a critical flaw: if a test case fails due to a remote browser connection issue, Jest does not record that test in the status cache. As a result, such test cases are not retried in the subsequent runs with this command-line option.
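The gap can be illustrated with a simplified model. The sketch below is not Jest’s actual implementation; the types and the function names buildStatusCache and selectOnlyFailures are hypothetical and only mirror the behavior described above, where a test that never got a browser connection leaves no record behind.

```typescript
// Simplified model of a status cache; NOT Jest's real internals.
type Recorded = 'passed' | 'failed';
type Outcome = Recorded | 'connectionError';

interface TestRun {
  name: string;
  outcome: Outcome;
}

// Only tests that actually ran get a status; connection errors are dropped.
function buildStatusCache(results: TestRun[]): Map<string, Recorded> {
  const cache = new Map<string, Recorded>();
  for (const { name, outcome } of results) {
    if (outcome !== 'connectionError') {
      cache.set(name, outcome);
    }
  }
  return cache;
}

// Models an "only failures" rerun: pick the tests recorded as failed.
function selectOnlyFailures(cache: Map<string, Recorded>): string[] {
  const failed: string[] = [];
  cache.forEach((status, name) => {
    if (status === 'failed') {
      failed.push(name);
    }
  });
  return failed;
}

// Hypothetical first run: one pass, one real failure, one connection error.
const firstRunCache = buildStatusCache([
  { name: 'login', outcome: 'passed' },
  { name: 'search', outcome: 'failed' },
  { name: 'checkout', outcome: 'connectionError' },
]);

const retryTargets = selectOnlyFailures(firstRunCache);
console.log(retryTargets); // only 'search' is selected for the rerun
```

In this model, 'checkout' never ran, yet it is neither retried nor recorded as a failure, which is the situation we kept running into.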
4. Some test cases were missing in the report
As mentioned previously, we use the report library called "Jest-allure" which generates the report based on the latest test run. This means that if there is a remote browser connection issue during the test run, those test cases will never appear in the report. This can be quite confusing when checking the report. In the worst-case situation, when the Moon environment is unstable, there is a possibility of losing over 50% of the tests in a single end-to-end run. This instability can greatly impact the reliability and completeness of the test results.
Example of a test record missing from the test report
The main challenge and the solution
The most challenging part is not updating the framework or refactoring the code. It’s actually keeping our old E2E system running, as it is an important check before the release and engineers also confirm regression by running the E2E test. The migration will take more than a few days, so we can’t just stop our E2E tests and make everyone wait until the framework migration is done. Additionally, development for the web is ongoing, so we also need to keep our page object elements and test cases up to date during the migration period.
Due to the heavy usage of the E2E tests every week to ensure the stability of our web application in each release, we have made the decision to create a new E2E repository. During the migration period, we will need to update the elements and test cases for both the old and new repositories to maintain their functionality. However, this decision gives us more flexibility to implement all desired changes in the new repository without affecting the current usage of the E2E tests.
The solution to the first issue is relatively straightforward: we just need to rewrite the tests in Playwright’s style and replace Jest-playwright functions with their Playwright equivalents. Once the necessary configuration is in place, we can assign the test cases to team members, who make the required changes and ensure that every test case runs successfully in the new Playwright style.
Regarding the second issue, our CI/CD team has started providing a self-hosted runner that is built within our network. This means we can now use Playwright’s built-in browser binaries and are no longer limited to using Moon. We only need to create GitHub Actions workflows that run our E2E tests on the self-hosted runner.
As for the third issue, since we are now using Playwright, we can easily switch to its built-in retry mechanism. We can achieve this by applying the necessary configuration changes in the corresponding config file.
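As a rough illustration of what "Playwright style" means, here is a minimal spec written for the Playwright test runner. It is a sketch, not one of our real test cases: the URL and the expected title are placeholders. With Jest-playwright, page is a global provided by the preset and tests are written with Jest’s describe and it; with Playwright, test and expect are imported from @playwright/test and page is injected into each test as a fixture.

```typescript
import { test, expect } from '@playwright/test';

// Placeholder target: example.com stands in for the real application.
test.describe('top page', () => {
  // `page` is a fixture created for each test, not a shared global.
  test('shows the expected title', async ({ page }) => {
    await page.goto('https://example.com/');

    // Web-first assertion: waits until the title matches or the
    // assertion times out.
    await expect(page).toHaveTitle(/Example Domain/);
  });
});
```

A file like this is executed with the Playwright test runner (npx playwright test) rather than with Jest.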
Example for playwright.config
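import type { PlaywrightTestConfig } from '@playwright/test';
const config: PlaywrightTestConfig = {
  // Retry each failed test up to two more times
  retries: 2,
};
export default config;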
And finally, the problem of test cases missing from the report is resolved simply by using Playwright’s built-in browser binaries on the self-hosted runner: since there are no more connection issues with Moon, no test cases are dropped from the report. In addition, we plan to leverage the HTML report provided by Playwright to improve the visibility of the test results. As part of this plan, we also created a CI workflow that stores the report in cloud storage and hosts it as a static page. This way, everyone has easier access to the report and can track the test results.
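Enabling the HTML report is also a configuration change. The fragment below is a minimal sketch rather than our actual config: the output folder name is an assumption (it happens to match Playwright’s default), and the upload to cloud storage is handled by a separate CI step that is not shown here.

```typescript
import type { PlaywrightTestConfig } from '@playwright/test';

const config: PlaywrightTestConfig = {
  retries: 2,
  reporter: [
    // Plain console output while the tests are running.
    ['list'],
    // Static HTML report; 'never' keeps the runner from trying to open
    // a browser window on the CI machine after the run.
    ['html', { outputFolder: 'playwright-report', open: 'never' }],
  ],
};

export default config;
```

Because the generated folder is a self-contained static site, a CI workflow can copy it to a storage bucket as is and serve it from there.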
As a result, not only have we successfully migrated our library, but we have also resolved the issues present in the old E2E test system. The performance has improved, and we have even managed to reduce costs by eliminating the need for the Moon license.
Architecture diagram after the migration
Conclusion
The overall migration took around half a year, because the QA team mainly supports other teams’ development testing and works on automation improvements only in the remaining time.
Although the system update did not involve any cutting-edge technologies, it effectively addressed long-standing problems. With the enhanced capabilities offered by Playwright, we expect our use of the E2E test system to become even more flexible. We hope to have the opportunity to share further improvements and new measures for E2E test systems in the future.
Finally, we would like to thank the CI/CD team for providing the internal self-hosted runner service. It has greatly simplified CI processes that would otherwise require careful consideration of security concerns.
Tomorrow is the final article of the Advent Calendar 2023 by kimuras, CTO of Mercari. Look forward to it!