The Tale of Troubleshooting: Unstable Builds and Open Source Infrastructure

xkcd comic - Someday ImageMagick will finally break for good and we'll have a long period of scrambling as we try to reassemble civilization from the rubble.

Introduction#

Donald Knuth famously wrote that Premature Optimization is the root of all evil. I, for one, believe that all evil comes from spuriously failing builds. Nothing steals my confidence in a project as quickly as unstable builds alternating between green and red for no reason. This is a story about unstable builds and troubleshooting. More importantly, this story is written to thank all contributors to basic software infrastructure - the infrastructure we all use and take for granted.

Surprise in logs#

Upon logging into Azure Pipelines to review the logs of multiple failed builds, I mentally braced myself for a potentially arduous troubleshooting session. I suspected that a race condition was the culprit that caused non-deterministic outcomes. Therefore, I was surprised to discover the actual reasons for the recent build failures. They were all similar to this:

2023-02-22T13:57:55.6111290Z [ERROR] Plugin org.apache. Maven.plugins:maven-clean-plugin:3.2.0 or one of its dependencies could not be resolved: Failed to read artifact descriptor for org.apache.maven.plugins:maven-clean-plugin:jar:3.2.0: Could not transfer artifact org.apache.maven.plugins:maven-clean-plugin:pom:3.2.0 from/to central (https://repo.maven.apache.org/maven2): Connection reset -> [Help 1]
2023-02-22T13:57:55.6113250Z [ERROR]
2023-02-22T13:57:55.6126660Z [ERROR] To see the full stack trace of the errors, re-run Maven with the -e switch.
2023-02-22T13:57:55.6127480Z [ERROR] Re-run Maven using the -X switch to enable full debug logging.
2023-02-22T13:57:55.6127880Z [ERROR]
2023-02-22T13:57:55.6128720Z [ERROR] For more information about the errors and possible solutions, please read the following articles:
2023-02-22T13:57:55.6129520Z [ERROR] [Help 1] http://cwiki.apache.org/confluence/display/MAVEN/PluginResolutionException

This was the most interesting part: Could not transfer artifact org.apache.maven.plugins:maven-clean-plugin:pom:3.2.0 from/to central (https://repo.maven.apache.org/maven2): Connection reset -> [Help 1]

The error comes from the Apache Maven project, which we use to build QuestDB. Maven complained that it could not download the Maven Clean plugin from the Central repository. If this had just been a one-off error, I would have played it down as a network hiccup. However, we were getting the same error multiple times a day!

The investigation, part I#

I am no Maven expert, so I did what every developer would do when troubleshooting a rather complex piece of technology: I Googled it!

A picture of Yoda saying: "Use Google, You Must. Easy, it is. Answer, it shall provide.

Eventually, I stumbled upon this issue.

It says that Azure Pipelines kill idle connections, which does not play nicely with the habit of Maven to pool and re-use HTTP connections. This is consistent with what I saw in builds logs: Maven started tests, which ran for 20+ minutes and passed successfully. Then Maven wanted to download the Clean Plugin to do something else, but the download attempt failed, which killed the whole build.

There was still one question left unanswered: Why did we start getting this error only recently? Well, I had an explanation for this, too: We kept adding tests, so they ran longer. Azure Pipelines killed connections because the total test duration exceeded the idle connection timeout. It all started to make sense.

I also found a Maven configuration property to restrict the Time-To-Live of connections in the pool. The Maven issue I found says:

"Azure users shall set the TTL to 240 seconds or less."

So I sent a Pull Request to set TTL to 30s, announced it in the company Slack, and felt good about myself :)

A screenshot from Slack where I say: the Maven connection reset issues should be fixed for all new PR builds. Let me know if you experience it again.

It took less than 30 minutes, and we got another failure: A screenshot from Slack where my colleague says: It didn't help

The failure had the same cause as before. The fix did not work. Back to square one.

The investigation, part II#

Googling blindly combined with a bit of logical reasoning failed to produce results. There was only one answer: WE NEED MOAR LOGS! Running Maven with verbose logs produced an exception with this beautiful stack trace:

    [...]
    2023-02-23T11:20:01.6052340Z Caused by: java.net.SocketException: Connection reset
    2023-02-23T11:20:01.6052680Z at java.net.SocketInputStream.read (SocketInputStream.java:186)
    2023-02-23T11:20:01.6053040Z at java.net.SocketInputStream.read (SocketInputStream.java:140)
    2023-02-23T11:20:01.6053580Z at sun.security.ssl.SSLSocketInputRecord.read (SSLSocketInputRecord.java:484)
    2023-02-23T11:20:01.6054000Z at sun.security.ssl.SSLSocketInputRecord.readHeader (SSLSocketInputRecord.java:478)
    2023-02-23T11:20:01.6055020Z at sun.security.ssl.SSLSocketInputRecord.bytesInCompletePacket (SSLSocketInputRecord.java:70)
    2023-02-23T11:20:01.6055500Z at sun.security.ssl.SSLSocketImpl.readApplicationRecord (SSLSocketImpl.java:1455)
    2023-02-23T11:20:01.6055900Z at sun.security.ssl.SSLSocketImpl$AppInputStream.read (SSLSocketImpl.java:1066)
    2023-02-23T11:20:01.6056310Z at org.apache.http.impl.io.SessionInputBufferImpl.streamRead (SessionInputBufferImpl.java:137)
    2023-02-23T11:20:01.6056730Z at org.apache.http.impl.io.SessionInputBufferImpl.fillBuffer (SessionInputBufferImpl.java:153)
    2023-02-23T11:20:01.6057150Z at org.apache.http.impl.io.SessionInputBufferImpl.readLine (SessionInputBufferImpl.java:280)
    2023-02-23T11:20:01.6059200Z at org.apache.http.impl.conn.DefaultHttpResponseParser.parseHead (DefaultHttpResponseParser.java:138)
    2023-02-23T11:20:01.6059980Z at org.apache.http.impl.conn.DefaultHttpResponseParser.parseHead (DefaultHttpResponseParser.java:56)
    2023-02-23T11:20:01.6060470Z at org.apache.http.impl.io.AbstractMessageParser.parse (AbstractMessageParser.java:259)
    2023-02-23T11:20:01.6060940Z at org.apache.http.impl.DefaultBHttpClientConnection.receiveResponseHeader (DefaultBHttpClientConnection.java:163)
    2023-02-23T11:20:01.6061370Z at org.apache.http.impl.conn.CPoolProxy.receiveResponseHeader (CPoolProxy.java:157)
    2023-02-23T11:20:01.6061790Z at org.apache.http.protocol.HttpRequestExecutor.doReceiveResponse (HttpRequestExecutor.java:273)
    2023-02-23T11:20:01.6062210Z at org.apache.http.protocol.HttpRequestExecutor.execute (HttpRequestExecutor.java:125)
    2023-02-23T11:20:01.6062610Z at org.apache.http.impl.execchain.MainClientExec.execute (MainClientExec.java:272)
    2023-02-23T11:20:01.6063000Z at org.apache.http.impl.execchain.ProtocolExec.execute (ProtocolExec.java:186)
    2023-02-23T11:20:01.6063370Z at org.apache.http.impl.execchain.RetryExec.execute (RetryExec.java:89)
    2023-02-23T11:20:01.6063760Z at org.apache.http.impl.execchain.RedirectExec.execute (RedirectExec.java:110)
    2023-02-23T11:20:01.6064170Z at org.apache.http.impl.client.InternalHttpClient.doExecute (InternalHttpClient.java:185)
    2023-02-23T11:20:01.6064580Z at org.apache.http.impl.client.CloseableHttpClient.execute (CloseableHttpClient.java:72)
    2023-02-23T11:20:01.6065000Z at org.eclipse.aether.transport.http.HttpTransporter.execute (HttpTransporter.java:363)
    2023-02-23T11:20:01.6065400Z at org.eclipse.aether.transport.http.HttpTransporter.implGet (HttpTransporter.java:298)
    2023-02-23T11:20:01.6065820Z at org.eclipse.aether.spi.connector.transport.AbstractTransporter.get (AbstractTransporter.java:72)
    2023-02-23T11:20:01.6066260Z at org.eclipse.aether.connector.basic.BasicRepositoryConnector$GetTaskRunner.runTask (BasicRepositoryConnector.java:466)
    2023-02-23T11:20:01.6066720Z at org.eclipse.aether.connector.basic.BasicRepositoryConnector$TaskRunner.run (BasicRepositoryConnector.java:383)
    2023-02-23T11:20:01.6067180Z at org.eclipse.aether.util.concurrency.RunnableErrorForwarder.lambda$wrap$0 (RunnableErrorForwarder.java:73)
    2023-02-23T11:20:01.6067660Z at org.eclipse.aether.connector.basic.BasicRepositoryConnector$DirectExecutor.execute (BasicRepositoryConnector.java:635)
    2023-02-23T11:20:01.6068110Z at org.eclipse.aether.connector.basic.BasicRepositoryConnector.get (BasicRepositoryConnector.java:280)
    2023-02-23T11:20:01.6068560Z at org.eclipse.aether.internal.impl.DefaultArtifactResolver.performDownloads (DefaultArtifactResolver.java:581)
    2023-02-23T11:20:01.6069000Z at org.eclipse.aether.internal.impl.DefaultArtifactResolver.resolve (DefaultArtifactResolver.java:478)
    2023-02-23T11:20:01.6069440Z at org.eclipse.aether.internal.impl.DefaultArtifactResolver.resolveArtifacts (DefaultArtifactResolver.java:278)
    2023-02-23T11:20:01.6069880Z at org.eclipse.aether.internal.impl.DefaultArtifactResolver.resolveArtifact (DefaultArtifactResolver.java:255)
    2023-02-23T11:20:01.6070350Z at org.apache.maven.repository.internal.DefaultArtifactDescriptorReader.loadPom (DefaultArtifactDescriptorReader.java:228)
    2023-02-23T11:20:01.6071370Z at org.apache.maven.repository.internal.DefaultArtifactDescriptorReader.readArtifactDescriptor (DefaultArtifactDescriptorReader.java:169)
    2023-02-23T11:20:01.6071950Z at org.eclipse.aether.internal.impl.DefaultRepositorySystem.readArtifactDescriptor (DefaultRepositorySystem.java:286)
    2023-02-23T11:20:01.6072420Z at org.apache.maven.plugin.internal.DefaultPluginDependenciesResolver.resolve (DefaultPluginDependenciesResolver.java:104)
    2023-02-23T11:20:01.6072890Z at org.apache.maven.plugin.internal.DefaultMavenPluginManager.getPluginDescriptor (DefaultMavenPluginManager.java:180)
    2023-02-23T11:20:01.6073360Z at org.apache.maven.plugin.internal.DefaultMavenPluginManager.getMojoDescriptor (DefaultMavenPluginManager.java:265)
    2023-02-23T11:20:01.6073810Z at org.apache.maven.plugin.DefaultBuildPluginManager.getMojoDescriptor (DefaultBuildPluginManager.java:214)
    2023-02-23T11:20:01.6074310Z at org.apache.maven.lifecycle.internal.DefaultLifecycleExecutionPlanCalculator.setupMojoExecution (DefaultLifecycleExecutionPlanCalculator.java:155)
    2023-02-23T11:20:01.6074830Z at org.apache.maven.lifecycle.internal.DefaultLifecycleExecutionPlanCalculator.setupMojoExecutions (DefaultLifecycleExecutionPlanCalculator.java:143)
    2023-02-23T11:20:01.6075360Z at org.apache.maven.lifecycle.internal.DefaultLifecycleExecutionPlanCalculator.calculateExecutionPlan (DefaultLifecycleExecutionPlanCalculator.java:122)
    2023-02-23T11:20:01.6075950Z at org.apache.maven.lifecycle.internal.DefaultLifecycleExecutionPlanCalculator.calculateExecutionPlan (DefaultLifecycleExecutionPlanCalculator.java:135)
    2023-02-23T11:20:01.6076450Z at org.apache.maven.lifecycle.internal.builder.BuilderCommon.resolveBuildPlan (BuilderCommon.java:93)
    2023-02-23T11:20:01.6076950Z at org.apache.maven.lifecycle.internal.LifecycleModuleBuilder.buildProject (LifecycleModuleBuilder.java:100)
    2023-02-23T11:20:01.6077460Z at org.apache.maven.lifecycle.internal.LifecycleModuleBuilder.buildProject (LifecycleModuleBuilder.java:73)
    2023-02-23T11:20:01.6078010Z at org.apache.maven.lifecycle.internal.builder.singlethreaded.SingleThreadedBuilder.build (SingleThreadedBuilder.java:53)
    2023-02-23T11:20:01.6078510Z at org.apache.maven.lifecycle.internal.LifecycleStarter.execute (LifecycleStarter.java:118)
    2023-02-23T11:20:01.6078910Z at org.apache.maven.DefaultMaven.doExecute (DefaultMaven.java:260)
    2023-02-23T11:20:01.6079260Z at org.apache.maven.DefaultMaven.doExecute (DefaultMaven.java:172)
    2023-02-23T11:20:01.6080180Z at org.apache.maven.DefaultMaven.execute (DefaultMaven.java:100)
    2023-02-23T11:20:01.6080630Z at org.apache.maven.cli.MavenCli.execute (MavenCli.java:821)
    2023-02-23T11:20:01.6081080Z at org.apache.maven.cli.MavenCli.doMain (MavenCli.java:270)
    2023-02-23T11:20:01.6081450Z at org.apache.maven.cli.MavenCli.main (MavenCli.java:192)
    2023-02-23T11:20:01.6081790Z at jdk.internal.reflect.NativeMethodAccessorImpl.invoke0 (Native Method)
    2023-02-23T11:20:01.6082520Z at jdk.internal.reflect.NativeMethodAccessorImpl.invoke (NativeMethodAccessorImpl.java:62)
    2023-02-23T11:20:01.6083070Z at jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke (DelegatingMethodAccessorImpl.java:43)
    2023-02-23T11:20:01.6083580Z at java.lang.reflect.Method.invoke (Method.java:566)
    2023-02-23T11:20:01.6083980Z at org.codehaus.plexus.classworlds.launcher.Launcher.launchEnhanced (Launcher.java:282)
    2023-02-23T11:20:01.6084390Z at org.codehaus.plexus.classworlds.launcher.Launcher.launch (Launcher.java:225)
    2023-02-23T11:20:01.6084800Z at org.codehaus.plexus.classworlds.launcher.Launcher.mainWithExitCode (Launcher.java:406)
    2023-02-23T11:20:01.6085190Z at org.codehaus.plexus.classworlds.launcher.Launcher.main (Launcher.java:347)

It looks like a typical lengthy Java stack trace. I noticed that that there was nothing about "Wagon". I knew Wagon was a Maven component to download dependencies. Even my first fix attempt configured a Wagon-related property: -Dmaven.wagon.httpconnectionManager.ttlSeconds=30

So it was suspicious that I could not see anything Wagon-related in the exception stack trace. At this point, I resorted to the 3rd level of troubleshooting: Reading the source code! The Maven project consists of many components; for a non-expert, it's not simple to grasp. Eventually, I made an interesting discovery: Maven dependency resolver supports pluggable (network) transports. Digging deeper, I found 2 Transport implementations that can provide HTTP access:

The Wagon implementation I knew about
A new “Native” implementation

The Wagon implementation uses the Apache HTTP Client internally, and it configures the client with a retry handler. There are different retry strategies available. What is important is that the Wagon transport uses three retries by default:

private static CloseableHttpClient createClient()
    {
        return HttpClientBuilder.create() //
            .useSystemProperties() //
            .disableConnectionState() //
            .setConnectionManager( httpClientConnectionManager ) //
            .setRetryHandler( createRetryHandler() )
            .setServiceUnavailableRetryStrategy( createServiceUnavailableRetryStrategy() )
            .setDefaultAuthSchemeRegistry( createAuthSchemeRegistry() )
            .setRedirectStrategy( new WagonRedirectStrategy() )
            .build();
    }
[...]

private static HttpRequestRetryHandler createRetryHandler()
  {
      switch ( RETRY_HANDLER_CLASS )
      {
          case "default":
              if ( StringUtils.isEmpty( RETRY_HANDLER_EXCEPTIONS ) )
              {
                  return new DefaultHttpRequestRetryHandler(
                          RETRY_HANDLER_COUNT, RETRY_HANDLER_REQUEST_SENT_ENABLED );
              }
              return new DefaultHttpRequestRetryHandler(
                      RETRY_HANDLER_COUNT, RETRY_HANDLER_REQUEST_SENT_ENABLED, getNonRetryableExceptions() )
              {
              };
          case "standard":
              return new StandardHttpRequestRetryHandler( RETRY_HANDLER_COUNT, RETRY_HANDLER_REQUEST_SENT_ENABLED );
          default:
              try
              {
                  final ClassLoader classLoader = AbstractHttpClientWagon.class.getClassLoader();
                  return HttpRequestRetryHandler.class.cast( classLoader.loadClass( RETRY_HANDLER_CLASS )
                                                                        .getConstructor().newInstance() );
              }
              catch ( final Exception e )
              {
                  throw new IllegalArgumentException( e );
              }
      }
  }

The source code of the “Native” transport implementation is directly in the Maven Resolver project. Judging from the class and package names, it is clear that this is the transport I saw in the exception stack traces on the build server. It also uses the Apache HTTP Client internally but does not configure any retrying strategy!

this.client = HttpClientBuilder.create()
        .setUserAgent(userAgent)
        .setDefaultSocketConfig(socketConfig)
        .setDefaultRequestConfig(requestConfig)
        .setDefaultAuthSchemeRegistry(authSchemeRegistry)
        .setConnectionManager(state.getConnectionManager())
        .setConnectionManagerShared(true)
        .setDefaultCredentialsProvider(toCredentialsProvider(server, repoAuthContext, proxy, proxyAuthContext))
        .setProxy(proxy)
        .build();

It appears that Maven is failing quickly instead of attempting to retry when a connection is terminated because the Native transport protocol does not attempt to retry and immediately reports network errors. However, why did this issue appear only recently? I opened Maven 3.9 release notes once again. The paragraph “Potentially Breaking Core Changes” did not have anything suspicious. Luckily, I kept reading and saw this: “Switchable and alternative resolver transports included, with default switched to native transport.” Bingo! This must be it.

Here was my hypothesis: We invoked Maven via the Azure Pipelines Maven Task and we did not specify Maven version explicitly. Hence, it used the default version, which was recently upgraded to Maven 3.9. From the release notes, we learned that Maven 3.9 switched to the new Native transport that does not support retrying. It all started to make sense.

How to test this hypothesis? There were two options:

Downgrade Maven to version pre-3.9
Force Maven 3.9 to use the Wagon transport instead of the new Native transport

While browsing Maven source code, I stumbled upon a strategy for selecting transports:

private static final String MAVEN_RESOLVER_TRANSPORT_KEY = "maven.resolver.transport";
private static final String MAVEN_RESOLVER_TRANSPORT_DEFAULT = "default";
private static final String MAVEN_RESOLVER_TRANSPORT_WAGON = "wagon";
private static final String MAVEN_RESOLVER_TRANSPORT_NATIVE = "native";
private static final String MAVEN_RESOLVER_TRANSPORT_AUTO = "auto";
[...]

Object transport = configProps.getOrDefault(MAVEN_RESOLVER_TRANSPORT_KEY, MAVEN_RESOLVER_TRANSPORT_DEFAULT);
        if (MAVEN_RESOLVER_TRANSPORT_DEFAULT.equals(transport)) {
            // The "default" mode (user did not set anything) from now on defaults to AUTO
        } else if (MAVEN_RESOLVER_TRANSPORT_NATIVE.equals(transport)) {
            // Make sure (whatever extra priority is set) that resolver native is selected
            configProps.put(NATIVE_FILE_TRANSPORTER_PRIORITY_KEY, RESOLVER_MAX_PRIORITY);
            configProps.put(NATIVE_HTTP_TRANSPORTER_PRIORITY_KEY, RESOLVER_MAX_PRIORITY);
        } else if (MAVEN_RESOLVER_TRANSPORT_WAGON.equals(transport)) {
            // Make sure (whatever extra priority is set) that wagon is selected
            configProps.put(WAGON_TRANSPORTER_PRIORITY_KEY, RESOLVER_MAX_PRIORITY);
        } else if (!MAVEN_RESOLVER_TRANSPORT_AUTO.equals(transport)) {
            throw new IllegalArgumentException("Unknown resolver transport '" + transport
                    + "'. Supported transports are: " + MAVEN_RESOLVER_TRANSPORT_WAGON + ", "
                    + MAVEN_RESOLVER_TRANSPORT_NATIVE + ", " + MAVEN_RESOLVER_TRANSPORT_AUTO);
        }

I figured out that forcing the old Wagon transport was as simple as passing -Dmaven.resolver.transport=wagon to my build command. I added the flag and waited eagerly for the build results. And it was a success! The first build passed, then the second, third, and fourth. All builds were green! Such a relief. Our precious builds were deterministic again. Problem Solved! Or not?

Contributing Back#

We all consume open source. Apache Maven itself is open source. It’s so ubiquitous that it is impossible to find a Java developer who has never used it. Maven is a basic software infrastructure that we all take for granted. It’s not backed by a large corporation or by a VC-funded startup. Volunteers maintain it. I felt the least I could do was to report my experience to the Maven team, so they could clarify whether it was an intentional behavior or a bug. I opened a ticket and wanted to move on.

The same evening, I saw one of the Maven maintainers create a new issue in the Maven Resolver project and link it to the Maven issue I created earlier. I took it as a confirmation that the behavior was not intentional and started to poke the Maven Resolver source code once again. Only this time with the intention to fix it.

It turned out that the fix was really easy. It was a matter of creating a retry handler and setting the desired retry count. That was exactly what the old Wagon plugin was doing. A few lines got changed. Then I was wondering how to test it. None of the existing Resolver tests exercises scenarios where a connection to a remote repository fails. The existing tests use a Jetty-based server mimicking a remote repository. So I figured out I could add a fault injection: A Jetty handler which force-closes a predefined number of connections. And then test if the resolver can still access the server. Again, a few lines of code:

private class ConnectionClosingHandler extends AbstractHandler {
    public void handle(String target, Request req, HttpServletRequest request, HttpServletResponse response) {
        if (connectionsToClose.getAndDecrement() > 0) {
            Response jettyResponse = (Response) response;
            jettyResponse.getHttpChannel().getConnection().close();
        }
    }
}

This infrastructure allowed me to write tests similar to this:

@Test
public void testRetryHandler_defaultCount_positive() throws Exception {
  httpServer.setConnectionsToClose(3);
  transporter.peek(new PeekTask(URI.create("repo/file.txt")));
}

Easy-peasy!

I sent the Pull Request and was ready to forget about it. Merging a PR to new projects can take a while. To my surprise, the maintainers reviewed the PR less than 1 hour after the submission and merged it the next morning. Wow! This is the way to attract new contributors! I admire all people maintaining projects like Maven. They do it in their free time, and I reckon they mostly hear from users when something is wrong and when users are upset. They deserve recognition for what they do. Tamas Cservenak and Slawomir Jaranowski helped me with the PR. If you are reading this: Thank you!

What is the moral of the story?

Troubleshooting is fun.
Read release notes carefully.
Be grateful to the maintainers of basic infrastructure.
Contributing back feels good!

The Tale of Troubleshooting: Unstable Builds and Open Source Infrastructure

Jaromir Hamala

Introduction#

Surprise in logs#

The investigation, part I#

The investigation, part II#

Contributing Back#

Join our developer community

Subscribe to our newsletter