localTimeout error on Nodes

bs1 · 3 August 2023 15:46

Hello!

We updated our farm (20 nodes including workstations) to v.2.2.6 a few days ago, and we are getting a handful of errors every day when executing various commands. For instance, the command “suspend job” often does not work properly (red error message in the bottom right of the interface), and the error details list which node on the farm responded with an error. The error is always “localtimeout:15000”.

It is not always the same node, sometimes it is more than one node, and sometimes it doesn’t happen at all.

Restarting Pulze will sometimes fix the problem, but even that command cannot always be executed properly from another workstation. When that happens, we need to remote into the problem node and manually restart Pulze.

These “localtimeout” errors cause other issues because the command is only partially executed, like in the case of trying to suspend a job. The job will be suspended, but a node with the “localtimeout” error will still try to load the job, leading to more issues because the distributor then tries to assign that node to a different job. And then the whole thing breaks!

Nothing in our network changed, and these errors did not happen before the 2.2.6 update.

Has anyone else had issues like this? Is there anything we can do to mitigate the timeout errors?

Thank you!

peter.sarhidai · 11 August 2023 11:39

Hi @bs1

If you can send us some logs at support@pulze.io, we can help you out and check why you are getting the timeout errors.