The case of disappearing PulseAudio daemon

03.11.2018 19:39

Recently I was debugging an unusual problem with the PulseAudio daemon. PulseAudio handles high-level details of audio recording, playback and processing. These days it is used by default in many desktop Linux distributions. The problem I was investigating was on an embedded device using Linux kernel 4.9. I relatively quickly I found a way to reproduce it. However finding the actual cause was surprisingly difficult and led me into the kernel source and learning about the real-time scheduler. I thought I might share this story in case someone else finds themselves in a similar situation.

The basic issue was that PulseAudio daemon occasionally restarted. This reset some run-time configuration that should not be lost while the device was operating and also broke connections to clients that were talking to the daemon. Restarts were seemingly connected to the daemon or the CPU being under load. For example, the daemon would restart if many audio files were played simultaneously on it. I could reproduce the problem on PulseAudio 12.0 using a shell script similar to the following:

n=0
while [ $n -lt 100 ]; do
	pacmd play-file foo.wav 0
	n=$(($n+1))
done

This triggers 100 playbacks of the foo.wav file at almost the same instant and would reliably make the daemon restart on the device. However I was sometimes seeing restarts with less than ten simultaneous audio plays. A similar script with 1000 plays would sometimes also cause a restart on my Intel-based laptop using the same PulseAudio version. This made it easier to investigate the problem since I didn't need to do the debugging remotely on the device.

syslog held some clues what was happening. PulseAudio process was apparently being sent the SIGKILL signal. systemd detected that and restarted the service shortly after:

systemd[1]: pulseaudio.service: Main process exited, code=killed, status=9/KILL
systemd[1]: pulseaudio.service: Unit entered failed state.
systemd[1]: pulseaudio.service: Failed with result 'signal'.

However, there were no other clues as to what sent the SIGKILL or why. Kernel log from dmesg had nothing whatsoever related to this. Increasing logging verbosity in systemd and PulseAudio showed nothing relevant. Attaching a debugger and strace to the PulseAudio process showed that the signal was received at seemingly random points in the execution of the daemon. This showed that the problem was not directly related to any specific line of code, but otherwise led me nowhere.

When I searched the web, all suggestions seemed to point to the kernel killing the process due to an out-of-memory condition. The kernel being the cause of the signal seemed reasonable, however OOM condition is usually clearly logged. Another guess was that systemd itself was killing the process after I learned about its resource control feature. This turned out to be a dead end as well.

The first real useful clue was when Gašper gave me a crash course on using perf. After setting up debugging symbols for the kernel and the PulseAudio binary, I used the following:

$ perf_4.9 record -g -a -e 'signal:*'
(trigger the restart here)
$ perf_4.9 script

This printed out a nice backtrace of the code that generated the problematic SIGKILL signal (among noise about uninteresting other signals being sent between processes on the system):

alsa-sink-ALC32 12510 [001] 24535.773432: signal:signal_generate: sig=9 errno=0 code=128 comm=alsa-sink-ALC32 pid=12510 grp=1 res=0
            7fffbae8982d __send_signal+0x80004520223d ([kernel.kallsyms])
            7fffbae8982d __send_signal+0x80004520223d ([kernel.kallsyms])
            7fffbaef0652 run_posix_cpu_timers+0x800045202522 ([kernel.kallsyms])
            7fffbaea7aa9 scheduler_tick+0x800045202079 ([kernel.kallsyms])
            7fffbaefaa80 tick_sched_timer+0x800045202000 ([kernel.kallsyms])
            7fffbaefa480 tick_sched_handle.isra.12+0x800045202020 ([kernel.kallsyms])
            7fffbaefaab8 tick_sched_timer+0x800045202038 ([kernel.kallsyms])
            7fffbaeebbfe __hrtimer_run_queues+0x8000452020de ([kernel.kallsyms])
            7fffbaeec2dc hrtimer_interrupt+0x80004520209c ([kernel.kallsyms])
            7fffbb41b1c7 smp_apic_timer_interrupt+0x800045202047 ([kernel.kallsyms])
            7fffbb419a66 __irqentry_text_start+0x800045202096 ([kernel.kallsyms])

alsa-sink-ALC32 is the thread in the PulseAudio daemon that is handling the interface between the daemon and the ALSA driver for the audio hardware. The stack trace shows that the signal was generated in the context of that thread, however the originating code was called from a timer interrupt, not a syscall. Specifically, the run_posix_cpu_timers function seemed to be the culprit. This was consistent with the random debugger results I saw before, since interrupts were not in sync with the code running in the thread.

Some digging later I found the following code that is reached from run_posix_cpu_timers via some static functions. Intermediate static functions probably got optimized away by the compiler and don't appear in the perf stack trace:

if (hard != RLIM_INFINITY &&
    tsk->rt.timeout > DIV_ROUND_UP(hard, USEC_PER_SEC/HZ)) {
	/*
	 * At the hard limit, we just die.
	 * No need to calculate anything else now.
	 */
	__group_send_sig_info(SIGKILL, SEND_SIG_PRIV, tsk);
	return;
}

Now things started to make sense. Linux kernel implements some limits on how much time a thread with real-time scheduling priority can use before cooperatively yielding the CPU to other threads and processes (via a blocking syscall for instance). If a thread hits this time limit it is silently sent the SIGKILL signal by the kernel. Kernel resource limits are documented in the setrlimit man page (the relevant limit here is RLIMIT_RTTIME). The PulseAudio daemon was setting the ALSA thread to real-time priority and it was getting killed under load.

Using real-time scheduling seems to be the default in PulseAudio 12.0 and the time limit set for the process is 200 ms. The limit for a running daemon can be inspected from shell using prlimit:

$ prlimit --pid PID | grep RTTIME
RTTIME     timeout for real-time tasks           200000    200000 microsecs

Details of real-time scheduling can be adjusted in /etc/pulse/daemon.conf, for instance with:

realtime-scheduling = yes
rlimit-rttime = 400000

Just to make sure that an RTTIME over-run indeed produces such symptoms I made the following C program that intentionally triggers it. Running the program showed that indeed the cause for the SIGKILL in this case isn't logged anywhere and produces a similar perf backtrace:

#include <stdio.h>
#include <sched.h>
#include <sys/resource.h>

int main(int argc, char** argv)
{
        struct sched_param sched_param;
        if (sched_getparam(0, &sched_param) < 0) {
                printf("sched_getparam() failed\n");
                return 1;
        }
        sched_param.sched_priority = sched_get_priority_max(SCHED_RR);
        if (sched_setscheduler(0, SCHED_RR, &sched_param)) {
		printf("sched_setscheduler() failed\n");
		return 1;
	}
        printf("Scheduler set to Round Robin with priority %d...\n",
			sched_param.sched_priority);
        fflush(stdout);

	struct rlimit rlimit;
	rlimit.rlim_cur = 500000;
	rlimit.rlim_max = rlimit.rlim_cur;

	if (setrlimit(RLIMIT_RTTIME, &rlimit)) {
		printf("setrlimit() failed\n");
		return 1;
	}
	printf("RTTIME limit set to %ld us...\n",
			rlimit.rlim_max);

	printf("Hogging the CPU...\n");
	while(1);
}

It would be really nice if kernel would log somewhere the reason for this signal, like it does with OOM. It might be that in this particular case developers wanted to avoid calling possibly expensive logging functions from interrupt context. On the other hand, it seems that kernel by default doesn't log any process kills at all due to resource limit over-runs. Failed syscalls can be logged using auditd, but that wouldn't help here as no syscalls actually failed.

As far as this particular PulseAudio application was concerned, there weren't really any perfect solutions. This didn't look like a bug in PulseAudio but rather a fact of life in a constrained environment with real-time tasks. The PulseAudio man page discusses some trade-offs of real-time scheduling (which is nice in hindsight, but you first have to know where to look). In my specific case, there were more or less only three possibilities of how to proceed:

  1. Disable RTTIME limit and accept that PulseAudio might freeze other processes on the device for an arbitrary amount of time,
  2. disable real-time scheduling and accept occasional skips in the audio due to other processes taking the CPU from PulseAudio for too long, or
  3. accept the fact that PulseAudio will restart occasionally and make other software on the device recover from this case as best as possible.

After considering implications to the functionality of the device I went with the last one in the end. I also slightly increased the default RTTIME limit so that restarts would be less common while still having an acceptable maximum response time for other processes.

Posted by Tomaž | Categories: Code

Add a new comment


(No HTML tags allowed. Separate paragraphs with a blank line.)