Troubleshooting

This page describes problems encountered while setting up and administering the server. It is not regularly updated, except perhaps for issues whose cause or fix is still unknown.

Operating system Ubuntu Server 20.04 LTS
Hardware Intel NUC10i3FNH


2024-05-10 Friday

Cannot find Ubuntu kernel headers for building modules
Observations -
Cause Installing a mainline kernel resulted in a kernel that is incompatible with the kernel libraries built for the corresponding apt repositories.
Fix Either downgrade or reinstall the kernel. To downgrade, list the GRUB menu entries with grep 'menuentry \|submenu ' /boot/grub/grub.cfg | cut -f2 -d "'", then set GRUB_DEFAULT="Advanced options...>Ubuntu, with Linux..." in /etc/default/grub and run sudo update-grub to apply the change. Delete the unwanted kernel with sudo apt purge ...
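The downgrade steps can be sketched as two small helpers. This is a minimal sketch, not a definitive procedure: the function names list_grub_entries and grub_default_line are hypothetical, and the titles passed to the second helper must be copied from the first helper's output on the actual machine.

```shell
# List GRUB menu and submenu titles in file order (same pipeline as in the fix,
# written with grep -E). The path is a parameter so a copy can be inspected.
list_grub_entries() {
    grep -E 'menuentry |submenu ' "${1:-/boot/grub/grub.cfg}" | cut -f2 -d "'"
}

# Build the GRUB_DEFAULT line for an entry nested inside a submenu.
# $1 = submenu title, $2 = entry title, joined with ">".
grub_default_line() {
    printf 'GRUB_DEFAULT="%s>%s"\n' "$1" "$2"
}

# After writing the resulting line to /etc/default/grub, regenerate the
# configuration with: sudo update-grub
```

Running list_grub_entries first and pasting the exact titles avoids typos in the long entry names.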

2024-02-12 Monday

journalctl spammed with action 'action-0-builtin:omfile' suspended messages
Observations sudo journalctl -fx shows this message repeatedly, about once a minute: Feb 12 16:44:31 - rsyslogd[6728]: action 'action-0-builtin:omfile' suspended (module 'builtin:omfile'), retry 0. There should be messages before this one giving the reason for suspension. [v8.2112.0 try https://www.rsyslog.com/e/2007 ]
Cause Per a GitHub comment, likely because rsyslog is configured to write to a file it does not have permission for. The configuration in /etc/rsyslog.conf and /etc/rsyslog.d/50-default.conf shows which user writes to which log files. In this instance, the syslog user had no write permission on /var/log.
Fix sudo chmod g+w /var/log
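Before changing permissions, it may help to compare the identity rsyslog runs as against the ownership of the log directory. The sketch below assumes the legacy-style directives ($FileOwner, $PrivDropToUser and so on) that Ubuntu's default /etc/rsyslog.conf typically uses; the function names are hypothetical.

```shell
# Show which user/group rsyslog drops privileges to and creates files as.
rsyslog_identity() {
    grep -E '^\$(FileOwner|FileGroup|PrivDropToUser|PrivDropToGroup)' "${1:-/etc/rsyslog.conf}"
}

# Show owner, group and mode of the log directory for comparison.
log_dir_perms() {
    ls -ld "${1:-/var/log}"
}
```

If the group reported by rsyslog_identity matches the group of /var/log but the mode lacks group write, the chmod g+w fix above applies.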

2023-12-11 Monday

nginx and ssh failed
Observations
Cause
Fix Restart the services.

The first log entry showing a 504 Gateway Timeout occurred when YandexBot (213.180.203.29) polled https://pyuxiang.com/ at 1.06am on 11 Dec. Tracking backwards:

  • Error 499 by PetalBot on Dokuwiki (client stop) at 1.05am
  • Error 499 by Vivaldi user (202.142.162.187) on Dokuwiki (/wiki/syntax) at 12.36am
  • PHP upstream timed out (110: Unknown error) while reading... upstream: fastcgi://unix:/var/run/php/php8.1-fpm.sock started at 1.06am
    • 12.36am remote logoff service logs dropped
    • An RDP probe was triggered:
  • Services:
    • Nothing wrong with PHP service.
    • SSHD logs do not look anomalous.
    • Nginx service as well.

No idea.
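For future incidents, the backwards trace through the access log can be partly automated. A minimal sketch, assuming nginx's default "combined" log format (where the ninth whitespace-separated field is the status code) and the usual /var/log/nginx/access.log path; filter_status is a hypothetical helper name.

```shell
# Print timestamp, client address, status and request path for every line
# whose status is in the comma-separated list given as $1 (e.g. "499,504").
filter_status() {
    awk -v want="$1" '
        BEGIN { n = split(want, codes, ","); for (i = 1; i <= n; i++) keep[codes[i]] = 1 }
        ($9 in keep) { print $4, $1, $9, $7 }
    ' "${2:-/var/log/nginx/access.log}"
}
```

Calling filter_status 499,504 lists client aborts and gateway timeouts in log order, which makes the first failing request easier to spot.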


2023-05-02 Tuesday

High IRQ observed
Observations Running glances shows a high rate of interrupt requests from 18_i2c_designware.1, around 18kreq/s, similarly visible via cat /proc/interrupts (with log 18: 0 0 889547135 0 IR-IO-APIC 18-fasteoi idma64.1, i2c_designware.1). A corresponding [irq/65-i2c-INT3] is present in the process list as well. sudo powertop corroborates with ~200 interrupts per second by [18] idma64.1 and ~1.8ms/s usage by process 55974 irq/65-i2c-INT3, with corresponding 0.5% CPU load on CPU2.
Cause See forum for answer, reproduced below. Bug filed on Ubuntu but regression observed, likely issue with Linux kernel itself.
Fix Running modprobe -r tps6598x to disable the problematic driver solves this temporarily until a power cycle.
My guess is that the IRQ resource is not correct for the PD controller causing you to see irq flood.

The problem is that the ACPI device entry (the node) on this platform has 4 I2CSerialBus resources and 4 IRQ resources. The idea is that the single ACPI device entry can represent up to 4 USB PD controllers. The problem is that there is no way to know which IRQ resource belongs to which I2CSerialBus resource :-(.

And, this is one of those multi-instantiate I2C slave devices with HID INT3515.

The only solution I can think of is that we start maintaining DMI quirk table in drivers/platform/x86/i2c-multi-instantiate.c where we supply the correct i2c to irq resource mapping for every platform that has this device(s).
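The interrupt rate noted in the observations can be estimated directly from /proc/interrupts by sampling the counters twice. This is a rough sketch with hypothetical function names; it sums the per-CPU columns for one IRQ number and does not handle an IRQ that is absent from the file.

```shell
# Sum the per-CPU counters for one IRQ number (e.g. 18).
# The file is a parameter so a saved snapshot can be analysed as well.
irq_total() {
    awk -v irq="$1:" '$1 == irq {
        total = 0
        for (i = 2; i <= NF && $i ~ /^[0-9]+$/; i++) total += $i
        printf "%.0f\n", total
    }' "${2:-/proc/interrupts}"
}

# Approximate interrupts per second by sampling one second apart.
irq_rate() {
    before=$(irq_total "$1")
    sleep 1
    after=$(irq_total "$1")
    echo $(( after - before ))
}
```

Running irq_rate 18 before and after modprobe -r tps6598x gives a quick check of whether unloading the driver stopped the flood.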


2023-01-16 Monday

Docker service failed to start
Observations See logs below.
Cause Likely due to server shutdown during Docker startup, resulting in corrupted files.
Fix Nuked the whole Docker configuration using sudo rm -rf /var/lib/docker, as suggested on GitHub. Because volumes were stored on the local filesystem instead of as native Docker volumes, the damage was fully mitigated.

Logs

See also the following supporting information: [1] [2] [3] [4]


2022-12-26 Monday

Attempting to access crontab raises a segmentation fault
Observations Both crontab -e and sudo crontab -e raise a segmentation fault, with no other error message. pip3 now appears to be affected as well.
Cause Unknown, likely file corruption due to the long uptime of a non-ECC server. According to Stack Overflow, this is attributable to crontab itself, a damaged filesystem, or a hardware problem.
Fix Unknown


Extensive use of swap memory.
Observations Server responses became noticeably more sluggish. Checking top and mem-something showed a low memory footprint, but swap space was being actively used. No tmpfs was occupying space either.
Cause Suspected to be due to long server uptime, resulting in a memory leak, dangling memory pointers, etc.
Fix Restart the server.
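Before restarting, swap usage can be quantified from /proc. A minimal sketch with hypothetical function names: the first helper reads SwapTotal and SwapFree from /proc/meminfo, the second ranks processes by the VmSwap field of /proc/PID/status.

```shell
# Print swap in use, in kB (SwapTotal minus SwapFree).
swap_used_kb() {
    awk '/^SwapTotal:/ { total = $2 }
         /^SwapFree:/  { free = $2 }
         END { printf "%.0f\n", total - free }' "${1:-/proc/meminfo}"
}

# Show the processes holding the most swap (default: top 5).
# Kernel threads have no VmSwap line and are skipped automatically.
top_swap_users() {
    grep -s VmSwap /proc/[0-9]*/status | sort -k2,2nr | head -n "${1:-5}"
}
```

If top_swap_users points at a single long-running process, restarting that process may be enough instead of the whole server.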


Unable to SSH into system.
Observations After a couple of minutes of uptime, the server stops responding to incoming SSH requests. Connections are dropped with either a broken pipe or kex_exchange_identification: read: Connection reset.
Cause Unknown.
Fix Restart the server.


2021-10-17 Sunday

Failure to get docker service running
Observations See below.
Cause No idea.
Fix See below as well.

Observations

Changed a FUSE mount target earlier but did not update /etc/fstab. Tried to restart a Docker container (stopped the container, edited fstab, remounted everything with mount -a, then docker compose up), but it failed with the following error:

$ sudo docker-compose up -d
Starting pigallery2 ... error

ERROR: for pigallery2  Cannot start service pigallery2: failed to create shim: OCI runtime create failed: container_linux.go:380: starting container process caused: process_linux.go:385: applying cgroup configuration for process caused: no cgroup mount found in mountinfo: unknown

ERROR: for pigallery2  Cannot start service pigallery2: failed to create shim: OCI runtime create failed: container_linux.go:380: starting container process caused: process_linux.go:385: applying cgroup configuration for process caused: no cgroup mount found in mountinfo: unknown
ERROR: Encountered errors while bringing up the project.

Quick search indicated potential issue with some cgroupfs mount. Ran the following as suggested here:

# Initially empty directory systemd/
sudo mount -t cgroup -o none,name=systemd cgroup /sys/fs/cgroup/systemd

Did not work. Tried installing the cgroupfs-mount package, but surprisingly it did nothing.

https://unix.stackexchange.com/questions/249425/cant-run-docker-hello-world-mountpoint-for-devices-not-found
sudo apt install cgroupfs-mount

An article suggested containerd was at fault, tried to restart docker service and failed:

sudo systemctl status docker.service
● docker.service - Docker Application Container Engine
     Loaded: loaded (/lib/systemd/system/docker.service; enabled; vendor preset: enabled)
     Active: failed (Result: exit-code) since Sun 2021-10-17 20:51:24 +08; 5s ago
TriggeredBy: ● docker.socket
       Docs: https://docs.docker.com
    Process: 3698790 ExecStart=/usr/bin/dockerd -H fd:// --containerd=/run/containerd/containerd.sock (code=exited, st>
   Main PID: 3698790 (code=exited, status=1/FAILURE)

Oct 17 20:51:24 wasabi systemd[1]: docker.service: Scheduled restart job, restart counter is at 3.
Oct 17 20:51:24 wasabi systemd[1]: Stopped Docker Application Container Engine.
Oct 17 20:51:24 wasabi systemd[1]: docker.service: Start request repeated too quickly.
Oct 17 20:51:24 wasabi systemd[1]: docker.service: Failed with result 'exit-code'.
Oct 17 20:51:24 wasabi systemd[1]: Failed to start Docker Application Container Engine.

Reinstalled docker.io, failed. The journal left the following message:

/var/www/pigallery2$ journalctl -u docker.service --since 2021-10-01
Oct 17 20:51:17 wasabi systemd[1]: Starting Docker Application Container Engine...
Oct 17 20:51:17 wasabi dockerd[3698766]: time="2021-10-17T20:51:17.711556022+08:00" level=info msg="Starting up"
Oct 17 20:51:17 wasabi dockerd[3698766]: time="2021-10-17T20:51:17.711991071+08:00" level=info msg="detected 127.0.0.53 nameserver, assuming systemd-resolved, so using resolv.conf: /run/systemd/resolve/resolv.conf"
Oct 17 20:51:17 wasabi dockerd[3698766]: time="2021-10-17T20:51:17.712782813+08:00" level=info msg="parsed scheme: \"unix\"" module=grpc
Oct 17 20:51:17 wasabi dockerd[3698766]: time="2021-10-17T20:51:17.712808067+08:00" level=info msg="scheme \"unix\" not registered, fallback to default scheme" module=grpc
Oct 17 20:51:17 wasabi dockerd[3698766]: time="2021-10-17T20:51:17.712840056+08:00" level=info msg="ccResolverWrapper: sending update to cc: {[{unix:///run/containerd/containerd.sock  <nil> 0 <nil>}] <nil> <nil>}" module=grpc
Oct 17 20:51:17 wasabi dockerd[3698766]: time="2021-10-17T20:51:17.712881166+08:00" level=info msg="ClientConn switching balancer to \"pick_first\"" module=grpc
Oct 17 20:51:17 wasabi dockerd[3698766]: time="2021-10-17T20:51:17.828462379+08:00" level=info msg="parsed scheme: \"unix\"" module=grpc
Oct 17 20:51:17 wasabi dockerd[3698766]: time="2021-10-17T20:51:17.828550780+08:00" level=info msg="scheme \"unix\" not registered, fallback to default scheme" module=grpc
Oct 17 20:51:17 wasabi dockerd[3698766]: time="2021-10-17T20:51:17.828599676+08:00" level=info msg="ccResolverWrapper: sending update to cc: {[{unix:///run/containerd/containerd.sock  <nil> 0 <nil>}] <nil> <nil>}" module=grpc
Oct 17 20:51:17 wasabi dockerd[3698766]: time="2021-10-17T20:51:17.828631419+08:00" level=info msg="ClientConn switching balancer to \"pick_first\"" module=grpc
Oct 17 20:51:17 wasabi dockerd[3698766]: time="2021-10-17T20:51:17.939964748+08:00" level=info msg="[graphdriver] using prior storage driver: overlay2"
Oct 17 20:51:18 wasabi dockerd[3698766]: time="2021-10-17T20:51:18.195553161+08:00" level=warning msg="Your kernel does not support cgroup memory limit"
Oct 17 20:51:18 wasabi dockerd[3698766]: time="2021-10-17T20:51:18.195623222+08:00" level=warning msg="Unable to find cpu cgroup in mounts"
Oct 17 20:51:18 wasabi dockerd[3698766]: time="2021-10-17T20:51:18.195642082+08:00" level=warning msg="Unable to find blkio cgroup in mounts"
Oct 17 20:51:18 wasabi dockerd[3698766]: time="2021-10-17T20:51:18.195656706+08:00" level=warning msg="Unable to find cpuset cgroup in mounts"
Oct 17 20:51:18 wasabi dockerd[3698766]: time="2021-10-17T20:51:18.195671069+08:00" level=warning msg="Unable to find pids cgroup in mounts"
Oct 17 20:51:18 wasabi dockerd[3698766]: failed to start daemon: Devices cgroup isn't mounted
Oct 17 20:51:18 wasabi systemd[1]: docker.service: Main process exited, code=exited, status=1/FAILURE
Oct 17 20:51:18 wasabi systemd[1]: docker.service: Failed with result 'exit-code'.
Oct 17 20:51:18 wasabi systemd[1]: Failed to start Docker Application Container Engine.
Oct 17 20:51:20 wasabi systemd[1]: docker.service: Scheduled restart job, restart counter is at 1.

Suggests a cgroupfs issue. Running the following did not work, even with containerd stopped using systemctl:

$ sudo cgroupfs-umount
rmdir: failed to remove 'blkio': Read-only file system
rmdir: failed to remove 'cpu': Read-only file system
rmdir: failed to remove 'cpu,cpuacct': Read-only file system
rmdir: failed to remove 'cpuacct': Read-only file system
rmdir: failed to remove 'cpuset': Read-only file system
rmdir: failed to remove 'devices': Read-only file system
rmdir: failed to remove 'freezer': Read-only file system
rmdir: failed to remove 'hugetlb': Read-only file system
rmdir: failed to remove 'memory': Read-only file system
rmdir: failed to remove 'net_cls': Read-only file system
rmdir: failed to remove 'net_cls,net_prio': Read-only file system
rmdir: failed to remove 'net_prio': Read-only file system
rmdir: failed to remove 'perf_event': Read-only file system
rmdir: failed to remove 'pids': Read-only file system
rmdir: failed to remove 'rdma': Read-only file system
rmdir: failed to remove 'systemd': Read-only file system
umount: /sys/fs/cgroup/unified: target is busy.

Ran this again, and now Docker is behaving...?

sudo umount /sys/fs/cgroup/systemd
sudo mount -t cgroup -o none,name=systemd cgroup /sys/fs/cgroup/systemd

In other words, a big shot in the dark. Very painful. This perhaps helped as well:

# https://github.com/docker/cli/issues/2104#issuecomment-535513985
service docker stop
service containerd stop
cgroupfs-umount
cgroupfs-mount
service containerd start
service docker start
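Rather than remounting blindly, the state that dockerd complains about ("Devices cgroup isn't mounted") can be inspected first. A minimal sketch that reads /proc/mounts; the function names are hypothetical, and the controller check only covers cgroup v1 mounts, where the controller name appears among the mount options.

```shell
# List cgroup (v1) and cgroup2 mounts as "mountpoint fstype".
cgroup_mounts() {
    awk '$3 == "cgroup" || $3 == "cgroup2" { print $2, $3 }' "${1:-/proc/mounts}"
}

# Succeed (exit 0) if some cgroup v1 mount carries the given controller,
# e.g. has_cgroup_controller devices
has_cgroup_controller() {
    awk -v c="$1" '$3 == "cgroup" && index("," $4 ",", "," c ",") { found = 1 }
                   END { exit !found }' "${2:-/proc/mounts}"
}
```

Checking has_cgroup_controller devices before and after each remount attempt would show which of the steps above actually restored the mount.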


2021-10-06 Wednesday

(Duplicate) Same issue as in "Docker containers terminated with exit code 143"
Observations Attempts to access the Seafile server return 502 Bad Gateway, i.e. the service is likely down. A quick check of journalctl shows the Docker services terminating at Oct 06 06:20:37. Docker shows that certain services are still up and running.
Cause The Docker container engine restarted, and containers without the restart: always or restart: unless-stopped flag terminate immediately and are not brought back up.
Fix Add the restart: always flag to the Seafile server.
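To find other containers that would stay down after an engine restart, the restart policy of each container can be listed. A sketch under the assumption that the docker CLI is available to the current user; list_no_restart is a hypothetical helper name, and container names are assumed to contain no whitespace.

```shell
# Print the names of containers whose restart policy is "no" (the default).
list_no_restart() {
    for name in $(docker ps -a --format '{{.Names}}'); do
        policy=$(docker inspect --format '{{.HostConfig.RestartPolicy.Name}}' "$name")
        case "$policy" in
            no|"") echo "$name" ;;
        esac
    done
}
```

For a container not managed through Compose, docker update --restart unless-stopped NAME changes the policy in place.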


2021-08-08 Sunday

UFW blocking accesses to port 55711/60511
Observations Saw multiple [UFW BLOCK] entries in the journalctl logs, see below. The source is another server on the same network.
Cause Unknown. This post suggests running a netstat scan, i.e. sudo netstat -tulpen | grep 55711, to identify the process using the port.
Fix Unknown
Aug 07 06:18:19 wasabi kernel: [UFW BLOCK] IN=eno1 OUT= MAC=1c:69:7a:6c:69:27:24:5e:be:30:8c:f4:08:00 SRC=10.99.101.94 DST=10.99.101.2 LEN=445 TOS=0x00 PREC=0x00 TTL=64 ID=50433 DF PROTO=UDP SPT=57164 DPT=60511 LEN=425
Aug 07 06:18:19 wasabi kernel: [UFW BLOCK] IN=eno1 OUT= MAC=1c:69:7a:6c:69:27:24:5e:be:30:8c:f4:08:00 SRC=10.99.101.94 DST=10.99.101.2 LEN=449 TOS=0x00 PREC=0x00 TTL=64 ID=50434 DF PROTO=UDP SPT=57164 DPT=60511 LEN=429
Aug 07 06:18:19 wasabi kernel: [UFW BLOCK] IN=eno1 OUT= MAC=1c:69:7a:6c:69:27:24:5e:be:30:8c:f4:08:00 SRC=10.99.101.94 DST=10.99.101.2 LEN=481 TOS=0x00 PREC=0x00 TTL=64 ID=50435 DF PROTO=UDP SPT=57164 DPT=60511 LEN=461
Aug 07 06:19:19 wasabi kernel: [UFW BLOCK] IN=eno1 OUT= MAC=1c:69:7a:6c:69:27:24:5e:be:30:8c:f4:08:00 SRC=10.99.101.94 DST=10.99.101.2 LEN=445 TOS=0x00 PREC=0x00 TTL=64 ID=51885 DF PROTO=UDP SPT=37476 DPT=55711 LEN=425
Aug 07 06:19:19 wasabi kernel: [UFW BLOCK] IN=eno1 OUT= MAC=1c:69:7a:6c:69:27:24:5e:be:30:8c:f4:08:00 SRC=10.99.101.94 DST=10.99.101.2 LEN=449 TOS=0x00 PREC=0x00 TTL=64 ID=51886 DF PROTO=UDP SPT=37476 DPT=55711 LEN=429
Aug 07 06:19:19 wasabi kernel: [UFW BLOCK] IN=eno1 OUT= MAC=1c:69:7a:6c:69:27:24:5e:be:30:8c:f4:08:00 SRC=10.99.101.94 DST=10.99.101.2 LEN=481 TOS=0x00 PREC=0x00 TTL=64 ID=51887 DF PROTO=UDP SPT=37476 DPT=55711 LEN=461
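Blocked packets like those above can be tallied per source, protocol and destination port to see how regular the traffic is. A minimal sketch that reads kernel log lines on standard input; ufw_block_summary is a hypothetical helper name.

```shell
# Count [UFW BLOCK] lines per "SRC PROTO DPT" triple, sorted by destination port.
ufw_block_summary() {
    awk '/\[UFW BLOCK\]/ {
        src = proto = dpt = ""
        for (i = 1; i <= NF; i++) {
            if ($i ~ /^SRC=/) src = substr($i, 5)
            else if ($i ~ /^PROTO=/) proto = substr($i, 7)
            else if ($i ~ /^DPT=/) dpt = substr($i, 5)
        }
        count[src " " proto " " dpt]++
    }
    END { for (k in count) print count[k], k }' | sort -k4,4n
}
```

For example, journalctl -k | ufw_block_summary prints one line per triple, with the count first.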


Docker containers terminated with exit code 143
Observations Tried to access the Seafile service, which returned 502 Bad Gateway. Ran sudo docker ps -a, which showed Exited (143) 40 hours ago (Saturday 6.30am) for all running Docker containers. Exit code 143 suggests that a SIGTERM was sent. Logs for Jellyfin indicated [22:32:39] [INF] [2] Main: Received a SIGTERM signal, shutting down. The Docker daemon logs from journalctl -u docker are reproduced below. See full journalctl logs.
Cause A Google search for "systemd stops docker" yielded this answer; the same problem seems to have occurred universally 8 months ago on 1 Dec 2020. An unattended upgrade forced the docker.io service to restart. See the following bug report.
Fix Fixed in docker.io version 20.10.7, so it should not occur again (see below for the detailed flow of events). An alternative is to install docker-ce instead.
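The link between exit code 143 and SIGTERM follows from the shell convention that a process killed by signal N reports exit status 128 + N, so 143 - 128 = 15, which is SIGTERM. A small sketch; exit_code_signal is a hypothetical helper name.

```shell
# Map an exit code above 128 to the name of the terminating signal.
exit_code_signal() {
    if [ "$1" -gt 128 ]; then
        kill -l "$(( $1 - 128 ))"
    else
        echo "none"
    fi
}
```

Here exit_code_signal 143 prints TERM, while exit_code_signal 137 prints KILL, which would instead point to a forced kill such as the OOM killer or docker kill.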
journalctl
Aug 07 06:31:52 wasabi systemd[1]: Starting Daily apt upgrade and clean activities...
Aug 07 06:32:23 wasabi apt-helper[2435189]: E: Sub-process nm-online returned an error code (1)
Aug 07 06:32:33 wasabi systemd[1]: Reloading.
Aug 07 06:32:36 wasabi systemd[1]: Reloading.
Aug 07 06:32:37 wasabi systemd[1]: Reloading.
Aug 07 06:32:37 wasabi systemd[1]: Stopping Docker Application Container Engine...
Aug 07 06:32:37 wasabi dockerd[1357]: time="2021-08-07T06:32:37.458394503+08:00" level=info msg="Processing signal 'terminated'"
Aug 07 06:32:39 wasabi dockerd[1357]: time="2021-08-07T06:32:39.326944811+08:00" level=info msg="ignoring event" container=fa533381a27252b36c733e2310cb9c9c9d33752890ae6cdc9b82e1bfd4dc86c0 module=libcontainerd namespace=moby topic=/tasks/delete type="*events.TaskDelete"
Aug 07 06:32:39 wasabi containerd[3472490]: time="2021-08-07T06:32:39.323539065+08:00" level=info msg="shim disconnected" id=fa533381a27252b36c733e2310cb9c9c9d33752890ae6cdc9b82e1bfd4dc86c0
Aug 07 06:32:39 wasabi containerd[3472490]: time="2021-08-07T06:32:39.329096969+08:00" level=warning msg="cleaning up after shim disconnected" id=fa533381a27252b36c733e2310cb9c9c9d33752890ae6cdc9b82e1bfd4dc86c0 namespace=moby
Aug 07 06:32:39 wasabi containerd[3472490]: time="2021-08-07T06:32:39.329112577+08:00" level=info msg="cleaning up dead shim"
Aug 07 06:32:39 wasabi containerd[3472490]: time="2021-08-07T06:32:39.587822051+08:00" level=warning msg="cleanup warnings time=\"2021-08-07T06:32:39+08:00\" level=info msg=\"starting signal loop\" namespace=moby pid=2435612\n"
...
Aug 07 06:32:45 wasabi dockerd[2436072]: time="2021-08-07T06:32:45.936923653+08:00" level=info msg="Daemon has completed initialization"
Aug 07 06:32:45 wasabi systemd[1]: Started Docker Application Container Engine.
Aug 07 06:32:45 wasabi dockerd[2436072]: time="2021-08-07T06:32:45.953215704+08:00" level=info msg="API listen on /run/docker.sock"
Aug 07 06:32:46 wasabi systemd[1]: fwupd-refresh.service: Succeeded.
Aug 07 06:32:46 wasabi systemd[1]: Finished Refresh fwupd metadata and update motd.
Aug 07 06:32:49 wasabi dbus-daemon[837]: [system] Activating via systemd: service name='org.freedesktop.PackageKit' unit='packagekit.service' requested by ':1.1015' (uid=0 pid=2437256 comm="/usr/bin/gdbus call --system --dest org.freedeskto" label="unconfined")
Aug 07 06:32:49 wasabi systemd[1]: Starting PackageKit Daemon...
Aug 07 06:32:50 wasabi PackageKit[2437259]: daemon start
Aug 07 06:32:50 wasabi dbus-daemon[837]: [system] Successfully activated service 'org.freedesktop.PackageKit'
Aug 07 06:32:50 wasabi systemd[1]: Started PackageKit Daemon.
Aug 07 06:32:51 wasabi systemd[1]: certbot.service: Succeeded.
Aug 07 06:32:51 wasabi systemd[1]: Finished Certbot.
Aug 07 06:32:54 wasabi systemd[1]: apt-daily-upgrade.service: Succeeded.
Aug 07 06:32:54 wasabi systemd[1]: Finished Daily apt upgrade and clean activities.
Aug 07 06:37:55 wasabi PackageKit[2437259]: daemon quit
Aug 07 06:37:55 wasabi systemd[1]: packagekit.service: Succeeded.
/var/log/unattended-upgrades/unattended-upgrades.log
2021-08-07 06:32:23,978 INFO Starting unattended upgrades script
2021-08-07 06:32:23,978 INFO Allowed origins are: o=Ubuntu,a=focal, o=Ubuntu,a=focal-security, o=UbuntuESMApps,a=focal-apps-security, o=UbuntuESM,a=focal-infra-security
2021-08-07 06:32:23,978 INFO Initial blacklist:
2021-08-07 06:32:23,978 INFO Initial whitelist (not strict):
2021-08-07 06:32:25,988 INFO Packages that will be upgraded: docker.io
2021-08-07 06:32:25,988 INFO Writing dpkg log to /var/log/unattended-upgrades/unattended-upgrades-dpkg.log
2021-08-07 06:32:54,416 INFO All upgrades installed
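To check quickly whether an unattended upgrade touched a given package, the relevant lines can be pulled out of the log shown above. A minimal sketch assuming the log format reproduced here; upgraded_packages is a hypothetical helper name.

```shell
# Print "date time packages" for each unattended-upgrades run that upgraded something.
upgraded_packages() {
    sed -n 's/^\([0-9-]* [0-9:,]*\) INFO Packages that will be upgraded: \(.*\)$/\1 \2/p' \
        "${1:-/var/log/unattended-upgrades/unattended-upgrades.log}"
}
```

Piping the result through grep docker.io shows every run that upgraded the Docker package, which can then be matched against container exit times.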

Other

Checked changelogs for docker.io package for Focal 20.04. As per original bug report on 1 Dec (above), systemd sending SIGTERM was resolved in 19.03.13.

docker.io (19.03.13-0ubuntu4) hirsute; urgency=medium

  * d/p/do_not_bind_docker_to_containerd.patch: Update docker.io to not
    stop when containerd is upgraded, by using Wants= rather than BindTo=.
    (LP: #1870514)
  * d/rules: Fix docker.io to not restart its service during package
    upgrades, to prevent service downtime from automatic updates via
    unattended-upgrade.
    (LP: #1906364)

 -- Bryce Harrington <bryce@canonical.com>  Fri, 04 Dec 2020 23:02:49 +0000

The previous version of docker.io on the server was 20.10.2-0ubuntu1~20.04.2, which is backported from 20.10.2-0ubuntu1 (hirsute) (found by searching journalctl | grep commit). The new package version below likely triggered the unattended upgrade, since it was published under focal-security. It fixes the regression caused by the dh_systemd_start deprecation.

docker.io (20.10.7-0ubuntu1~20.04.1) focal-security; urgency=medium

  * Backport version 20.10.7-0ubuntu1 from Impish (LP: #1938908).

 -- Lucas Kanashiro <kanashiro@ubuntu.com>  Wed, 04 Aug 2021 16:07:47 -0300

docker.io (20.10.7-0ubuntu1) impish; urgency=medium

  * New upstream release.
    - Among new features and bug fixes, the CVE-2021-21284 and CVE-2021-21285
      were addressed.
  * d/watch: adjust regex to correctly match the tarball files.
  * d/rules: make some improvements.
    - Adjust regex in the build-manpages target due to some upstream changes.
    - Separately install the systemd service and socket.
    - Tell dh_installsystemd to not stop the service during the upgrade.
      The previous implementation worked fine until debhelper compat 10 where
      dh_systemd_start was still a thing. In compat 11, it was deprecated
      which means that piece of code was not called.

 -- Lucas Kanashiro <kanashiro@ubuntu.com>  Tue, 03 Aug 2021 15:58:42 -0300


2021-06-27 Sunday

NUC unable to connect to network after reboot
Observations Ethernet port of NUC only has an irregularly flashing green LED, while amber LED is not switched on. Observed on two occasions to occur around the 1-2 hour mark after rebooting the server (sudo reboot and via physical button). During this state, the server does not connect to the local network.

After rebooting and connecting to an HDMI output, a basic Gnome-based GUI is displayed: only Settings can be opened, and the built-in Byobu Terminal is present but cannot be opened.
Cause Unknown
Fix Installed ubuntu-desktop-minimal using tasksel install ubuntu-desktop-minimal. Unexpected errors appeared for atom and terminal; both were ignored. Disconnected HDMI and keyboard after the desktop interface loaded during reboot. The server has remained operational so far.
kb/intranet/platforms/linux/troubleshooting.txt · Last modified: 8 weeks ago (26 November 2024) by justin