
NVRM: os_schedule: Attempted to yield the CPU while in atomic or interrupt context

posted Aug 11, 2016, 10:26 PM by Dong Xu   [ updated Aug 14, 2016, 8:35 AM ]
The messages below are from GPU 0 (a GeForce GTX TITAN Z, PCI:0000:03:00) in host pepper.
For a similar report, see https://forums.gentoo.org/viewtopic-p-7588226.html

[   26.328777] nvidia-modeset: Loading NVIDIA Kernel Mode Setting Driver for UNIX platforms  361.42  Tue Mar 22 17:29:54 PDT 2016
[   26.356996] nvidia-modeset: Allocated GPU:0 (GPU-1c39faf7-79ab-e7b6-92b7-bd0332ef322b) @ PCI:0000:03:00.0
[   26.668332] nvidia-modeset: Allocated GPU:1 (GPU-3d2c6622-03ab-277e-d867-a59a7a916a9b) @ PCI:0000:04:00.0
[   26.668794] nvidia-modeset: Freed GPU:1 (GPU-3d2c6622-03ab-277e-d867-a59a7a916a9b) @ PCI:0000:04:00.0
[   26.976963] 8021q: 802.1Q VLAN Support v1.8
[   27.797540] sge_execd[1966]: segfault at ffffffff00000004 ip 000000000042d9c4 sp 00007ffc7941a5f0 error 5 in sge_execd[400000+173000]
[   27.903806] ip_tables: (C) 2000-2006 Netfilter Core Team
[ 1692.240315] nvidia-uvm: Loaded the UVM driver in lite mode, major device number 249
[ 2644.955190] perf samples too long (2505 > 2500), lowering kernel.perf_event_max_sample_rate to 50000
[10671.576042] NVRM: GPU at PCI:0000:03:00: GPU-1c39faf7-79ab-e7b6-92b7-bd0332ef322b
[10671.576052] NVRM: GPU Board Serial Number: 0322714070413
[10671.576055] NVRM: Xid (PCI:0000:03:00): 8, Channel 0000001b
[10673.576014] NVRM: os_schedule: Attempted to yield the CPU while in atomic or interrupt context
[10675.576025] NVRM: os_schedule: Attempted to yield the CPU while in atomic or interrupt context
[10677.576032] NVRM: os_schedule: Attempted to yield the CPU while in atomic or interrupt context
[10679.576040] NVRM: os_schedule: Attempted to yield the CPU while in atomic or interrupt context
[10681.576047] NVRM: os_schedule: Attempted to yield the CPU while in atomic or interrupt context
[10683.576055] NVRM: os_schedule: Attempted to yield the CPU while in atomic or interrupt context
[10685.576061] NVRM: os_schedule: Attempted to yield the CPU while in atomic or interrupt context
[10687.576068] NVRM: os_schedule: Attempted to yield the CPU while in atomic or interrupt context
[10689.576076] NVRM: os_schedule: Attempted to yield the CPU while in atomic or interrupt context
[10691.576082] NVRM: os_schedule: Attempted to yield the CPU while in atomic or interrupt context
[10693.576088] NVRM: os_schedule: Attempted to yield the CPU while in atomic or interrupt context

==================================

[   25.871672] nvidia-modeset: Loading NVIDIA Kernel Mode Setting Driver for UNIX platforms  361.42  Tue Mar 22 17:29:54 PDT 2016
[   25.873099] nvidia-modeset: Allocated GPU:0 (GPU-1c39faf7-79ab-e7b6-92b7-bd0332ef322b) @ PCI:0000:03:00.0
[   26.118360] nvidia-modeset: Allocated GPU:1 (GPU-3d2c6622-03ab-277e-d867-a59a7a916a9b) @ PCI:0000:04:00.0
[   26.118697] nvidia-modeset: Freed GPU:1 (GPU-3d2c6622-03ab-277e-d867-a59a7a916a9b) @ PCI:0000:04:00.0
[   26.214434] 8021q: 802.1Q VLAN Support v1.8
[   27.191510] sge_execd[1787]: segfault at ffffffff00000004 ip 000000000042d9c4 sp 00007ffe6777a770 error 5 in sge_execd[400000+173000]
[   27.598591] ip_tables: (C) 2000-2006 Netfilter Core Team
[   48.341036] nvidia-uvm: Loaded the UVM driver in lite mode, major device number 249
[  759.326638] perf samples too long (2505 > 2500), lowering kernel.perf_event_max_sample_rate to 50000
[92825.976038] NVRM: GPU at PCI:0000:03:00: GPU-1c39faf7-79ab-e7b6-92b7-bd0332ef322b
[92825.976043] NVRM: GPU Board Serial Number: 0322714070413
[92825.976046] NVRM: Xid (PCI:0000:03:00): 8, Channel 00000018
[92827.976014] NVRM: os_schedule: Attempted to yield the CPU while in atomic or interrupt context
[92829.976018] NVRM: os_schedule: Attempted to yield the CPU while in atomic or interrupt context
[92831.976022] NVRM: os_schedule: Attempted to yield the CPU while in atomic or interrupt context
[92833.976025] NVRM: os_schedule: Attempted to yield the CPU while in atomic or interrupt context
[92835.976029] NVRM: os_schedule: Attempted to yield the CPU while in atomic or interrupt context
[92837.976031] NVRM: os_schedule: Attempted to yield the CPU while in atomic or interrupt context
[92839.976032] NVRM: os_schedule: Attempted to yield the CPU while in atomic or interrupt context
[92841.976035] NVRM: os_schedule: Attempted to yield the CPU while in atomic or interrupt context
[92843.976038] NVRM: os_schedule: Attempted to yield the CPU while in atomic or interrupt context
[92845.976043] NVRM: os_schedule: Attempted to yield the CPU while in atomic or interrupt context
[92847.976048] NVRM: os_schedule: Attempted to yield the CPU while in atomic or interrupt context
[92848.124002] BUG: soft lockup - CPU#1 stuck for 22s! [gdesmond:26015]
[92848.124005] Modules linked in: nvidia_uvm(POEX) iptable_filter ip_tables x_tables 8021q garp stp llc mrp nvidia_modeset(POEX) nfsv3 nfs_acl rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd sunrpc fscache af_packet iscsi_ibft iscsi_boot_sysfs msr sr_mod cdrom snd_hda_codec_hdmi nvidia(POEX) snd_hda_codec_realtek snd_hda_intel snd_hda_codec iTCO_wdt gpio_ich iTCO_vendor_support snd_hwdep snd_pcm ppdev snd_page_alloc snd_timer snd coretemp kvm_intel kvm lpc_ich mfd_core dm_mod serio_raw soundcore pcspkr i2c_i801 sata_sil r8169 mii thermal parport_pc parport tpm_infineon shpchp acpi_cpufreq button processor ext4 crc16 mbcache jbd2 sd_mod ata_generic ata_piix ahci libahci ehci_pci libata uhci_hcd ehci_hcd usbcore usb_common sg scsi_mod autofs4
[92848.124006] Supported: No, Proprietary modules are loaded
[92848.124006] CPU: 1 PID: 26015 Comm: gdesmond Tainted: P           OE  X 3.12.49-11-default #1
[92848.124006] Hardware name: O.E.M O.E.M/G41AP/G41AP-S, BIOS 080015  03/21/2011
[92848.124006] task: ffff8801ec9b58c0 ti: ffff880234454000 task.ti: ffff880234454000
[92848.124006] RIP: 0010:[<ffffffffa0a98b13>]  [<ffffffffa0a98b13>] _nv014159rm+0x13/0x40 [nvidia]
[92848.124006] RSP: 0000:ffff88023fc83c68  EFLAGS: 00000212
[92848.124006] RAX: 0000000000000967 RBX: ffffffffa0822a9d RCX: 0000000000000967
[92848.124006] RDX: ffffc90011f00000 RSI: ffff8800b9d4c008 RDI: ffff88022d67b008
[92848.124006] RBP: ffff8800ba117020 R08: ffff8800ba36b6c8 R09: ffff8800ba117030
[92848.124006] R10: ffff8800ba116f80 R11: ffffffffa0823e50 R12: ffff88023fc83bd8
[92848.124006] R13: ffffffff8152b25d R14: 0000000000000000 R15: ffffffff8152c7ab
[92848.124006] FS:  00007f01772178c0(0000) GS:ffff88023fc80000(0000) knlGS:0000000000000000
[92848.124006] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[92848.124006] CR2: 00007f018cd15168 CR3: 0000000230199000 CR4: 00000000000407e0
[92848.124006] Stack:
[92848.124006]  ffff8800b9d4c008 ffffffffa0823f68 0000000000000000 0000000000000001
[92848.124006]  ffff8800b9d4c008 ffff880233b90008 0000000000000000 ffffffffa07f3dc4
[92848.124006]  0000000000003300 ffff8800b9d4c008 000000000000003b ffff8802333a3270
[92848.124006] Call Trace:
[92848.124006] Inexact backtrace:

[92848.124006]  <IRQ>

[92848.124006]  [<ffffffffa0823f68>] ? _nv009411rm+0x17c8/0x1a80 [nvidia]
[92848.124006]  [<ffffffffa07f3dc4>] ? _nv008463rm+0xa4/0x170 [nvidia]
[92848.124006]  [<ffffffffa09814b4>] ? _nv015743rm+0xf4/0x4f0 [nvidia]
[92848.124006]  [<ffffffffa0980f36>] ? _nv015736rm+0x1d6/0x390 [nvidia]
[92848.124006]  [<ffffffffa098097f>] ? _nv015737rm+0x69f/0x7c0 [nvidia]
[92848.124006]  [<ffffffffa0984179>] ? _nv015780rm+0xc9/0xe0 [nvidia]
[92848.124006]  [<ffffffffa098413d>] ? _nv015780rm+0x8d/0xe0 [nvidia]
[92848.124006]  [<ffffffffa0986457>] ? _nv015782rm+0x417/0x590 [nvidia]
[92848.124006]  [<ffffffffa0983f99>] ? _nv015781rm+0x69/0x180 [nvidia]
[92848.124006]  [<ffffffffa0a4ffb4>] ? _nv014345rm+0x1b4/0x1210 [nvidia]
[92848.124006]  [<ffffffffa0aa061b>] ? rm_run_rc_callback+0x9b/0xe0 [nvidia]
[92848.124006]  [<ffffffffa0578e20>] ? nvidia_isr_bh+0x70/0x70 [nvidia]
[92848.124006]  [<ffffffffa0578e85>] ? nvidia_rc_timer+0x65/0x90 [nvidia]
[92848.124006]  [<ffffffff81065531>] ? call_timer_fn+0x31/0x100
[92848.124006]  [<ffffffffa0578e20>] ? nvidia_isr_bh+0x70/0x70 [nvidia]
[92848.124006]  [<ffffffff81066379>] ? run_timer_softirq+0x1f9/0x2b0
[92848.124006]  [<ffffffff8105e1e5>] ? __do_softirq+0xe5/0x230
[92848.124006]  [<ffffffff8152bf1c>] ? call_softirq+0x1c/0x30
[92848.124006]  [<ffffffff81004665>] ? do_softirq+0x55/0x90
[92848.124006]  [<ffffffff8105e485>] ? irq_exit+0x95/0xa0
[92848.124006]  [<ffffffff8152c7b5>] ? smp_apic_timer_interrupt+0x45/0x60
[92848.124006]  [<ffffffff8152b25d>] ? apic_timer_interrupt+0x6d/0x80
[92848.124006]  <EOI>

[92848.124006]  [<ffffffff8152a5c9>] ? system_call_fastpath+0x16/0x1b
[92848.124006] Code: ff e8 12 ec fc ff 0f b7 c3 5b c3 66 66 66 66 2e 0f 1f 84 00 00 00 00 00 53 31 db 39 4a 10 76 0f 48 8b 12 c1 e9 02 89 c8 8b 1c 82 <89> d8 5b c3 31 ff e8 12 13 00 00 be 01 00 00 00 48 89 c2 31 ff
[92849.976052] NVRM: os_schedule: Attempted to yield the CPU while in atomic or interrupt context
[92851.976056] NVRM: os_schedule: Attempted to yield the CPU while in atomic or interrupt context
[92853.976060] NVRM: os_schedule: Attempted to yield the CPU while in atomic or interrupt context
[92857.978378] NVRM: os_schedule: Attempted to yield the CPU while in atomic or interrupt context
[92859.978382] NVRM: os_schedule: Attempted to yield the CPU while in atomic or interrupt context
[92861.978384] NVRM: os_schedule: Attempted to yield the CPU while in atomic or interrupt context
[92863.978386] NVRM: os_schedule: Attempted to yield the CPU while in atomic or interrupt context
[92865.978388] NVRM: os_schedule: Attempted to yield the CPU while in atomic or interrupt context
[92867.978390] NVRM: os_schedule: Attempted to yield the CPU while in atomic or interrupt context
[92869.978392] NVRM: os_schedule: Attempted to yield the CPU while in atomic or interrupt context
[92871.978395] NVRM: os_schedule: Attempted to yield the CPU while in atomic or interrupt context
[92873.978399] NVRM: os_schedule: Attempted to yield the CPU while in atomic or interrupt context
[92875.978402] NVRM: os_schedule: Attempted to yield the CPU while in atomic or interrupt context
[92876.124001] BUG: soft lockup - CPU#1 stuck for 22s! [gdesmond:26015]
[92876.124001] Modules linked in: nvidia_uvm(POEX) iptable_filter ip_tables x_tables 8021q garp stp llc mrp nvidia_modeset(POEX) nfsv3 nfs_acl rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd sunrpc fscache af_packet iscsi_ibft iscsi_boot_sysfs msr sr_mod cdrom snd_hda_codec_hdmi nvidia(POEX) snd_hda_codec_realtek snd_hda_intel snd_hda_codec iTCO_wdt gpio_ich iTCO_vendor_support snd_hwdep snd_pcm ppdev snd_page_alloc snd_timer snd coretemp kvm_intel kvm lpc_ich mfd_core dm_mod serio_raw soundcore pcspkr i2c_i801 sata_sil r8169 mii thermal parport_pc parport tpm_infineon shpchp acpi_cpufreq button processor ext4 crc16 mbcache jbd2 sd_mod ata_generic ata_piix ahci libahci ehci_pci libata uhci_hcd ehci_hcd usbcore usb_common sg scsi_mod autofs4
[92876.124001] Supported: No, Proprietary modules are loaded
[92876.124001] CPU: 1 PID: 26015 Comm: gdesmond Tainted: P           OE  X 3.12.49-11-default #1
[92876.124001] Hardware name: O.E.M O.E.M/G41AP/G41AP-S, BIOS 080015  03/21/2011
[92876.124001] task: ffff8801ec9b58c0 ti: ffff880234454000 task.ti: ffff880234454000
[92876.124001] RIP: 0010:[<ffffffffa0a98b13>]  [<ffffffffa0a98b13>] _nv014159rm+0x13/0x40 [nvidia]
[92876.124001] RSP: 0000:ffff88023fc83c68  EFLAGS: 00000212
[92876.124001] RAX: 0000000000000967 RBX: ffffffffa0822a9d RCX: 0000000000000967
[92876.124001] RDX: ffffc90011f00000 RSI: ffff8800b9d4c008 RDI: ffff88022d67b008
[92876.124001] RBP: ffff8800ba117008 R08: ffff8800ba36b6c8 R09: ffff8800ba117018
[92876.124001] R10: ffff8800ba116f68 R11: ffffffffa0823e50 R12: ffff88023fc83bd8
[92876.124001] R13: ffffffff8152b25d R14: 0000000000000000 R15: ffffffff8152c7ab
[92876.124001] FS:  00007f01772178c0(0000) GS:ffff88023fc80000(0000) knlGS:0000000000000000
[92876.124001] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[92876.124001] CR2: 00007f018cd15168 CR3: 0000000230199000 CR4: 00000000000407e0
[92876.124001] Stack:
[92876.124001]  ffff8800b9d4c008 ffffffffa0823f68 0000000000000000 0000000000000001
[92876.124001]  ffff8800b9d4c008 ffff880233b90008 0000000000000000 ffffffffa07f3dc4
[92876.124001]  ffff88023436c008 ffff8800b9d4c008 0000000000000000 ffff880233128808
[92876.124001] Call Trace:
[92876.124001] Inexact backtrace:

[92876.124001]  <IRQ>

[92876.124001]  [<ffffffffa0823f68>] ? _nv009411rm+0x17c8/0x1a80 [nvidia]
[92876.124001]  [<ffffffffa07f3dc4>] ? _nv008463rm+0xa4/0x170 [nvidia]
[92876.124001]  [<ffffffffa0986969>] ? _nv015754rm+0xf9/0x2a0 [nvidia]
[92876.124001]  [<ffffffffa0980f4e>] ? _nv015736rm+0x1ee/0x390 [nvidia]
[92876.124001]  [<ffffffffa098097f>] ? _nv015737rm+0x69f/0x7c0 [nvidia]
[92876.124001]  [<ffffffffa0984179>] ? _nv015780rm+0xc9/0xe0 [nvidia]
[92876.124001]  [<ffffffffa098413d>] ? _nv015780rm+0x8d/0xe0 [nvidia]
[92876.124001]  [<ffffffffa0986457>] ? _nv015782rm+0x417/0x590 [nvidia]
[92876.124001]  [<ffffffffa0983f99>] ? _nv015781rm+0x69/0x180 [nvidia]
[92876.124001]  [<ffffffffa0a4ffb4>] ? _nv014345rm+0x1b4/0x1210 [nvidia]
[92876.124001]  [<ffffffffa0aa061b>] ? rm_run_rc_callback+0x9b/0xe0 [nvidia]
[92876.124001]  [<ffffffffa0578e20>] ? nvidia_isr_bh+0x70/0x70 [nvidia]
[92876.124001]  [<ffffffffa0578e85>] ? nvidia_rc_timer+0x65/0x90 [nvidia]
[92876.124001]  [<ffffffff81065531>] ? call_timer_fn+0x31/0x100
[92876.124001]  [<ffffffffa0578e20>] ? nvidia_isr_bh+0x70/0x70 [nvidia]
[92876.124001]  [<ffffffff81066379>] ? run_timer_softirq+0x1f9/0x2b0
[92876.124001]  [<ffffffff8105e1e5>] ? __do_softirq+0xe5/0x230
[92876.124001]  [<ffffffff8152bf1c>] ? call_softirq+0x1c/0x30
[92876.124001]  [<ffffffff81004665>] ? do_softirq+0x55/0x90
[92876.124001]  [<ffffffff8105e485>] ? irq_exit+0x95/0xa0
[92876.124001]  [<ffffffff8152c7b5>] ? smp_apic_timer_interrupt+0x45/0x60
[92876.124001]  [<ffffffff8152b25d>] ? apic_timer_interrupt+0x6d/0x80
[92876.124001]  <EOI>

[92876.124001]  [<ffffffff8152a5c9>] ? system_call_fastpath+0x16/0x1b
[92876.124001] Code: ff e8 12 ec fc ff 0f b7 c3 5b c3 66 66 66 66 2e 0f 1f 84 00 00 00 00 00 53 31 db 39 4a 10 76 0f 48 8b 12 c1 e9 02 89 c8 8b 1c 82 <89> d8 5b c3 31 ff e8 12 13 00 00 be 01 00 00 00 48 89 c2 31 ff
[92877.978406] NVRM: os_schedule: Attempted to yield the CPU while in atomic or interrupt context
[92879.978410] NVRM: os_schedule: Attempted to yield the CPU while in atomic or interrupt context
[92881.978413] NVRM: os_schedule: Attempted to yield the CPU while in atomic or interrupt context
[92883.978416] NVRM: os_schedule: Attempted to yield the CPU while in atomic or interrupt context
[92885.978421] NVRM: os_schedule: Attempted to yield the CPU while in atomic or interrupt conte
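
Both excerpts follow the same pattern: one Xid line, then an os_schedule warning every two seconds, and in the second excerpt soft lockups as well. A minimal sketch for pulling that summary out of a saved dmesg capture follows. The function name `summarize_dmesg` and the regular expressions are illustrative assumptions based only on the line formats shown above; they are not part of any NVIDIA tool.

```python
import re
from collections import Counter

# Assumed line formats, taken from the excerpts above, e.g.:
# [10671.576055] NVRM: Xid (PCI:0000:03:00): 8, Channel 0000001b
XID_RE = re.compile(
    r"\[\s*(?P<ts>\d+\.\d+)\]\s+NVRM: Xid "
    r"\((?P<pci>PCI:[0-9a-fA-F:.]+)\): (?P<xid>\d+)"
)
YIELD_MSG = "NVRM: os_schedule: Attempted to yield the CPU"
LOCKUP_RE = re.compile(r"BUG: soft lockup - CPU#(\d+) stuck for (\d+)s")


def summarize_dmesg(lines):
    """Return (xid_events, counts) for an iterable of dmesg lines.

    xid_events is a list of (timestamp, pci_address, xid_code) tuples;
    counts tallies os_schedule warnings and soft lockup reports.
    """
    xid_events = []
    counts = Counter()
    for line in lines:
        m = XID_RE.search(line)
        if m:
            xid_events.append((float(m["ts"]), m["pci"], int(m["xid"])))
        if YIELD_MSG in line:
            counts["os_schedule"] += 1
        if LOCKUP_RE.search(line):
            counts["soft_lockup"] += 1
    return xid_events, counts


if __name__ == "__main__":
    # Three lines copied from the excerpts above.
    sample = [
        "[10671.576055] NVRM: Xid (PCI:0000:03:00): 8, Channel 0000001b",
        "[10673.576014] NVRM: os_schedule: Attempted to yield the CPU "
        "while in atomic or interrupt context",
        "[92848.124002] BUG: soft lockup - CPU#1 stuck for 22s! [gdesmond:26015]",
    ]
    events, counts = summarize_dmesg(sample)
    print(events, dict(counts))
```

To run it on a real capture, pass an open file instead of the sample list, for example `summarize_dmesg(open("dmesg.txt"))`, where `dmesg.txt` is a hypothetical file name.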

=================================
gpu_burn also reports a faulty GPU in pepper. Note that it flags GPU 1 as FAULTY, whereas the Xid messages above name GPU 0 (PCI:0000:03:00).


Message from syslogd@pepper at Aug 13 14:27:35 ...
 kernel:[92848.124002] BUG: soft lockup - CPU#1 stuck for 22s! [gdesmond:26015]

Message from syslogd@pepper at Aug 13 14:28:03 ...
 kernel:[92876.124001] BUG: soft lockup - CPU#1 stuck for 22s! [gdesmond:26015]
dxu@pepper:/host/tmp/dxu> tail -100 pepper_gpuburn_10800.log
nohup: ignoring input
GPU 0: GeForce GTX TITAN Z (UUID: GPU-1c39faf7-79ab-e7b6-92b7-bd0332ef322b)
GPU 1: GeForce GTX TITAN Z (UUID: GPU-3d2c6622-03ab-277e-d867-a59a7a916a9b)
10.0%  proc'd: 930699 / 921403   errors: 0 / 2  (WARNING!)  temps: 86 C / 83 C
        Summary at:   Sun Aug 14 05:18:03 MDT 2016

20.0%  proc'd: 1847968 / 1834664   errors: 0 / 0   temps: 86 C / 83 C
        Summary at:   Sun Aug 14 05:36:02 MDT 2016

30.0%  proc'd: 2766580 / 2739783   errors: 0 / 1  (WARNING!)  temps: 86 C / 83 C
        Summary at:   Sun Aug 14 05:54:04 MDT 2016

40.0%  proc'd: 3683849 / 3651687   errors: 0 / 3  (WARNING!)  temps: 86 C / 82 C
        Summary at:   Sun Aug 14 06:12:05 MDT 2016

50.0%  proc'd: 4602461 / 4556806   errors: 0 / 1  (WARNING!)  temps: 86 C / 82 C
        Summary at:   Sun Aug 14 06:30:06 MDT 2016

60.1%  proc'd: 5521073 / 5464639   errors: 0 / 3  (WARNING!)  temps: 86 C / 83 C
        Summary at:   Sun Aug 14 06:48:07 MDT 2016

70.1%  proc'd: 6439685 / 6377900   errors: 0 / 2  (WARNING!)  temps: 86 C / 83 C
        Summary at:   Sun Aug 14 07:06:08 MDT 2016

80.1%  proc'd: 7358297 / 7292518   errors: 0 / 2  (WARNING!)  temps: 86 C / 83 C
        Summary at:   Sun Aug 14 07:24:09 MDT 2016

90.1%  proc'd: 8275566 / 8201708   errors: 0 / 0   temps: 86 C / 82 C
        Summary at:   Sun Aug 14 07:42:10 MDT 2016

100.0%  proc'd: 9186120 / 9104113   errors: 0 / 1  (WARNING!)  temps: 86 C / 83 C
Killing processes.. done

Tested 2 GPUs:
        GPU 0: OK
        GPU 1: FAULTY



dxu@pepper:/host/tmp/dxu> grep WARN pepper_gpuburn_10800.log
10.0%  proc'd: 930699 / 921403   errors: 0 / 2  (WARNING!)  temps: 86 C / 83 C
30.0%  proc'd: 2766580 / 2739783   errors: 0 / 1  (WARNING!)  temps: 86 C / 83 C
40.0%  proc'd: 3683849 / 3651687   errors: 0 / 3  (WARNING!)  temps: 86 C / 82 C
50.0%  proc'd: 4602461 / 4556806   errors: 0 / 1  (WARNING!)  temps: 86 C / 82 C
60.1%  proc'd: 5521073 / 5464639   errors: 0 / 3  (WARNING!)  temps: 86 C / 83 C
70.1%  proc'd: 6439685 / 6377900   errors: 0 / 2  (WARNING!)  temps: 86 C / 83 C
80.1%  proc'd: 7358297 / 7292518   errors: 0 / 2  (WARNING!)  temps: 86 C / 83 C
100.0%  proc'd: 9186120 / 9104113   errors: 0 / 1  (WARNING!)  temps: 86 C / 83 C
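
The grep above only lists the affected lines. To total the error counts per GPU, a small parser along these lines might help. It assumes the progress-line format shown in this log (`errors: A / B`, one slot per GPU) and simply adds up whatever lines it is given; whether each line reports errors for that interval only is an assumption, not something the log confirms. `total_errors` is a hypothetical helper, not part of gpu_burn.

```python
import re

# Assumed progress-line format, as in the log above, e.g.:
# 10.0%  proc'd: 930699 / 921403   errors: 0 / 2  (WARNING!)  temps: 86 C / 83 C
ERRORS_RE = re.compile(r"errors:\s*(\d+(?:\s*/\s*\d+)*)")


def total_errors(lines):
    """Sum the per-GPU error counts over all progress lines given."""
    totals = []
    for line in lines:
        m = ERRORS_RE.search(line)
        if not m:
            continue  # skip "Summary at:" lines, blanks, and so on
        counts = [int(field) for field in m.group(1).split("/")]
        # Grow the totals list if this line mentions more GPUs than seen so far.
        if len(totals) < len(counts):
            totals.extend([0] * (len(counts) - len(totals)))
        for gpu_index, count in enumerate(counts):
            totals[gpu_index] += count
    return totals


if __name__ == "__main__":
    # Two lines copied from the log above.
    sample = [
        "10.0%  proc'd: 930699 / 921403   errors: 0 / 2  (WARNING!)  temps: 86 C / 83 C",
        "40.0%  proc'd: 3683849 / 3651687   errors: 0 / 3  (WARNING!)  temps: 86 C / 82 C",
    ]
    print(total_errors(sample))
```

The result is a list indexed by GPU number in the order gpu_burn prints them, so a non-zero entry at index 1 corresponds to the second column of the `errors:` field.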