  1. Check the error log: the kern log shows hardware errors, as in the output below.
  2. [test] root@crp-san-xenserver07 /var/log 07:18 AM
     root # tail -30 kern.log
     Oct 23 05:47:23 crp-san-xenserver07.test.co.kr kernel: [2908752.592903] {1}[Hardware Error]: It has been corrected by h/w and requires no further action
     Oct 23 05:47:23 crp-san-xenserver07.test.co.kr kernel: [2908752.592904] {1}[Hardware Error]: event severity: corrected
     Oct 23 05:47:23 crp-san-xenserver07.test.co.kr kernel: [2908752.592905] {1}[Hardware Error]: Error 0, type: corrected
     Oct 23 05:47:23 crp-san-xenserver07.test.co.kr kernel: [2908752.592906] {1}[Hardware Error]: fru_text: B1
     Oct 23 05:47:23 crp-san-xenserver07.test.co.kr kernel: [2908752.592907] {1}[Hardware Error]: section_type: memory error
     Oct 23 05:47:23 crp-san-xenserver07.test.co.kr kernel: [2908752.592907] {1}[Hardware Error]: error_status: 0x0000000000000400
     Oct 23 05:47:23 crp-san-xenserver07.test.co.kr kernel: [2908752.592908] {1}[Hardware Error]: physical_address: 0x0000002be8411480
     Oct 23 05:47:23 crp-san-xenserver07.test.co.kr kernel: [2908752.592910] {1}[Hardware Error]: node: 2 card: 0 module: 0 rank: 1 bank: 1 device: 4 row: 22160 column: 88
     Oct 23 05:47:23 crp-san-xenserver07.test.co.kr kernel: [2908752.592910] {1}[Hardware Error]: error_type: 2, single-bit ECC
     Oct 23 05:47:23 crp-san-xenserver07.test.co.kr kernel: [2908752.592911] {1}[Hardware Error]: DIMM location: not present. DMI handle: 0x0000
     Oct 23 05:47:23 crp-san-xenserver07.test.co.kr kernel: [2908752.592948] {2}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 65534
     Oct 23 05:47:23 crp-san-xenserver07.test.co.kr kernel: [2908752.592948] {2}[Hardware Error]: It has been corrected by h/w and requires no further action
     Oct 23 05:47:23 crp-san-xenserver07.test.co.kr kernel: [2908752.592949] {2}[Hardware Error]: event severity: corrected
     Oct 23 05:47:23 crp-san-xenserver07.test.co.kr kernel: [2908752.592949] {2}[Hardware Error]: Error 0, type: corrected
     Oct 23 05:47:23 crp-san-xenserver07.test.co.kr kernel: [2908752.592950] {2}[Hardware Error]: section type: unknown, 330f1140-72a5-11df-9690-0002a5d5c51b
     Oct 23 05:47:23 crp-san-xenserver07.test.co.kr kernel: [2908752.592951] {2}[Hardware Error]: section length: 0x38
     Oct 23 05:47:23 crp-san-xenserver07.test.co.kr kernel: [2908752.592956] {2}[Hardware Error]: 00000000: 01010001 00000000 e8411000 0000002b ..........A.+...
     Oct 23 05:47:23 crp-san-xenserver07.test.co.kr kernel: [2908752.592962] {2}[Hardware Error]: 00000010: 00001000 00000000 e8411fff 0000002b ..........A.+...
     Oct 23 05:47:23 crp-san-xenserver07.test.co.kr kernel: [2908752.592966] {2}[Hardware Error]: 00000020: 00000080 00000000 00000000 00000000 ................
     Oct 23 05:47:23 crp-san-xenserver07.test.co.kr kernel: [2908752.592968] {2}[Hardware Error]: 00000030: 00000000 00000000 ........
     Oct 23 05:47:23 crp-san-xenserver07.test.co.kr kernel: [2908752.592975] mce: [Hardware Error]: Machine check events logged
     Oct 23 05:47:23 crp-san-xenserver07.test.co.kr kernel: [2908752.592998] EDAC skx MC2: HANDLING MCE MEMORY ERROR
     Oct 23 05:47:23 crp-san-xenserver07.test.co.kr kernel: [2908752.592999] EDAC skx MC2: CPU 0: Machine Check Event: 0 Bank 255: 940000000000009f
     Oct 23 05:47:23 crp-san-xenserver07.test.co.kr kernel: [2908752.593000] EDAC skx MC2: TSC 0
     Oct 23 05:47:23 crp-san-xenserver07.test.co.kr kernel: [2908752.593001] EDAC skx MC2: ADDR 2be8411480
     Oct 23 05:47:23 crp-san-xenserver07.test.co.kr kernel: [2908752.593002] EDAC skx MC2: MISC 0
     Oct 23 05:47:23 crp-san-xenserver07.test.co.kr kernel: [2908752.593002] EDAC skx MC2: PROCESSOR 0:50657 TIME 1666471643 SOCKET 0 APIC 0
     Oct 23 05:47:23 crp-san-xenserver07.test.co.kr kernel: [2908752.593010] EDAC MC2: 0 CE memory read error on CPU_SrcID#1_MC#0_Chan#0_DIMM#0 (channel:0 slot:0 page:0x2be8411 offset:0x480 grain:32 syndrome:0x0 - err_code:0000:009f socket:1 imc:0 rank:1 bg:0 ba:1 row:5690 col:58)
     Oct 23 06:07:29 crp-san-xenserver07.test.co.kr kernel: [2909958.818467] UDP Refuse: IN=xenbr0 OUT= MAC=ff:ff:ff:ff:ff:ff:c6:c3:9a:b4:3a:26:08:00 SRC=0.0.0.0 DST=255.255.255.255 LEN=317 TOS=0x00 PREC=0xC0 TTL=64 ID=0 DF PROTO=UDP SPT=68 DPT=67 LEN=297
     Oct 23 07:07:40 crp-san-xenserver07.test.co.kr kernel: [2913569.427245] UDP Refuse: IN=xenbr0 OUT= MAC=ff:ff:ff:ff:ff:ff:c6:c3:9a:b4:3a:26:08:00 SRC=0.0.0.0 DST=255.255.255.255 LEN=317 TOS=0x00 PREC=0xC0 TTL=64 ID=0 DF PROTO=UDP SPT=68 DPT=67 LEN=297
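  3. EDAC
    • EDAC (Error Detection and Correction) is one of the Linux kernel facilities for detecting and reporting hardware errors. The correction itself is done by the ECC hardware, as the "corrected by h/w" lines in the log indicate.
    • It also supports detecting PCI bus transfer errors and errors on peripheral devices.
    • MCE-related logs are written by EDAC, the OS-level memory monitoring facility, which is less precise than the hardware's own memory monitoring. Occasionally EDAC's sensitive engine records an error even though no real fault exists.
    • When such a message appears, cross-check it against the hardware records (iLO, IML). If the hardware reports no problem, either ignore the message or disable MCE detection in the OS.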
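To make the cross-check easier, the final `EDAC MC2: ... CE memory read error` line from the kern log can be picked apart with a few lines of Python. This is only a sketch: the regular expression is written for the one message shape seen above, the function names are hypothetical, the page size is assumed to be 4 KiB, and the host is assumed to log in Korea Standard Time (UTC+9).

```python
import re
from datetime import datetime, timedelta, timezone

# Sample line copied from the kern.log excerpt (syslog prefix dropped).
EDAC_LINE = (
    "EDAC MC2: 0 CE memory read error on CPU_SrcID#1_MC#0_Chan#0_DIMM#0 "
    "(channel:0 slot:0 page:0x2be8411 offset:0x480 grain:32 syndrome:0x0 - "
    "err_code:0000:009f socket:1 imc:0 rank:1 bg:0 ba:1 row:5690 col:58)"
)

# Written for this one message shape only; other EDAC drivers may format differently.
EDAC_RE = re.compile(
    r"EDAC MC(?P<mc>\d+): \d+ (?P<kind>CE|UE) (?P<msg>.+?) on (?P<label>\S+) "
    r"\(channel:(?P<channel>\d+) slot:(?P<slot>\d+) "
    r"page:0x(?P<page>[0-9a-f]+) offset:0x(?P<offset>[0-9a-f]+)"
)

PAGE_SHIFT = 12  # assumption: 4 KiB pages

KST = timezone(timedelta(hours=9))  # assumption: the host logs in UTC+9


def parse_edac_line(line):
    """Return the key fields of an EDAC CE/UE summary line, or None if it does not match."""
    m = EDAC_RE.search(line)
    if m is None:
        return None
    page = int(m.group("page"), 16)
    offset = int(m.group("offset"), 16)
    return {
        "mc": int(m.group("mc")),
        "kind": m.group("kind"),
        "label": m.group("label"),
        "channel": int(m.group("channel")),
        "slot": int(m.group("slot")),
        # page number and in-page offset recombine into the physical address
        "phys_addr": (page << PAGE_SHIFT) | offset,
    }


def event_time(epoch_seconds, tz=KST):
    """Convert the TIME field of the MCE record into a timezone-aware datetime."""
    return datetime.fromtimestamp(epoch_seconds, tz=tz)


if __name__ == "__main__":
    info = parse_edac_line(EDAC_LINE)
    print(info["label"], hex(info["phys_addr"]))
    print(event_time(1666471643).isoformat())
```

Recombining `page:0x2be8411` and `offset:0x480` gives `0x2be8411480`, the same value as `ADDR 2be8411480` and the APEI `physical_address` field, so all three records describe one event. That address and the timestamp are the values to look for in iLO/IML.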

Types of errors

  • Correctable Error (CE) - the error detection mechanism detected and corrected the error. Such errors are usually not fatal, although some Kernel mechanisms allow the system administrator to consider them as fatal.
  • Uncorrected Error (UE) - the number of errors exceeded the error correction threshold, and the system was unable to auto-correct.
  • Fatal Error - when a UE happens on a critical component of the system (for example, a piece of the Kernel got corrupted by a UE), the only reliable way to avoid data corruption is to hang or reboot the machine.
  • Non-fatal Error - when a UE happens on an unused component, like a CPU in power down state or an unused memory bank, the system may still run, eventually replacing the affected hardware by a hot spare, if available.
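  1. The log reports a CE memory read error. A CE event is a correctable error, but if it occurs frequently the memory module needs to be replaced.
  2. The EDAC message names the memory controller, channel and slot (here MC2, channel:0 slot:0), but it does not say which physical module on the board that is, so the following checks help narrow it down a little further.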

Oct 23 05:47:23 crp-san-xenserver07.test.co.kr kernel: [2908752.592998] EDAC skx MC2: HANDLING MCE MEMORY ERROR
Oct 23 05:47:23 crp-san-xenserver07.test.co.kr kernel: [2908752.592999] EDAC skx MC2: CPU 0: Machine Check Event: 0 Bank 255: 940000000000009f
Oct 23 05:47:23 crp-san-xenserver07.test.co.kr kernel: [2908752.593000] EDAC skx MC2: TSC 0
Oct 23 05:47:23 crp-san-xenserver07.test.co.kr kernel: [2908752.593001] EDAC skx MC2: ADDR 2be8411480
Oct 23 05:47:23 crp-san-xenserver07.test.co.kr kernel: [2908752.593002] EDAC skx MC2: MISC 0
Oct 23 05:47:23 crp-san-xenserver07.test.co.kr kernel: [2908752.593002] EDAC skx MC2: PROCESSOR 0:50657 TIME 1666471643 SOCKET 0 APIC 0
# The kernel log above shows that the error occurred on mc2
cd /sys/devices/system/edac/mc
root # ls -ltr
total 0
-rw-r--r-- 1 root root 4096 Oct 23 07:22 uevent
drwxr-xr-x 2 root root 0 Oct 23 07:23 power
drwxr-xr-x 7 root root 0 Oct 23 07:23 mc2
drwxr-xr-x 7 root root 0 Oct 23 07:23 mc0
lrwxrwxrwx 1 root root 0 Oct 23 07:23 subsystem -> ../../../../bus/edac
drwxr-xr-x 7 root root 0 Oct 23 07:23 mc3
drwxr-xr-x 7 root root 0 Oct 23 07:23 mc1
root # cd mc2
[test] root@crp-san-xenserver07 /sys/devices/system/edac/mc/mc2 07:45 AM
root # ls
ce_count ce_noinfo_count dimm0 dimm1 dimm2 dimm3 max_location mc_name power reset_counters seconds_since_reset size_mb subsystem ue_count ue_noinfo_count uevent

This directory holds ce_count for the memory controller, and each dimm* directory has its own CE counter (dimm_ce_count).

[test] root@crp-san-xenserver07 /sys/devices/system/edac/mc 07:43 AM
root # cat mc*/dimm*/dimm_ce_count
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
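The same counters can be collected in one pass instead of running `cat` in each directory. The sketch below walks the sysfs layout shown above and reads only the files that appear in that listing (`ce_count` and `dimm*/dimm_ce_count`). The function names and the `root` parameter are hypothetical; `root` exists so the code can be pointed at a copy of the tree.

```python
from pathlib import Path

EDAC_ROOT = Path("/sys/devices/system/edac/mc")


def read_int(path):
    """Return the integer stored in a sysfs file, or None if it cannot be read."""
    try:
        return int(path.read_text().strip())
    except (OSError, ValueError):
        return None


def collect_ce_counts(root=EDAC_ROOT):
    """Map each mcN directory to its ce_count and its per-DIMM dimm_ce_count values."""
    counts = {}
    for mc in sorted(root.glob("mc[0-9]*")):
        counts[mc.name] = {
            "ce_count": read_int(mc / "ce_count"),
            "dimms": {
                dimm.name: read_int(dimm / "dimm_ce_count")
                for dimm in sorted(mc.glob("dimm[0-9]*"))
            },
        }
    return counts


if __name__ == "__main__":
    for mc_name, info in collect_ce_counts().items():
        print(mc_name, info)
```

Printing the controller-level and per-DIMM counters side by side makes it easy to see whether an error logged against a controller was also attributed to a specific DIMM.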


You can also dig further with lshw, lspci, or dmidecode.

  1. lshw
       *-memory
            description: System Memory
            physical id: 1000
            slot: System board or motherboard
            size: 256GiB
            capacity: 7680GiB
            capabilities: ecc
            configuration: errordetection=multi-bit-ecc
          *-bank:0
               description: DIMM DDR4 Synchronous Registered (Buffered) 2933 MHz (0.3 ns)
               product: HMA82GR7CJR8N-WM
               vendor: 00AD063200AD
               physical id: 0
               serial: 127D8C2F
               slot: A1
               size: 16GiB
               width: 64 bits
               clock: 2933MHz (0.3ns)
          *-bank:1
               description: DIMM DDR4 Synchronous Registered (Buffered) 2933 MHz (0.3 ns)
               product: HMA82GR7CJR8N-WM
               vendor: 00AD063200AD
               physical id: 1
               serial: 127D8C17
               slot: A2
               size: 16GiB
               width: 64 bits
               clock: 2933MHz (0.3ns)
     # The bank number and slot of each DIMM can be checked here.
     # The log showed Bank 255, but no bank 255 appears in the lshw output:
     # EDAC skx MC2: HANDLING MCE MEMORY ERROR
     # EDAC skx MC2: CPU 0: Machine Check Event: 0 Bank 255: 940000000000009f
     # (Bank in a Machine Check Event line most likely refers to the CPU's machine-check bank, not to the lshw memory bank index, so a match is not expected.)
  2. This page should be updated later with logs that show up in other forms.
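The lshw output from step 1 can be turned into a bank-to-slot table, which makes lookups by slot label quicker on a host with many DIMMs. This is a rough sketch with hypothetical function names; the sample text contains only the two banks quoted above, with some fields omitted. Whether the `fru_text: B1` field of the APEI record uses the same labels as the lshw `slot` field is an assumption that should be verified against iLO.

```python
import re

# Two bank entries taken from the lshw output above (fields trimmed for brevity).
LSHW_SAMPLE = """\
  *-memory
       description: System Memory
       slot: System board or motherboard
     *-bank:0
          description: DIMM DDR4 Synchronous Registered (Buffered) 2933 MHz (0.3 ns)
          product: HMA82GR7CJR8N-WM
          serial: 127D8C2F
          slot: A1
          size: 16GiB
     *-bank:1
          description: DIMM DDR4 Synchronous Registered (Buffered) 2933 MHz (0.3 ns)
          product: HMA82GR7CJR8N-WM
          serial: 127D8C17
          slot: A2
          size: 16GiB
"""


def parse_lshw_banks(text):
    """Return {bank_number: {field: value}} for every *-bank:N node in lshw text."""
    banks = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        m = re.match(r"\*-bank:(\d+)$", line)
        if m:
            current = banks.setdefault(int(m.group(1)), {})
            continue
        if line.startswith("*-"):
            current = None  # a different lshw node such as *-memory; skip its fields
            continue
        if current is not None and ": " in line:
            key, value = line.split(": ", 1)
            current[key] = value
    return banks


def find_by_slot(banks, slot_label):
    """Return the bank numbers whose slot field equals slot_label."""
    return [n for n, fields in sorted(banks.items()) if fields.get("slot") == slot_label]


if __name__ == "__main__":
    banks = parse_lshw_banks(LSHW_SAMPLE)
    for number, fields in sorted(banks.items()):
        print(number, fields["slot"], fields["serial"])
    print(find_by_slot(banks, "B1"))  # empty here: the sample holds only A1 and A2
```

On the real host the full `lshw` text would be passed in instead of `LSHW_SAMPLE`, and the serial number of the matching bank is what a replacement request would normally quote.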

 
