
LVM and Raid disk array management and re-activation

posted Nov 19, 2014, 4:17 PM by Dong Xu   [ updated Feb 16, 2016, 8:43 AM ]
Use smartd, lsscsi, dmesg, /var/log/warn and hdparm to pinpoint the BAD drive

hdparm -i /dev/sd[abc...]

To check the hard drives' block IDs,

/sbin/blkid

also

ls -l /dev/disk/by-id/

This lists all drives with their serial numbers.
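A minimal triage sketch combining the tools above (smartctl is the command-line front end to the smartmontools/smartd package; /dev/sdb is just a placeholder for the suspect drive):

smartctl -H /dev/sdb                   # quick SMART health verdict
smartctl -a /dev/sdb                   # full SMART attributes; watch reallocated/pending sector counts
lsscsi                                 # map SCSI targets to /dev/sd* names
dmesg | grep -iE 'ata|error|fail'      # recent kernel I/O errors
hdparm -i /dev/sdb                     # model and serial number as reported by the drive
grep -i sdb /var/log/warn | tail       # warnings logged for that drive (SUSE warn log)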


Update: after 2 consecutive disk failures, I lost most of my data...

Lessons learned: (1) Seagate sucks big time; the warranty only covers up to 2400 power-on hours (100 days??). I will avoid Seagate/WD at all cost; Hitachi, Toshiba, and Samsung are better than the worst! (2) Replace any disk that shows even a slight sign of failure in smartd, dmesg, or the warn log. (3) Back up constantly! (4) I have no confidence in RAID5 and XFS; the new system will use 2 LVM arrays, task + backup, on ext3 (up to 16 TB). If one disk fails, replace it, set up a new volume group, then rsync the data.

LVM

pvdisplay    # show physical volume details
pvscan       # scan all devices for physical volumes
vgdisplay    # show volume group details
vgscan       # scan for volume groups
lvdisplay    # show logical volume details
lvscan       # scan for logical volumes


Set up volume groups and logical volumes using Expert Partitioner.

Red Hat has pretty good documentation: https://access.redhat.com/documentation/en-US/Red_Hat_Enterprise_Linux/6/html/Logical_Volume_Manager_Administration/LVM_CLI.html

Detailed LVM commands: http://wiki.gentoo.org/wiki/LVM

For a bad drive, delete the VG in the Partitioner first and replace the drive.
Put in a new drive, set up a new VG, format the LV, and rsync all the files.
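A minimal command-line sketch of that rebuild workflow (the device /dev/sdb1, the vg_task/lv_task names, and the mount points are made-up examples; the Expert Partitioner can do the same steps):

pvcreate /dev/sdb1                          # initialize the new disk's partition as a physical volume
vgcreate vg_task /dev/sdb1                  # create a fresh volume group on it
lvcreate -l 100%FREE -n lv_task vg_task     # one logical volume spanning the whole group
mkfs.ext3 /dev/vg_task/lv_task              # format as ext3, per the notes above
mount /dev/vg_task/lv_task /task            # mount the new volume
rsync -avH /backup/task/ /task/             # copy the data back from the backup array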

Detailed SMART commands: https://wiki.archlinux.org/index.php/S.M.A.R.T.

=====================================================

My experience of replacing a failed drive...

After removing the bad drive and adding a new /sdb1 partition,
cat /proc/mdstat
does show that md0 is being rebuilt.
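For reference, a minimal sketch of the remove/replace/add sequence (assuming the failed member is /dev/sdb1 in /dev/md0; adjust to your array):

mdadm /dev/md0 --fail /dev/sdb1        # mark the dying member as failed
mdadm /dev/md0 --remove /dev/sdb1      # remove it from the array
# physically swap the drive and recreate the partition on the new disk
mdadm /dev/md0 --add /dev/sdb1         # add the new member; the rebuild starts
cat /proc/mdstat                       # watch the recovery progress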

xfs_repair /dev/md0

xfs_repair -L /dev/md       Warning! This corrupted my data!
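A safer-first sketch, if you must run xfs_repair: check in no-modify mode before letting it touch the disk (the filesystem has to be unmounted either way):

umount /dev/md0            # xfs_repair needs the filesystem unmounted
xfs_repair -n /dev/md0     # dry run: report problems without writing anything
xfs_repair /dev/md0        # only then repair for real; avoid -L unless you have no other option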

========================================================================
It seems that even Linux needs a break (a poweroff/reboot cycle) after ~120 days of uptime (about four months); otherwise I saw disk errors that would paralyze the RAID /dev/md0.

In the event of a crash where no disk has completely failed, here are the commands to re-activate the RAID:

1. You notice that md0 is not started during boot, or fails to mount.
2. Check dmesg and search for "XFS" or "md0".
3. Make sure all disks are still okay. I've seen these errors:
comreset failed (errno=-16)
link slow...

In this case, power off and let the server breathe and cool down a bit.

4. Restart. When these errors no longer appear, it is time to re-assemble the RAID.
Check the RAID status:

allspice:/data # cat /proc/mdstat
Personalities : [raid6] [raid5] [raid4]
md0 : active raid5 sdb1[0] sdi1[8] sdh1[6] sdg1[5] sdf1[4] sde1[3] sdc1[1]
      13674526208 blocks super 1.0 level 5, 128k chunk, algorithm 0 [8/7] [UU_UUUUU]
      bitmap: 48/466 pages [192KB], 2048KB chunk

garlic:/home/dxu # cat /proc/mdstat

Personalities : [raid0]
md0 : active raid0 sdb1[0] sdt1[18] sds1[17] sdr1[16] sdq1[15] sdp1[14] sdo1[13] sdn1[12] sdm1[11] sdl1[10] sdk1[9] sdj1[8] sdi1[7] sdh1[6] sdg1[5] sdf1[4] sde1[3] sdd1[2] sdc1[1]
      39041033280 blocks super 1.0 64k chunks

allspice:/data # mdadm --detail /dev/md0
/dev/md0:
        Version : 1.00
  Creation Time : Wed Dec  5 09:22:06 2012
     Raid Level : raid5
     Array Size : 13674526208 (13041.04 GiB 14002.71 GB)
  Used Dev Size : 1953503744 (1863.01 GiB 2000.39 GB)
   Raid Devices : 8
  Total Devices : 7
    Persistence : Superblock is persistent

  Intent Bitmap : Internal

    Update Time : Wed Nov 19 16:54:51 2014
          State : active, degraded
 Active Devices : 7
Working Devices : 7
 Failed Devices : 0
  Spare Devices : 0

         Layout : left-asymmetric
     Chunk Size : 128K

           Name : allspice:0  (local to host allspice)
           UUID : 231e6db4:47179ae5:a623b946:f5c0689b
         Events : 50922

    Number   Major   Minor   RaidDevice State
       0       8       17        0      active sync   /dev/sdb1
       1       8       33        1      active sync   /dev/sdc1
       2       0        0        2      removed
       3       8       65        3      active sync   /dev/sde1
       4       8       81        4      active sync   /dev/sdf1
       5       8       97        5      active sync   /dev/sdg1
       6       8      113        6      active sync   /dev/sdh1
       8       8      129        7      active sync   /dev/sdi1

garlic:/home/dxu # mdadm --detail /dev/md0

/dev/md0:
        Version : 1.00
  Creation Time : Mon Sep 26 14:06:56 2011
     Raid Level : raid0
     Array Size : 39041033280 (37232.43 GiB 39978.02 GB)
   Raid Devices : 19
  Total Devices : 19
    Persistence : Superblock is persistent

    Update Time : Mon Sep 26 14:06:56 2011
          State : clean
 Active Devices : 19
Working Devices : 19
 Failed Devices : 0
  Spare Devices : 0

     Chunk Size : 64K

           Name : garlic:0  (local to host garlic)
           UUID : 4ad68955:6c768a39:70bba032:20cddd15
         Events : 0

    Number   Major   Minor   RaidDevice State
       0       8       17        0      active sync   /dev/sdb1
       1       8       33        1      active sync   /dev/sdc1
       2       8       49        2      active sync   /dev/sdd1
       3       8       65        3      active sync   /dev/sde1
       4       8       81        4      active sync   /dev/sdf1
       5       8       97        5      active sync   /dev/sdg1
       6       8      113        6      active sync   /dev/sdh1
       7       8      129        7      active sync   /dev/sdi1
       8       8      145        8      active sync   /dev/sdj1
       9       8      161        9      active sync   /dev/sdk1
      10       8      177       10      active sync   /dev/sdl1
      11       8      193       11      active sync   /dev/sdm1
      12       8      209       12      active sync   /dev/sdn1
      13       8      225       13      active sync   /dev/sdo1
      14       8      241       14      active sync   /dev/sdp1
      15      65        1       15      active sync   /dev/sdq1
      16      65       17       16      active sync   /dev/sdr1
      17      65       33       17      active sync   /dev/sds1
      18      65       49       18      active sync   /dev/sdt1


Examine individual disks

allspice:/home/dxu # mdadm -E /dev/sdd1
/dev/sdd1:
          Magic : a92b4efc
        Version : 1.0
    Feature Map : 0x1
     Array UUID : 231e6db4:47179ae5:a623b946:f5c0689b
           Name : allspice:0  (local to host allspice)
  Creation Time : Wed Dec  5 09:22:06 2012
     Raid Level : raid5
   Raid Devices : 8

 Avail Dev Size : 3907007664 (1863.01 GiB 2000.39 GB)
     Array Size : 27349052416 (13041.04 GiB 14002.71 GB)
  Used Dev Size : 3907007488 (1863.01 GiB 2000.39 GB)
   Super Offset : 3907007920 sectors
          State : clean
    Device UUID : 041fdebc:61921109:01eaa3a0:fe0a0ffa

Internal Bitmap : -240 sectors from superblock
    Update Time : Thu Nov 13 19:33:49 2014
       Checksum : be92f041 - correct
         Events : 23542

         Layout : left-asymmetric
     Chunk Size : 128K

   Device Role : Active device 2
   Array State : AAAAAAAA ('A' == active, '.' == missing)

Stop the RAID before re-assembling it:
allspice:/home/dxu # mdadm --stop /dev/md0
mdadm: stopped /dev/md0

Now, re-activate
allspice:/home/dxu # mdadm -A --force /dev/md0 /dev/sd[bcdefghi]1
mdadm: forcing event count in /dev/sdi1(7) from 50893 upto 50906
mdadm: clearing FAULTY flag for device 7 in /dev/md0 for /dev/sdi1
mdadm: /dev/md0 has been started with 7 drives (out of 8)

Check again with cat /proc/mdstat.

Other useful mdadm commands

mdadm --examine --scan
Check the output against /etc/mdadm.conf (open it with vi); they should be the same.
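For example, a quick way to compare the two (the temporary file path is arbitrary):

mdadm --examine --scan > /tmp/mdadm-scan.conf
diff /tmp/mdadm-scan.conf /etc/mdadm.conf     # the ARRAY lines should agree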