Troubleshooting
XenServer deployments
Tomasz Czajka, Sr. Support Engineer
8th of October 2010
Agenda
• Case study: “Production down”
• Learn: “XenServer crash”
• Case study: “Single-pathing”
• Q & A
“Production down”
Basic troubleshooting in XenCenter
VMs don’t start: why?
• Cannot start a VM: “The SR is not available” error
• Storage Repository (SR) in “broken” state
• “Repair” does not work.
Use CLI to troubleshoot
# xe pbd-list currently-attached=false
What is “broken”?
PBD = Physical Block Device
[Diagram: hosts XenServer_1 and XenServer_2, each connected to the SR through a PBD. Labels: has UUID (unique ID); SCSI ID; Volume Group name: <Prefix> + SR UUID]
Broken storage
Goal: Reproduce and analyse the logs
Storage troubleshooting
/var/log/xensource.log*, SMlog*, messages*
# tail -f /var/log/messages > /tmp/ShortLog
# date
# echo "Unplugging cable" >> messages
messages (UTC) <> xensource.log (local)
Plugging PBD manually
PBD unplugged
# xe pbd-list host-uuid=... sr-uuid=...
# xe pbd-plug uuid=...
SR_BACKEND_FAILURE_47: The SR is not available no such volume group:
VG_XenStorage-19856cba-830c-e298-79fa-84a79eb658f4
# xe sr-list name-label="My SR" params=uuid
19856cba-830c-e298-79fa-84a79eb658f4
# grep "PBD.plug" xensource.log
What is VG?
[Diagram: Logical Volume Manager (LVM). Each HDD / LUN is a Physical Volume (PV); a Volume Group (VG) is built on the PVs; the VG is divided into Logical Volumes (LVs). In XenServer, the Storage Repository (SR) is the Volume Group and each virtual disk (VDI) is a Logical Volume. Example shown: 3 VMs with 1 virtual disk each, so one SR holding 3 VDIs.]
Matching the UUID
Volume Group
# vgs
VG                                                 #PV #LV #SN Attr   VSize   VFree
VG_XenStorage-090d4717-9f91-92de-83c3-5458274802e9   1  18   0 wz--n-  89.99G  19.48G
VG_XenStorage-5239de43-6a74-0365-f825-b799aa6de853   1   2   0 wz--n- 129.07G 129.05G
VG_XenStorage-70a029cf-7f35-c035-4af7-07eaf31e2e88   1  11   0 wz--n-  49.99G   2.84G
VG_XenStorage-9be18df5-3fd2-4835-b864-d0ffbccbaeb3   1   1   0 wz--n-   1.99G   1.98G
# vgs 'VG_XenStorage-19856cba-830c-e298-79fa-84a79eb658f4'
Volume group "VG_XenStorage-19856cba-830c-e298-79fa-84a79eb658f4" not found
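The lookup above can be scripted. This is a minimal sketch, assuming the VG_XenStorage- prefix seen in the error message. The function names vg_name_for_sr and vg_exists are hypothetical helpers, and the VG list is copied from the vgs output above instead of being read from a live host.

```shell
#!/bin/sh
# Build the expected VG name for an SR UUID.
# Assumption: the prefix is "VG_XenStorage-", as in the error message above.
vg_name_for_sr() {
    printf 'VG_XenStorage-%s\n' "$1"
}

# Return 0 if the VG name in $1 appears as a whole line on stdin.
# On a live host the list could come from: vgs --noheadings -o vg_name
vg_exists() {
    sed 's/^[[:space:]]*//; s/[[:space:]]*$//' | grep -qxF -- "$1"
}

# VG names copied from the vgs output above.
sample_vgs='VG_XenStorage-090d4717-9f91-92de-83c3-5458274802e9
VG_XenStorage-5239de43-6a74-0365-f825-b799aa6de853
VG_XenStorage-70a029cf-7f35-c035-4af7-07eaf31e2e88
VG_XenStorage-9be18df5-3fd2-4835-b864-d0ffbccbaeb3'

wanted=$(vg_name_for_sr 19856cba-830c-e298-79fa-84a79eb658f4)
if printf '%s\n' "$sample_vgs" | vg_exists "$wanted"; then
    echo "$wanted found"
else
    echo "$wanted not found"
fi
```

With the sample list the script reports the VG as not found, which mirrors the vgs error on the slide.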
Checking SCSI ID
Examining HDD/LUN
• check SCSI ID (unique for each SCSI device)
# xe pbd-list params=device-config sr-uuid=...
device-config SCSIid: 360a9800050334f49633459
Can Linux kernel see this block device? (SCSI device)
Examining HDD/LUN
# hdparm -t /dev/disk/by-id/scsi-360a98045234t654...
Timing buffered disk reads:
138 MB in 3.02 seconds = 45.68 MB/sec
(LUN readable!)
Addressing SCSI disks
# ls -lR /dev/disk | grep 360a9800050334f4963345767656c546
• /dev/disk/by-id
    scsi-360a9800050334f4963345767656c546a -> /dev/sde
• /dev/disk/by-scsibus
    360a9800050334f4963345767656c546a-1:0:0:5 -> /dev/sdc
    360a9800050334f4963345767656c546a-2:0:0:5 -> /dev/sde
• /dev/mapper/360a9800050334f4963345767656c546
Also check /dev/disk/by-path
Is the LUN empty?
Examining HDD/LUN
# udevinfo -q all -n /dev/disk/by-id/scsi-360a9800050334f496334576765...
...
ID_FS_TYPE=LVM2_member
...
“If this is an LVM member, why is there no VG on it?”
Is there a VG created on PV?
Examining HDD/LUN
# pvs
PV                              VG                                      Fmt  Attr Psize   Free
/dev/mapper/360a9800050334f4963 VG_XenStorage-090d4717-9f91-92de-83c3-  lvm2 a-    89.99G  19.48G
/dev/mapper/360a9800050334f4963 VG_XenStorage-70a029cf-7f35-c035-4af7-  lvm2 a-    49.99G   2.84G
/dev/mapper/360a9800050334f4963 VG_XenStorage-19856cba-830c-e298-79fa-  lvm2 a-    14.99G   6.45G
/dev/mapper/360a9800050334f4965 VG_XenStorage-9be18df5-3fd2-4835-b864-  lvm2 a-     1.99G   1.98G
/dev/sda3                       VG_XenStorage-5239de43-6a74-0365-f825-  lvm2 a-   129.07G 129.05G
# pvs | grep 360a9800050334f496334595a32306431
PV VG Fmt Attr Psize Free
/dev/mapper/360a9800050334f496334595a32306431 VG_XenStorage-332432-430d-3423-4332434-5485974 lvm2 a- 14.99G 14.99G
# xe sr-list name-label="My SR" params=uuid
19856cba-830c-e298-79fa-84a79eb658f4
VG_XenStorage-<UUID> differs from SR UUID!
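The comparison between the VG name found on the PV and the SR UUID can be sketched as below. It assumes the VG name is the VG_XenStorage- prefix followed by the SR UUID. The function names sr_uuid_from_vg and compare_vg_to_sr are hypothetical, and the two values are copied from the pvs and xe sr-list output above.

```shell
#!/bin/sh
# Strip the assumed "VG_XenStorage-" prefix to recover the UUID embedded in a VG name.
sr_uuid_from_vg() {
    printf '%s\n' "${1#VG_XenStorage-}"
}

# Print "match" or "mismatch" for a VG name ($1) and an SR UUID ($2).
compare_vg_to_sr() {
    if [ "$(sr_uuid_from_vg "$1")" = "$2" ]; then
        echo match
    else
        echo mismatch
    fi
}

# VG name as reported by pvs, SR UUID as reported by xe sr-list.
compare_vg_to_sr 'VG_XenStorage-332432-430d-3423-4332434-5485974' \
                 '19856cba-830c-e298-79fa-84a79eb658f4'
```

A mismatch here means the VG on the LUN was created for a different SR UUID than the one xapi expects.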
No original VG on the LUN
Potential reasons:
• (Re)installation of host in the same pool
    • Unplug FC / Zoning
• (Re)installation of host in other pool
    • Zoning
• Adding SR with “xe sr-create” in CLI
...BE VERY CAREFUL!
Volume Group ...has been recreated!
• Lost LVM metadata
• Lost 100 MB of the VDI data
Action steps:
• Don’t shut down running VMs
• Online backup for running VMs (now)
• Block-level clone of the whole LUN (now)
• Assess professional data recovery
Looking for LVM metadata backup
Volume Group
/etc/lvm/backup/VG_XenStorage-19856cba-830c-e298-79fa-84a79eb658f4
• Check backup timestamp (within the file)
LVs in backup file
# cat /etc/lvm/backup/VG... | grep VHD
VDI in xapi database
# xe vdi-list sr-uuid=<uuid> params=uuid
Compare: LVs in backup file = VDIs in xapi database
Make a copy first
# cp /etc/lvm/backup/* /root/backup/
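Comparing the LVs in the backup file with the VDIs in the xapi database can be sketched as follows. The sketch assumes both lists have already been reduced to one entry per line: LV names of the form VHD-<VDI uuid>, and bare VDI uuids. The function name vdis_missing_from_backup and the all-zero uuid are hypothetical; the other uuids are taken from this case.

```shell
#!/bin/sh
# Print every VDI uuid from file $2 that has no matching VHD-<uuid> LV name in file $1.
vdis_missing_from_backup() {
    awk 'FILENAME == ARGV[1] { sub(/^VHD-/, ""); have[$0] = 1; next }
         !($0 in have)' "$1" "$2"
}

# Demo with sample data. On a host the two lists would be derived from the
# "grep VHD" and "xe vdi-list" commands shown above.
tmp=$(mktemp -d)
cat > "$tmp/lvs.txt" <<'EOF'
MGT
VHD-352d31ec-aeb6-4601-8ea9-990575dab395
VHD-7e5f83a7-b6c4-4fae-9899-1e6a2cdabd32
EOF
cat > "$tmp/vdis.txt" <<'EOF'
352d31ec-aeb6-4601-8ea9-990575dab395
7e5f83a7-b6c4-4fae-9899-1e6a2cdabd32
00000000-0000-0000-0000-000000000000
EOF
vdis_missing_from_backup "$tmp/lvs.txt" "$tmp/vdis.txt"
rm -rf "$tmp"
```

Any uuid printed is a VDI that xapi knows about but that the LVM backup does not describe, so restoring from that backup would not bring it back.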
Removing new VG and PV
Volume Group
# vgremove "VG_XenStorage-<new SR uuid>"
# pvremove /dev/mapper/<SCSI ID>
Recreating PV and VG from backup
Volume Group
# pvcreate --uuid <PV uuid from backup file> \
    --restorefile /etc/lvm/backup/VG_XenStorage-<SR UUID> \
    /dev/mapper/<SCSI ID>
# vgcfgrestore VG_XenStorage-<SR UUID> \
    -f /etc/lvm/backup/VG_XenStorage-<SR UUID>
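The pvcreate call needs the PV uuid recorded in the backup file. Below is a minimal sketch of pulling it out, assuming the usual LVM text backup layout in which the first id = "..." line inside the physical_volumes block belongs to the first PV. The function name pv_uuid_from_backup and the placeholder ids in the sample are hypothetical, and a VG spanning several PVs would need more care.

```shell
#!/bin/sh
# Print the id of the first PV listed in an LVM metadata backup file ($1).
# The VG-level id appears before the physical_volumes block and is skipped.
pv_uuid_from_backup() {
    awk '/physical_volumes[ \t]*[{]/ { in_pv = 1 }
         in_pv && /^[ \t]*id[ \t]*=/ {
             gsub(/"/, "", $3)   # drop the surrounding quotes
             print $3
             exit
         }' "$1"
}

# Demo on a cut-down sample; the ids are placeholders, not real values.
sample=$(mktemp)
cat > "$sample" <<'EOF'
VG_XenStorage-19856cba-830c-e298-79fa-84a79eb658f4 {
    id = "VG-ID-PLACEHOLDER"
    physical_volumes {
        pv0 {
            id = "PV-ID-PLACEHOLDER"
        }
    }
}
EOF
pv_uuid_from_backup "$sample"
rm -f "$sample"
```

Reading the value from a copy of the backup file, not from /etc/lvm/backup directly, fits the "make a copy first" advice above.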
Confirm that VG name contains SR uuid...
Examining HDD/LUN
# pvs | grep 360a9800050334f496334595a32306431
PV VG Fmt Attr Psize Free
/dev/mapper/360a9800050334f496334595a32306431 VG_XenStorage-19856cba-830c-e298-79fa-84a79eb658f4 lvm2 a- 14.99G 14.99G
# xe sr-list name-label="My SR" params=uuid
19856cba-830c-e298-79fa-84a79eb658f4
VG_XenStorage-<UUID> matches SR UUID
Checking Logical Volumes
Volume Group
# lvs
MGT                                      VG_XenStorage-19856cba-830c-e298-79fa-84a79eb658f4 -wi---   4.00M
VHD-352d31ec-aeb6-4601-8ea9-990575dab395 VG_XenStorage-19856cba-830c-e298-79fa-84a79eb658f4 -wi--- 520.00M
VHD-7e5f83a7-b6c4-4fae-9899-1e6a2cdabd32 VG_XenStorage-19856cba-830c-e298-79fa-84a79eb658f4 -wi---   4.02G
VHD-fbce18dd-397e-444e-9470-b6fa240243d9 VG_XenStorage-19856cba-830c-e298-79fa-84a79eb658f4 -wi---   4.02G
VHD-ff744448-1b7f-4cc8-80b1-cd38b6c90c98 VG_XenStorage-19856cba-830c-e298-79fa-84a79eb658f4 -wi--- 520.00M
Plugging PBD again...
Storage Repository
# xe pbd-plug uuid=…
# xe sr-scan uuid=…
Error code: SR_BACKEND_FAILURE_46
Error parameters: , The VDI is not available [opterr=Error scanning VDI 7e5f83a7-b6c4-4fae-9899-1e6a2cdabd32]
# xe vdi-list uuid=<VDI uuid from the error above>
# lvremove /dev/VG_XenStorage-19856cba-830c-e298-79fa-84a79eb658f4/VHD-7e5f83a7-b6c4-4fae-9899-1e6a2cdabd32
# xe sr-scan uuid=…
Success! But no VDIs shown...
Success! All VDIs shown...
Well done!
What we’ve learned
...by troubleshooting “Production Down” issue
• To get plugged, a PBD needs...
• LUN/HDD → PV → VG (SR) → LV (VDI)
• VG name generated from SR uuid (+ prefix)
• LV name generated from VDI uuid (+ prefix)
• Displaying VG (vgs), PV (pvs), LV (lvs)
• Addressing block devices (/dev/disk)
• Examining HDD/LUN with "hdparm -t"
• Restoring PV & VG from backup
“The XenServer Crash”
Unresponsive or rebooting host
The XenServer Crash?
• Kernel panic or crash dump
    • Error on Console, host locked
    • Memory addressing, Bug in OS, Hardware failure
• No Kernel Panic and no crash dump
    • Host rebooting / frozen / no errors on the console
    • Hardware failure, OS busy (I/O), user action
[Flowchart: troubleshooting a rebooting or unresponsive host. Boxes, de-duplicated, in extraction order:]
• /var/crash/<date> exists
• Symptom: Host rebooted itself
• Symptom: Host is unresponsive
• Serial console
• Review crashdump
• HA enabled
• Host fenced?
• Check /var/log/xha.log
• Disable HA
• HA disabled
• Add "noreboot" option in extlinux.conf
• Analyse /var/log/messages, xensource.log
• Still rebooting? Examine hardware
• No serial console
• Connect local console
• Any errors on the console?
• Analyse /var/log/messages, xensource.log for HA reasons
• Boot the host to the console
• Generate crashdump (CTX120540) & reboot
• Take photos and reboot
• Contact Citrix Tech Support
• No crashdump
Analyse /var/log/messages, xensource.log
As easy as grep
Startup strings:
# cd /var/log
# grep "klogd" messages -B100
# grep "SERVER START" xensource.log -B100
Getting into details…
Review crashdump
Inside crash log directory /var/crash/<stamp>
[Diagram: contents of the crash log directory]
• Files: crash.log, Domain0.log, Domain1,2,3...log, Debug.log, xen-memory-dump
• Annotations: Hypervisor console ring; Domain0 console ring; XenConsole ring; CPU stack (to be analysed by Citrix Tech Support); HA activity, page fault, driver, storage issues
Investigating crash.log
• Located at the bottom of the file:
    (XEN) Watchdog timer fired for domain 0
    (XEN) Domain 0 shutdown: watchdog rebooting machine.
• Why was the watchdog triggered?
    /var/log/xha.log (Network or Storage heartbeat failed)
• Why did the heartbeat fail?
    /var/log/messages (DMP, kernel, drivers, I/O errors)
Review crashdump (cont)
Page fault
Investigating crash.log
Other examples:
(XEN) ****************************************
(XEN) Panic on CPU 6:
(XEN) FATAL TRAP: vector = 14 (page fault)
(XEN) [error_code=0000] , IN INTERRUPT CONTEXT
(XEN) ****************************************
(XEN)
(XEN) Reboot in five seconds...
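The two crash.log examples above (watchdog and page fault) can be told apart with a small helper. This is a sketch only: the function name classify_crash_log is hypothetical, the 50-line window is an arbitrary choice for "the bottom of the file", and the patterns are just the two strings quoted on the slides.

```shell
#!/bin/sh
# Classify a crash.log ($1) by the (XEN) messages quoted on the slides.
classify_crash_log() {
    # Only the end of the file is inspected; 50 lines is an arbitrary window.
    tail_text=$(tail -n 50 "$1")
    case $tail_text in
        *'Watchdog timer fired'*) echo watchdog ;;
        *'FATAL TRAP'*)           echo panic ;;
        *)                        echo unknown ;;
    esac
}

# Demo on a sample built from the watchdog lines shown earlier.
sample=$(mktemp)
cat > "$sample" <<'EOF'
(XEN) Watchdog timer fired for domain 0
(XEN) Domain 0 shutdown: watchdog rebooting machine.
EOF
classify_crash_log "$sample"
rm -f "$sample"
```

A "watchdog" result points to /var/log/xha.log as the next file to read, while a "panic" result is a case for the crashdump review.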
Learn: XenServer crash
What we’ve learned
• Host really crashed?
• Kernel Panic
• Crashdump
• Triggering Crashdump manually
• Locating host reboot in the logs
• Reviewing crashdump logs
“Single-Pathing”
Storage Performance issue
• DMP has been enabled to improve performance
• Virtual Machines are running on different iSCSI SRs
LinuxGuestVM:~# hdparm -t /dev/xvdb
/dev/xvdb:
Timing buffered disk reads: 96 MB in 3.07 seconds = 30.41 MB/sec
Checking multipath status
Storage Performance
# mpathutil status
360a9800050334f496334596c71665246 dm-13 NETAPP,LUN
[size=2.0G][features=0][hwhandler=0][rw]
\_ round-robin 0 [prio=4][enabled]
 \_ 3:0:0:2 sdk 8:160 [active][ready]
 \_ 4:0:0:2 sdj 8:144 [active][ready]
Device locations: /dev/mapper/.... (multi-path device), /dev/ (single-path devices)
Determining current performance on domain0
Storage Performance
• Testing multi-path device
# hdparm -t /dev/mapper/<scsi id>
• Testing single-path devices
# hdparm -t /dev/sdj
# hdparm -t /dev/sdm
In all cases: 30 MB/sec
Determining usage of paths
Storage Performance
# iostat -x <device>
# iostat -x /dev/sdk /dev/sdj 5
Device  Blk_read/s  Blk_wrtn/s  Blk_read  Blk_wrtn
sdk     803.50      33.0        4122      160
sdj     784.00      32.8        3922      155
Both paths are used equally
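"Used equally" can be put into numbers. The sketch below computes each path's share of the read rate, assuming the column layout of the table above, where the second column is Blk_read/s. The function name path_read_share is hypothetical, and the figures are copied from the slide.

```shell
#!/bin/sh
# Read iostat-style rows on stdin and print each device's share of the total read rate.
# Assumption: column 1 is the device name and column 2 is the read rate;
# header rows are skipped because their second column is not numeric.
path_read_share() {
    awk '$2 ~ /^[0-9]+(\.[0-9]+)?$/ { rate[$1] = $2; total += $2 }
         END {
             if (total > 0)
                 for (dev in rate)
                     printf "%s %.1f%%\n", dev, 100 * rate[dev] / total
         }' | sort
}

# Figures copied from the iostat output above.
path_read_share <<'EOF'
Device Blk_read/s Blk_wrtn/s Blk_read Blk_wrtn
sdk 803.50 33.0 4122 160
sdj 784.00 32.8 3922 155
EOF
```

Shares close to 50% each support the conclusion on the slide: multipathing itself is balancing the load, so the bottleneck has to be elsewhere.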
Checking if there are really 2 iSCSI sessions
Storage Performance
# ls -alR /dev/disk/by-path/ | egrep "(sdk|sdj)"
ip-10.1.200.40:3260-iscsi-iqn.1992-08.com.netapp:MyNetapp.luns-lun-2 -> ../../sdk
ip-10.1.201.40:3260-iscsi-iqn.1992-08.com.netapp:MyNetapp.luns-lun-2 -> ../../sdj
Checking if different paths are really used
Storage Performance
# tcpdump -i any port 3260
# watch "ifconfig | egrep '(eth0 |eth1 )' -A5 | egrep '(eth|bytes)' "
eth0 Link encap:Ethernet HWaddr 00:1D:09:70:88:2C
     RX bytes:1490076463 (1.3 GiB) TX bytes:170615419 (162.7 MiB)
eth1 Link encap:Ethernet HWaddr 00:1D:09:70:88:2E
     RX bytes:1801238 (166 MiB) TX bytes:46695876 (44.5 MiB)
Checking source IP addresses for iSCSI sessions
Storage Performance
# netstat -at | grep iscsi
10.1.200.138:53049  10.1.200.40:iscsi-target  ESTABLISHED
10.1.200.178:46684  10.1.201.40:iscsi-target  ESTABLISHED
Checking kernel routing table
Storage Performance
# route
Destination Gateway Genmask Iface
10.1.200.0 * 255.255.255.0 xenbr0
10.1.200.0 * 255.255.255.0 xenbr1
default 10.1.200.1 0.0.0.0 xenbr0
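The routing table above lists 10.1.200.0 twice, once per bridge, which is the kind of overlap that sends all iSCSI traffic through one interface. Below is a sketch of spotting such duplicates automatically. The function name duplicate_routes is hypothetical, and the parsing assumes the destination is in column 1, the genmask in column 3 and the interface in the last column, as in the trimmed output on the slide.

```shell
#!/bin/sh
# Read route output on stdin and print every destination/genmask pair
# that is reachable through more than one interface.
duplicate_routes() {
    awk '$1 ~ /^[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+$/ {
             key = $1 "/" $3
             ifaces[key] = ifaces[key] " " $NF
             count[key]++
         }
         END {
             for (k in count)
                 if (count[k] > 1)
                     print k ":" ifaces[k]
         }'
}

# Routing table copied from the slide above.
duplicate_routes <<'EOF'
Destination Gateway Genmask Iface
10.1.200.0 * 255.255.255.0 xenbr0
10.1.200.0 * 255.255.255.0 xenbr1
default 10.1.200.1 0.0.0.0 xenbr0
EOF
```

An empty result means every subnet has exactly one outgoing interface, which is what the corrected table later in this case study shows.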
Configuration of management interfaces in XenCenter
Storage Performance
Change the IP address of ISCSI_2 to 10.1.201.78
Determining current performance on domain0
Storage Performance
# route
Destination Gateway Genmask Iface
10.1.200.0 * 255.255.255.0 xenbr0
10.1.201.0 * 255.255.255.0 xenbr1
default 10.1.200.1 0.0.0.0 xenbr0
Configuring kernel routing table
Storage Performance
...or (not recommended)
• Add to /etc/rc.local
# route add -host 10.1.200.40 xenbr0
# route add -host 10.1.201.40 xenbr1
• What about Pool Upgrade and Pool Join?
Determining current performance on VM
Storage Performance
LinuxVM:~# hdparm -t /dev/xvdb
/dev/xvdb:
Timing buffered disk reads: 45 MB/sec
Well done!
Case study: Single-pathing
What we’ve learned
• /dev/ locations for single and multi-path devices
• # mpathutil status
• # hdparm -t
• # iostat
• # ifconfig, # tcpdump, # netstat, # route
• # watch
• Best practices for iSCSI storage
Questions
First aid kit
Resources
• http://docs.xensource.com - XenServer documentation
• http://support.citrix.com/product/xens/ - Knowledge Center
• http://forums.citrix.com/support - Support forums
• http://community.citrix.com/citrixready/xenserver - XenServer Central (one-stop information center)
Before you leave…
• Session surveys are available online at www.citrixsynergy.com starting Thursday, 7 October
• Provide your feedback and pick up a complimentary gift card at the registration desk