(1)

Troubleshooting XenServer deployments

Tomasz Czajka, Sr. Support Engineer

8th of October 2010

(2)

Agenda

• Case study: “Production down”

• Learn: “XenServer crash”

• Case study: “Single-pathing”

• Q & A

(3)

“Production down”

(4)

Basic troubleshooting in XenCenter

VMs don’t start: why?

• Cannot start a VM: “The SR is not available” error

• Storage Repository (SR) in “broken” state

• “Repair” does not work

Use CLI to troubleshoot

(5)

Broken storage

What is “broken”?

# xe pbd-list currently-attached=false

PBD = Physical Block Device

[Diagram: XenServer_1 and XenServer_2 each attach to the SR through a PBD. Labels: “has UUID (unique ID)”, “SCSI ID”, and Volume Group “Name: <Prefix> + SR UUID”.]

(6)

Goal: Reproduce and analyse the logs

Storage troubleshooting

/var/log/xensource.log* ; SMlog* ; messages* ;

# tail -f /var/log/messages > /tmp/ShortLog

# date

# echo "Unplugging cable" >> messages

messages (UTC) <> xensource.log (local)

(7)

Plugging PBD manually

PBD unplugged

# xe pbd-list host-uuid=... sr-uuid=...

# xe pbd-plug uuid=...

SR_BACKEND_FAILURE_47: The SR is not available

no such volume group: VG_XenStorage-19856cba-830c-e298-79fa-84a79eb658f4

# xe sr-list name-label="My SR" params=uuid

19856cba-830c-e298-79fa-84a79eb658f4

# grep "PBD.plug" xensource.log

(8)

Volume Group

What is VG?

Logical Volume Manager (LVM) layers: HDD / LUN → Physical Volume (PV) → Volume Group (VG) → Logical Volume (LV)

In XenServer terms: Storage Repository (SR) = Volume Group; Virtual Disk (VDI) = Logical Volume

[Diagram: example with 3 VMs, 1 virtual disk each; three VDIs on one SR.]

(9)

Matching the UUID

Volume Group

# vgs

VG                                                 #PV #LV #SN Attr   VSize   VFree
VG_XenStorage-090d4717-9f91-92de-83c3-5458274802e9   1  18   0 wz--n-  89.99G  19.48G
VG_XenStorage-5239de43-6a74-0365-f825-b799aa6de853   1   2   0 wz--n- 129.07G 129.05G
VG_XenStorage-70a029cf-7f35-c035-4af7-07eaf31e2e88   1  11   0 wz--n-  49.99G   2.84G
VG_XenStorage-9be18df5-3fd2-4835-b864-d0ffbccbaeb3   1   1   0 wz--n-   1.99G   1.98G

# vgs 'VG_XenStorage-19856cba-830c-e298-79fa-84a79eb658f4'

Volume group "VG_XenStorage-19856cba-830c-e298-79fa-84a79eb658f4" not found
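The UUID match on this slide can be scripted. The following is a minimal sketch, assuming the `VG_XenStorage-` prefix seen in the `vgs` output; the function names `expected_vg_name` and `vg_present_for_sr` are hypothetical helpers, not part of XenServer or LVM.

```shell
# Build the VG name expected for an LVM-based SR with UUID $1.
# Assumption: the prefix is "VG_XenStorage-", as in the vgs output.
expected_vg_name() {
    printf 'VG_XenStorage-%s\n' "$1"
}

# Read VG names (one per line) on stdin and succeed only if the VG
# expected for SR UUID $1 is among them. Blanks are stripped first
# because vgs pads its columns.
vg_present_for_sr() {
    tr -d '[:blank:]' | grep -qxF "$(expected_vg_name "$1")"
}
```

On a host this could be fed with `vgs --noheadings -o vg_name | vg_present_for_sr <SR UUID>`; a non-zero exit status corresponds to the “not found” case above.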

(10)

Checking SCSI ID

Examining HDD/LUN

• check SCSI ID (unique for each SCSI device)

# xe pbd-list params=device-config sr-uuid=...

device-config SCSIid: 360a9800050334f49633459


(11)

Can the Linux kernel see this block device? (SCSI device)

Examining HDD/LUN

# hdparm -t /dev/disk/by-id/scsi-360a98045234t654...

Timing buffered disk reads: 138 MB in 3.02 seconds = 45.68 MB/sec

(LUN readable!)

(12)

Addressing SCSI disks

# ls -lR /dev/disk | grep 360a9800050334f4963345767656c546

/dev/disk/by-id

• scsi-360a9800050334f4963345767656c546a -> /dev/sde

/dev/disk/by-scsibus

• 360a9800050334f4963345767656c546a-1:0:0:5 -> /dev/sdc

• 360a9800050334f4963345767656c546a-2:0:0:5 -> /dev/sde

/dev/mapper/360a9800050334f4963345767656c546

Also check /dev/disk/by-path

(13)

Is the LUN empty?

Examining HDD/LUN

# udevinfo -q all -n /dev/disk/by-id/scsi-360a9800050334f496334576765...

...

ID_FS_TYPE=LVM2_member

...

“If this is an LVM member, why is there no VG on it?”

(14)

Is there a VG created on PV?

Examining HDD/LUN

# pvs

PV                               VG                                       Fmt  Attr PSize   PFree
/dev/mapper/360a9800050334f4963  VG_XenStorage-090d4717-9f91-92de-83c3-  lvm2 a-    89.99G  19.48G
/dev/mapper/360a9800050334f4963  VG_XenStorage-70a029cf-7f35-c035-4af7-  lvm2 a-    49.99G   2.84G
/dev/mapper/360a9800050334f4963  VG_XenStorage-19856cba-830c-e298-79fa-  lvm2 a-    14.99G   6.45G
/dev/mapper/360a9800050334f4965  VG_XenStorage-9be18df5-3fd2-4835-b864-  lvm2 a-     1.99G   1.98G
/dev/sda3                        VG_XenStorage-5239de43-6a74-0365-f825-  lvm2 a-   129.07G 129.05G

# pvs | grep 360a9800050334f496334595a32306431

PV VG Fmt Attr PSize PFree

/dev/mapper/360a9800050334f496334595a32306431 VG_XenStorage-332432-430d-3423-4332434-5485974 lvm2 a- 14.99G 14.99G

# xe sr-list name-label="My SR" params=uuid

19856cba-830c-e298-79fa-84a79eb658f4

VG_XenStorage-<UUID> differs from the SR UUID!
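The comparison between the VG found on the PV and the SR UUID can be sketched as a pair of shell functions. This is an illustration only: `vg_on_pv` and `pv_belongs_to_sr` are hypothetical names, and the sketch assumes `pvs` output with the PV in the first column and the VG in the second, as shown on the slide.

```shell
# Read `pvs` output on stdin and print the VG recorded on the first PV
# whose device path contains SCSI ID $1 (columns: PV VG Fmt Attr ...).
vg_on_pv() {
    awk -v id="$1" 'index($1, id) > 0 { print $2; exit }'
}

# Read `pvs` output on stdin; succeed only if the VG on the PV with
# SCSI ID $1 is the one expected for SR UUID $2.
# Assumption: VG names use the "VG_XenStorage-" prefix.
pv_belongs_to_sr() {
    [ "$(vg_on_pv "$1")" = "VG_XenStorage-$2" ]
}
```

For example, `pvs | pv_belongs_to_sr <SCSI ID> <SR UUID>` returning non-zero would indicate the mismatch described on this slide.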

(15)

Potential reasons:

No original VG on the LUN

• (Re)installation of host in the same pool

Unplug FC / Zoning

• (Re)installation of host in other pool

Zoning

• Adding SR with “xe sr-create” in CLI

...BE VERY CAREFUL!

(16)

Volume Group

...has been recreated!

• Lost LVM metadata

• Lost 100 MB of the VDI data

Action steps:

• Don’t shut down running VMs

• Online backup for running VMs (now)

• Block-level clone of the whole LUN (now)

• Assess professional data recovery

(17)

Looking for LVM metadata backup

Volume Group

/etc/lvm/backup/VG_XenStorage-19856cba-830c-e298-79fa-84a79eb658f4

• Check backup timestamp (within the file)

Make a copy first

# cp /etc/lvm/backup/* /root/backup/

LVs in backup file

# cat /etc/lvm/backup/VG... | grep VHD

VDIs in xapi database

# xe vdi-list sr=<uuid> params=uuid

LVs in backup file = VDIs in xapi database
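Comparing the LVs named in the backup file with the VDIs known to xapi can be done mechanically. The sketch below assumes the `VHD-<uuid>` LV naming seen in this deck and a plain text file with one VDI UUID per line (for instance, the `xe vdi-list` output reduced to bare UUIDs); `lv_uuids_in_backup` and `diff_lvs_and_vdis` are hypothetical helper names.

```shell
# Print the UUIDs of all VHD-<uuid> logical volumes named in an LVM
# metadata backup file ($1), one per line, sorted and de-duplicated.
lv_uuids_in_backup() {
    grep -o 'VHD-[0-9a-f-]\{36\}' "$1" | sed 's/^VHD-//' | sort -u
}

# Compare the backup file ($1) with a file of VDI UUIDs ($2, one per line).
# Prints UUIDs found on only one side: unindented = only in the backup,
# tab-indented = only in the VDI list. No output means both sides match.
diff_lvs_and_vdis() {
    lv_list=$(mktemp) || return 1
    lv_uuids_in_backup "$1" > "$lv_list"
    sort -u "$2" | comm -3 "$lv_list" -
    rm -f "$lv_list"
}
```

An empty result from `diff_lvs_and_vdis` suggests the backup describes the same set of virtual disks as the xapi database; it does not verify the backup's timestamp, which still needs checking by hand.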

(18)

Removing new VG and PV

Volume Group

# vgremove "VG_XenStorage-<new SR uuid>"

# pvremove /dev/mapper/<SCSI ID>

(19)

Recreating PV and VG from backup

Volume Group

# pvcreate --uuid <PV uuid from backup file> \
    --restorefile /etc/lvm/backup/VG_XenStorage-<SR UUID> \
    /dev/mapper/<SCSI ID>

# vgcfgrestore VG_XenStorage-<SR UUID> \
    -f /etc/lvm/backup/VG_XenStorage-<SR UUID>

(20)

Confirm that VG name contains SR uuid...

Examining HDD/LUN

# pvs | grep 360a9800050334f496334595a32306431

PV VG Fmt Attr PSize PFree

/dev/mapper/360a9800050334f496334595a32306431 VG_XenStorage-19856cba-830c-e298-79fa-84a79eb658f4 lvm2 a- 14.99G 14.99G

# xe sr-list name-label="My SR" params=uuid

19856cba-830c-e298-79fa-84a79eb658f4

VG_XenStorage-<UUID> matches the SR UUID

(21)

Checking Logical Volumes

Volume Group

# lvs

MGT                                       VG_XenStorage-19856cba-830c-e298-79fa-84a79eb658f4 -wi---   4.00M
VHD-352d31ec-aeb6-4601-8ea9-990575dab395  VG_XenStorage-19856cba-830c-e298-79fa-84a79eb658f4 -wi--- 520.00M
VHD-7e5f83a7-b6c4-4fae-9899-1e6a2cdabd32  VG_XenStorage-19856cba-830c-e298-79fa-84a79eb658f4 -wi---   4.02G
VHD-fbce18dd-397e-444e-9470-b6fa240243d9  VG_XenStorage-19856cba-830c-e298-79fa-84a79eb658f4 -wi---   4.02G
VHD-ff744448-1b7f-4cc8-80b1-cd38b6c90c98  VG_XenStorage-19856cba-830c-e298-79fa-84a79eb658f4 -wi--- 520.00M

(22)

Plugging PBD again...

Storage Repository

# xe pbd-plug uuid=…

Success! But no VDIs shown...

# xe sr-scan uuid=…

Error code: SR_BACKEND_FAILURE_46

Error parameters: , The VDI is not available [opterr=Error scanning VDI 7e5f83a7-b6c4-4fae-9899-1e6a2cdabd32]

# xe vdi-list uuid=<VDI UUID from the error above>

# lvremove /dev/VG_XenStorage-19856cba-830c-e298-79fa-84a79eb658f4/VHD-7e5f83a7-b6c4-4fae-9899-1e6a2cdabd32

# xe sr-scan uuid=…

Success! All VDIs shown...

Well done!

(23)

What we’ve learned

...by troubleshooting the “Production Down” issue

To get plugged, a PBD needs: LUN/HDD → PV → VG (SR) → LV (VDI)

VG name is generated from the SR uuid (+ prefix)

LV name is generated from the VDI uuid (+ prefix)

Displaying VG (vgs), PV (pvs), LV (lvs)

Addressing block devices (/dev/disk)

Examining HDD/LUN with "hdparm -t"

Restoring PV & VG from backup

(24)

“The XenServer Crash”

(25)

Unresponsive or rebooting host

The XenServer Crash?

• Kernel panic or crash dump

Error on Console, host locked

Memory addressing, Bug in OS, Hardware failure

• No Kernel Panic and no crash dump

Host rebooting / frozen / no errors on the console

Hardware failure, OS busy (I/O), user action

(26)

[Flowchart: triage of a crashed host. The connections between boxes did not survive extraction; the box labels are grouped below.]

Symptom: Host rebooted itself

Symptom: Host is unresponsive

• /var/crash/<date> exists: review crashdump

• No crashdump

• HA enabled: Host fenced? Check /var/log/xha.log; analyse /var/log/messages and xensource.log for HA reasons; disable HA

• HA disabled: add “noreboot” option in extlinux.conf; analyse /var/log/messages and xensource.log

• Still rebooting? → examine hardware

• Serial console: boot the host to the console (CTX120540) & reboot; generate crashdump (CTX120540) & reboot; review crashdump

• No serial console: connect local console. Any errors on the console? Take photos and reboot

• Contact Citrix Tech Support

(27)

Getting into details…

As easy as grep

Analyse /var/log/messages, xensource.log

Startup strings:

# cd /var/log

# grep "klogd" messages -B100

# grep "SERVER START" xensource.log -B100

(28)

Review crashdump

Inside crash log directory: /var/crash/<stamp>

Files: Domain0.log, crash.log, Domain1,2,3...log, Debug.log, xen-memory-dump

Callouts on the slide: Hypervisor console ring; Domain0 console ring; CPU stack (to be analysed by Citrix Tech Support); HA activity, page fault, driver, storage issues

(29)

Review crashdump (cont)

Investigating crash.log

XenConsole ring: located at the bottom of the file

(XEN) Watchdog timer fired for domain 0

(XEN) Domain 0 shutdown: watchdog rebooting machine.

Why was the watchdog triggered?

→ /var/log/xha.log (network or storage heartbeat failed)

Why did the heartbeat fail?

→ /var/log/messages (DMP, kernel, drivers, I/O errors)

(30)

Page fault

Investigating crash.log

Other examples:

(XEN) ****************************************
(XEN) Panic on CPU 6:
(XEN) FATAL TRAP: vector = 14 (page fault)
(XEN) [error_code=0000] , IN INTERRUPT CONTEXT
(XEN) ****************************************
(XEN)
(XEN) Reboot in five seconds...
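A first pass over crash.log can look for the two signatures quoted in these slides. The sketch below is a rough classifier under that assumption only; `classify_crash_log` is a hypothetical helper, and real crash logs may contain other messages that it reports as "unknown".

```shell
# Classify a crash.log ($1) by the two (XEN) signatures shown in the slides.
# Prints "watchdog", "page-fault", or "unknown".
classify_crash_log() {
    if grep -q 'Watchdog timer fired for domain 0' "$1"; then
        # HA watchdog reboot: continue in /var/log/xha.log
        echo watchdog
    elif grep -q 'FATAL TRAP: vector = 14 (page fault)' "$1"; then
        echo page-fault
    else
        echo unknown
    fi
}
```

A "watchdog" result points back to the HA heartbeat checks in xha.log, while "page-fault" or "unknown" is a case for reviewing the full crashdump with support.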

(31)

Learn: XenServer crash

What we’ve learned

Host really crashed?

Kernel Panic

Crashdump

Triggering Crashdump manually

Locating host reboot in the logs

Reviewing crashdump logs

(32)

“Single-Pathing”

(33)

Storage Performance issue

• DMP has been enabled to improve performance

• Virtual Machines are running on different iSCSI SRs

LinuxGuestVM:~# hdparm -t /dev/xvdb

/dev/xvdb:

Timing buffered disk reads: 96 MB in 3.07 seconds = 30.41 MB/sec

(34)

Checking multipath status

Storage Performance

# mpathutil status

360a9800050334f496334596c71665246 dm-13 NETAPP,LUN
[size=2.0G][features=0][hwhandler=0][rw]
\_ round-robin 0 [prio=4][enabled]
 \_ 3:0:0:2 sdk 8:160 [active][ready]
 \_ 4:0:0:2 sdj 8:144 [active][ready]

Callouts: /dev/mapper/.... (multipath device) and /dev/ (individual paths)

(35)

Determining current performance on domain0

Storage Performance

• Testing multi-path device

# hdparm -t /dev/mapper/<scsi id>

• Testing single-path devices

# hdparm -t /dev/sdj

# hdparm -t /dev/sdk

In all cases: 30 MB/sec

(36)

Determining usage of paths

Storage Performance

# iostat -x <device>

# iostat -x /dev/sdk /dev/sdj 5

Device  Blk_read/s  Blk_wrtn/s  Blk_read  Blk_wrtn
sdk         803.50        33.0      4122       160
sdj         784.00        32.8      3922       155

Both paths are used equally
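To put a number on “equally”, the per-path share of reads can be computed from the table. This is a sketch that assumes the column layout shown on this slide (device name first, Blk_read in the fourth column); actual iostat output may carry extra columns, so the field index would need adjusting. `path_read_share` is a hypothetical helper name.

```shell
# Read an iostat-style table on stdin and print each device's share of
# total blocks read, sorted by device name.
# Assumption: column 1 = device, column 4 = Blk_read (slide layout).
path_read_share() {
    awk '$1 !~ /^Device/ && NF >= 5 { reads[$1] = $4; total += $4 }
         END {
             if (total > 0)
                 for (dev in reads)
                     printf "%s %.1f%%\n", dev, 100 * reads[dev] / total
         }' | sort
}
```

Shares close to 50% each are consistent with round-robin multipathing being active at the block layer, which is why the investigation moves on to the network side.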

(37)

Checking if there are really 2 iSCSI sessions

Storage Performance

# ls -alR /dev/disk/by-path/ | egrep "(sdk|sdj)"

ip-10.1.200.40:3260-iscsi-iqn.1992-08.com.netapp:MyNetapp.luns-lun-2 -> ../../sdk

ip-10.1.201.40:3260-iscsi-iqn.1992-08.com.netapp:MyNetapp.luns-lun-2 -> ../../sdj

(38)

Checking if different paths are really used

Storage Performance

# tcpdump -i any port 3260

# watch "ifconfig | egrep '(eth0 |eth1 )' -A5 | egrep '(eth|bytes)' "

eth0 Link encap:Ethernet HWaddr 00:1D:09:70:88:2C
RX bytes:1490076463 (1.3 GiB) TX bytes:170615419 (162.7 MiB)
eth1 Link encap:Ethernet HWaddr 00:1D:09:70:88:2E
RX bytes:1801238 (166 MiB) TX bytes:46695876 (44.5 MiB)

(39)

Checking source IP addresses for iSCSI sessions

Storage Performance

# netstat -at | grep iscsi

10.1.200.138:53049  10.1.200.40:iscsi-target  ESTABLISHED
10.1.200.178:46684  10.1.201.40:iscsi-target  ESTABLISHED

(40)

Checking kernel routing table

Storage Performance

# route

Destination Gateway Genmask Iface

10.1.200.0 * 255.255.255.0 xenbr0

10.1.200.0 * 255.255.255.0 xenbr1

default 10.1.200.1 0.0.0.0 xenbr0
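The problem in this routing table is that the same destination network is listed on two interfaces. A small filter can flag that condition; the sketch below assumes `route` output with the destination in the first column and the interface in the last, and `duplicate_route_destinations` is a hypothetical helper name.

```shell
# Read `route` output on stdin and print every destination network that
# is reachable through more than one interface.
# Assumption: column 1 = destination, last column = interface.
duplicate_route_destinations() {
    awk '$1 != "Destination" && $1 != "Kernel" && $1 != "default" && NF >= 4 {
             if (!(($1, $NF) in seen)) { seen[$1, $NF] = 1; count[$1]++ }
         }
         END { for (dest in count) if (count[dest] > 1) print dest }'
}
```

With the table above, `route | duplicate_route_destinations` would be expected to report 10.1.200.0, since both xenbr0 and xenbr1 sit on that subnet.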

(41)

Configuration of management interfaces in XenCenter

Storage Performance

Change ISCSI_2 to 10.1.201.78

(42)

Determining current performance on domain0

Storage Performance

# route

Destination Gateway Genmask Iface

10.1.200.0 * 255.255.255.0 xenbr0

10.1.201.0 * 255.255.255.0 xenbr1

default 10.1.200.1 0.0.0.0 xenbr0

(43)

Configuring kernel routing table

Storage Performance

...or (not recommended)

• Add to /etc/rc.local

# route add -host 10.1.200.40 xenbr0

# route add -host 10.1.201.40 xenbr1

• What about Pool Upgrade and Pool Join?

(44)

LinuxVM:~# hdparm -t /dev/xvdb

/dev/xvdb:

Timing buffered disk reads: 45 MB/sec

Well Done!

Determining current performance on VM

Storage Performance

(45)

Case study: Single-pathing

What we’ve learned

/dev/ locations for single and multi-path devices

# mpathutil status

# hdparm -t

# iostat

# ifconfig, # tcpdump, # netstat, # route

# watch

Best practices for iSCSI storage

(46)

Questions

(47)

First aid kit

Resources

• http://docs.xensource.com - XenServer documentation

• http://support.citrix.com/product/xens/ - Knowledge Center

• http://forums.citrix.com/support - Support forums

• http://community.citrix.com/citrixready/xenserver - XenServer Central (one-stop information center)

(48)

Before you leave…

• Session surveys are available online at www.citrixsynergy.com starting Thursday, 7 October. Provide your feedback and pick up a complimentary gift card at the registration desk

• Download presentations starting Friday, 15 October, from your My Organiser Tool located in your My Synergy Microsite event account

(49)
