StorNext labels gone? DON’T panic!

Topic

In SAN environments labels are often used to address a specific LUN, rather than using the device identifiers or numbers, to simplify the addressing of devices. Plus, a device does not have to be tied to a unique device number/id.

Losing a label due to a crash, a defective hardware part, or simply because a client overwrote it leaves the administrator in a scary and decidedly uncomfortable situation. Data can’t be accessed or could even be lost.

 

Situation

If you haven’t experienced a lost label on a specific LUN in a StorNext environment yourself, you have at least heard stories where the label magically disappeared. The result is that no client can mount the volume with the missing LUN and therefore has no access to it. No access to the data…? Where is the data…??? NO, don’t do this to me! PANIC…?!! Or not…?

Troubleshoot

For starters, you have to realize that it is not the LUN that is missing but the label. THAT is the root cause. Missing labels will cause Windows and OS X clients to drop the connection to the affected file system, although the LUN is actually still present. Linux, however, caches the labels and continues to work until the next reboot, which can lead to confusion. To be sure your problem is actually caused by a missing label, you can either reboot the Linux client or rescan the SCSI bus (e.g. scsi-rescan --forcerescan will rescan the fabric and drop cached LUNs, and therefore labels).

With the cache purged, the next command to run is “cvlabel”. Cvlabel has many options, but in my experience the most valuable one is “-c”, which reports the label, device name and serial number of each LUN. Yes, the -L (long) option will give you serial numbers as well, but the output of -c keeps the label name in the first column, which simplifies maintenance.

If the number of LUNs matches what you expected, you can confirm that the issue is caused by a missing label rather than a hardware problem (e.g. dead SFP, broken cable, etc.). The number of LUNs for the affected file system is defined in the StorNext file system configuration file. While it was simple to count the LUNs in the ASCII configuration, the XML-based configuration is a bit trickier. To identify the missing label, look for the keyword CvfsDisk_UNKNOWN in the output.

[root #12] cvlabel -c
data01 /dev/sdc 4678727647 EFI   # host 0 lun 0 sectors 4678727647 sector_size 512 inquiry [LSI     MR9260-16i     2.13] serial 600062B2005322C018CCC98B128FE95C
CvfsDisk_UNKNOWN /dev/sde 4678727647 EFI   # host 0 lun 0 sectors 4678727647 sector_size 512 inquiry [LSI     MR9260-16i     2.13] serial 600062B2005322C018CCC98B128FE38E
CvfsDisk_UNKNOWN /dev/sdf 4678727647 EFI   # host 0 lun 0 sectors 4678727647 sector_size 512 inquiry [LSI     MR9260-16i     2.13] serial 600062B2005322C018CCC98B128FE32C
CvfsDisk_UNKNOWN /dev/sdg 4678727647 EFI   # host 0 lun 0 sectors 4678727647 sector_size 512 inquiry [LSI     MR9260-16i     2.13] serial 600062B2005322C018CCC98B128FE3D1
meta /dev/sdb 217823199 EFI   # host 0 lun 0 sectors 217823199 sector_size 512 inquiry [LSI     MR9260-16i     2.13] serial 600062B2004DA2001938E48F0D6DCD5D
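
With many LUNs it can help to filter the listing down to the unlabeled devices and their serial numbers. The following is a minimal sketch, assuming the `cvlabel -c` column layout shown above (label first, device second, serial last); the function name `unlabeled_luns` is made up for this example.

```shell
# List device and serial for every LUN that cvlabel reports without a label.
# Reads "cvlabel -c" style output on stdin; the serial is the last field.
unlabeled_luns() {
    awk '$1 == "CvfsDisk_UNKNOWN" { print $2, $NF }'
}

# Typical use on a host with StorNext installed:
#   cvlabel -c | unlabeled_luns
```

Keeping the serial next to the device name matters because device names such as /dev/sde can change between reboots, while the serial stays with the LUN.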

Solution

There are three options for reapplying a label to one or more LUNs:

Option 1: Lucky

If you’re lucky, there is a fairly recent output of cvlabel available that includes a record of the lost label. Simply compare the serial number from the current cvlabel -c output with the archived list and reapply the label. How to easily label a LUN is explained further down.
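
The comparison by serial number can be scripted. This is a sketch under the assumption that both the archived and the current listing come from `cvlabel -c` and keep the serial as the last field; `match_labels` and the file arguments are hypothetical names.

```shell
# Print "device label" for each unlabeled LUN whose serial appears in an
# archived "cvlabel -c" listing. $1 = archived listing, $2 = current listing.
match_labels() {
    awk 'NR == FNR { if ($1 != "CvfsDisk_UNKNOWN") label[$NF] = $1; next }
         $1 == "CvfsDisk_UNKNOWN" && ($NF in label) { print $2, label[$NF] }' "$1" "$2"
}

# Example with assumed file names:
#   cvlabel -c > /tmp/current_labels
#   match_labels /root/cvlabel_archive.txt /tmp/current_labels
```

Matching on the serial rather than the device name avoids applying a label to the wrong disk when device names have shifted since the archive was taken.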

[indeed-social-locker sm_list=’fb,tw,li,,pt’ sm_template=’ism_template_3′ sm_list_align=’horizontal’ sm_display_counts=’false’ sm_display_full_name=’false’ unlock_type=2 locker_template=6 sm_d_text=’

The solution is locked, please honor the work!

Share This Page To Unlock The Solution!’ ism_overlock=’blur’ ]

Option 2: Not so lucky but still …

You are not lucky enough to have a record of all your labels. On the metadata controller (MDC), look into /usr/cvfs/debug/nssdbg.out. That file will (or should, at least) contain records of the last successful disk discovery. Search for the serial number of the affected LUN and you should see something like this:

[root #13] grep "600062B2005322C018CCC98B128FE38E" /usr/cvfs/debug/nssdbg.out
[0728 09:40:34] 0x2b8eec55f030 NOTICE PortMapper: CVFS Volume data00 on device: /dev/sdd (blk 0x890 raw 0x890) con: 0 lun: 0 state: 0x4 inquiry [LSI     MR9260-16i     2.13] controller # 'default' serial # 600062B2005322C018CCC98B128FE38E Size: 4678727647 Sector Size: 512

If you find the serial number in the debug file, you’re good and can proceed with re-applying the label to the LUN.
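
If the log holds several entries for the same serial, the most recent one is the one to trust. A minimal sketch, assuming the "CVFS Volume <label> on device:" line layout shown above and a log written in chronological order; `label_from_nssdbg` is a name invented for this example.

```shell
# Print the label from the last log line (read on stdin) that mentions the
# given serial. Assumes the "CVFS Volume <label> on device:" layout and that
# later lines are newer.
label_from_nssdbg() {
    awk -v serial="$1" '
        index($0, serial) && match($0, /CVFS Volume [^ ]+ on device/) {
            split(substr($0, RSTART, RLENGTH), w, " ")
            label = w[3]
        }
        END { if (label != "") print label }'
}

# Example:
#   label_from_nssdbg 600062B2005322C018CCC98B128FE38E < /usr/cvfs/debug/nssdbg.out
```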

The Quantum web forum and knowledgebase stop here and offer no (publicly published) solution for this situation, but there is at least one more that I know of. So if you didn’t have luck with the two previous options, read on.

Option 3: Last resort

This is the last possibility to fix the issue before you might want to consider calling for professional help.

If you have NO former record of your SAN labels, and for some reason the debug file is corrupted or the MDC doesn’t see all the data LUNs, you won’t find any record on the MDC. In general, the MDC can see the data LUNs, but it’s not a must, as the MDC only handles the metadata and therefore doesn’t really care whether the LUNs are presented to it or not.

Assuming that an attached client has just stamped over the label and written its own ID (Windows, OS X or Linux), the label is as good as gone. However, since operating systems follow the rules and usually do not exceed the 4096-byte mark for a disk signature, there is still hope of recovering the label name, simply because StorNext writes more than 4k when creating a label. Some may argue that Windows will write beyond that 4096-byte mark, which would mean that data could have been corrupted. Well, all I can say is that I’ve seen many labels being blown away and never came across data loss due to a Windows signature.

However, if you have exhausted all other options, all you can do is read the LUN header and read between the lines. Yes, not many people know this, but in many cases it IS possible to recover the label name from an overwritten LUN.

As the label names can be found more than once in the first 20KB of a LUN, we are going to raw read specifically that part of the LUN – and cross our fingers that we’ll find the label again somewhere. I am going to use the octal dump (od) command on Linux on the device sde. This is a read only process and won’t alter your LUNs in any way.

[root #14] od -N20000 -c /dev/sde | less
0004000   ;   Z 224   j 322 035 262 021 231 246 \b \0       s   f   1
0004020   e 303 242   ( 276 362 341 021 223   X \0   0   H 334   <   T
0004040 017   3 027 021 \0 \0 \0 \0 016   s 027 021 \0 \0 \0 \0
0004060 \0 \0 \0 \0 \0 \0 \0 \0   S \0   N \0   F \0   S \0
0004100   - \0   d \0   a \0   t \0  a \0   0 \0   0 \0   0 \0
0004120   0 \0   2 \0   0 \0   0 \0   0 \0   0 \0   0 \0 \0 \0
*
0042000   d   a   t   a   0   0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0
0042020  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0

Take a look at entries 0004100 and 0042000, where you can see your data00 entry. Since we know what we are looking for, I use grep for a more conclusive output.

[root #15] od -N20000 -c /dev/sde | grep -i "d   a   t"
0042000   d   a   t   a   0   0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0

Read another LUN

[root #16] od -N20000 -c /dev/sdf | grep -i "d   a   t"
0042000   d   a   t   a   0   2 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0

Read two or more additional LUNs and compare the labels. If you are confident that this is the correct label you are missing, go ahead and apply it.
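
When the label prefix is not known in advance, grepping the od output for a fixed pattern is not possible. As an alternative sketch, the printable runs in the same header region can be listed directly. This assumes the label is stored as plain single-byte text, as in the 0042000 line above; the function name and the minimum length of four characters are arbitrary choices for this example. Like od, it only reads from the device.

```shell
# List distinct runs of label-like characters (letters, digits, "_" and "-")
# of at least 4 characters found in the first 20000 bytes of a device or file.
candidate_labels() {
    head -c 20000 "$1" | LC_ALL=C tr -c '[:alnum:]_-' '\n' | awk 'length($0) >= 4' | sort -u
}

# Example (read-only, needs root for raw devices):
#   candidate_labels /dev/sde
```

Expect some noise in the list, since any byte sequence that happens to look like text is printed; comparing the output of several LUNs makes the label naming pattern easier to spot.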

 

Re-applying a missing label

After you have hopefully identified the missing label name, here’s a straightforward process to re-apply the labels.

  1. create a label file: cvlabel -c | grep UNKNOWN > /tmp/new_label
  2. edit the label file, leave only the one LUN in the file which should be re-labeled, and replace CvfsDisk_UNKNOWN with the recovered label name (data00 in this example)
  3. run cvlabel and apply the label: cvlabel /tmp/new_label

     /dev/sde [MR9260-16i     2.13] unknown  Controller 'default', Serial '600062B2005322C018CCC98B128FE38E', Sector Size 512, Sectors Max 4678727647
     Do you want to label it SNFS-VTOC - Name: data00 Sectors: 4678727647 (Y / N) -> y
     New Volume Label -Device: /dev/sde  SNFS Label: data00  Sectors: 4678727647.
    
    ..
    Done.  1 source lines.  1 labels.

The labels are done.
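
Steps 1 and 2 can also be combined so that no manual editing is needed. This is a sketch only, assuming the `cvlabel -c` layout shown earlier; `relabel_line` is a hypothetical helper, and the resulting file should still be checked by eye before it is passed to cvlabel.

```shell
# From "cvlabel -c" output on stdin, emit the single line for one device with
# the CvfsDisk_UNKNOWN placeholder replaced by the recovered label name.
# $1 = device, $2 = label name.
relabel_line() {
    awk -v dev="$1" -v name="$2" '
        $1 == "CvfsDisk_UNKNOWN" && $2 == dev { sub(/^CvfsDisk_UNKNOWN/, name); print }'
}

# Example:
#   cvlabel -c | relabel_line /dev/sde data00 > /tmp/new_label
#   cat /tmp/new_label     # verify before running: cvlabel /tmp/new_label
```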

That’s it. Now go ahead and mount the file system and run some thorough read tests. Just browsing is no indication that the data is healthy, so you have to do visual spot checks…!

If there is visible data corruption, don’t write anything to that file system and seek more advanced help.

Now we can PANIC!

 

Why does it happen anyway?

There are various, very common ways to blow away a LUN label. Very often, reinstalling a client while leaving the FC connection attached causes a LUN label to be overwritten.

Opening Windows Disk Management or the OS X Disk Utility to check how many devices your client sees can overwrite the labels as well. Even if you opt out by exiting the disk tool, the client will write its disk signature onto the foreign disk, which can cause the loss of the label.

Another way I’ve seen labels wiped is an automated Linux client installation through a kickstart file. If the kickstart file doesn’t specify that only one device should be used as the OS drive and that other devices should be ignored, Linux will go ahead and take all the disks it can find and stomp on them.

And the fourth kind is simply human error, aka PEBCAK. As all the LUNs appear as local disks to the clients, tools like sfdisk/fdisk, tar or a simple dd do not care about the device you are writing to if you have root permissions.

Follow these simple rules to avoid wiped labels:

  • unplug FC/SAS connections from the client before re-installing an OS
  • install StorNext software before adding the FC/SAS connection
  • keep a recent cvlabel output on the MDC for the “just in case” case
  • if you are working on a SAN-connected client, be careful with tools that could do any harm to your data LUNs
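
For the rule about keeping a recent cvlabel output, a small helper that stores dated snapshots is one way to do it. This is a sketch only; the function name and the target directory are assumptions, and the helper simply writes whatever it receives on stdin to a time-stamped file.

```shell
# Save a dated snapshot of label data read from stdin into a directory and
# print the path of the file that was written. $1 = target directory.
save_label_snapshot() {
    dir=$1
    mkdir -p "$dir" || return 1
    out="$dir/cvlabel-$(date +%Y%m%d-%H%M%S).txt"
    cat > "$out" && printf '%s\n' "$out"
}

# Example (directory is an assumption), e.g. run daily from cron on the MDC:
#   cvlabel -c | save_label_snapshot /var/backups/cvlabel
```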
  Anyone who chooses to use this procedure acknowledges that these steps are performed at their own risk and that I am not liable for any damage.
[/indeed-social-locker]

3 COMMENTS

  1. Ran Pergamin September 13, 2015 at 11:39 pm

    Great article Roger.

    Couple of important comments, I learned while recovering 2 x 300TB file systems last week:

    1. When using the nssdbg.out file, ensure that you take the latest (date/time) list of labels, because there may be old relabeled devices. I grep-ed the relevant lines, imported them to xls and took the latest.

    2. Labels are loaded on boot. If you have a client/MDCs mounted you can run cvadmin->disk and see the labels and paths. While this doesn’t have serials, if your paths are not changed, it can be another method to recover or at least verify.

    3. Another means is to use the file system configuration file that has the labels in it. Again, no serials, but if you did name your SAN LUNs to match your snfs labels (always good practice) this can really help you recover.

    • rbeck September 14, 2015 at 8:01 pm

      Thanks for your reply Ran, all good points.

      1. That’s a very good point, to look for the newest entries in the log. I hadn’t mentioned that in the article.

      2. Cvadmin shows you the labels indeed, but I think the issue is not that you don’t know the label name, rather that you don’t know which device had it. I see your point referring to the device and the label name here.

      3. The configuration file should be consulted every time, especially if you use label names which do not really carry the file system name. For example, I prefer to use label names like “san1_data00”; label names like “lun0, lun1 etc” aren’t helpful at all.

  2. Lance Gropper November 19, 2015 at 5:53 pm

    Hello Roger:

    What about a situation where the MDC sees the labels, but a Linux client does not?

    Lance
