Galaxy and ProtecTIER issues.

Today I came across some problems while configuring a Commvault Galaxy server to use Diligent’s Virtual Tape Facility, including ProtecTIER. I thought you’d be interested.

As far as I know, this applies to the Commvault Galaxy Media Agents for Windows.
Another criterion is that you have chosen to spread the drives across two or more Fibre Channel connections.

With a standard detection in Galaxy, the library device and tape units are properly detected, and the units can be automatically configured into the corresponding drive slots in the library. While this is the preferred method according to Windows/Galaxy users (clickity click :-) ), it does not result in the desired configuration.

The symptoms are as follows:
When doing tape I/O, Galaxy expects cartridges in certain tape units, but they are mounted in another unit, resulting in corresponding SCSI errors, like LOAD_OR_UNLOAD_FAILURES. In some cases, the exhaustive detection and validation also fail.

The reason:
When you define a new virtual library in the ProtecTIER GUI and choose to spread the drives across two or more channels, Diligent uses a naming scheme that assigns the even drive numbers to one path and the odd drive numbers to the other. I have not been able to check the naming scheme with more than two paths, but my best guess is that the problems will be similar.
The LUN IDs are numbered consecutively, from 0 (or 1) to n on one path, and from 0 (or 1) to n on the other path. Whether the LUN numbers start at zero depends on which port you configure the library controller device on. The library device is apparently always LUN ID zero.

For the example, let’s create 6 drives, spread across two paths (controllers).

Port 0 holds:

  • Library Device, LUN 0
  • Drive 0, LUN 1
  • Drive 2, LUN 2
  • Drive 4, LUN 3

Port 1 holds:

  • Drive 1, LUN 0
  • Drive 3, LUN 1
  • Drive 5, LUN 2
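
The layout above can be modelled in a few lines of Python. This is only a sketch under assumptions: the names `Device` and `protectier_layout` are made up for illustration, and the round-robin rule is inferred from the two-port example. How ProtecTIER behaves with more than two ports has not been verified.

```python
from typing import NamedTuple


class Device(NamedTuple):
    name: str
    port: int
    lun: int


def protectier_layout(num_drives, num_ports=2, library_port=0):
    """Model the example layout: drives alternate between ports, LUNs count up per port.

    Assumption (hypothetical model, not vendor code): the library device
    takes LUN 0 on its own port, so the drives on that port start at LUN 1.
    """
    next_lun = [0] * num_ports
    devices = [Device("Library Device", library_port, 0)]
    next_lun[library_port] = 1
    for drive in range(num_drives):
        port = drive % num_ports  # even drives on port 0, odd drives on port 1
        devices.append(Device(f"Drive {drive}", port, next_lun[port]))
        next_lun[port] += 1
    return devices


layout = protectier_layout(6)
for port in (0, 1):
    print(f"Port {port} holds:")
    for dev in layout:
        if dev.port == port:
            print(f"  {dev.name}, LUN {dev.lun}")
```

Running it for 6 drives reproduces the two lists above, which makes it easy to see what a larger library would look like before you start matching drives by hand.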

You could say this is a nice config, because the load will be spread across the two paths. Our experience tells us this indeed works well. I need to point out that this experience is based on TSM on AIX. TSM on AIX automatically detects which library drive slot a tape drive actually sits in. Galaxy messes this up, though.

In the above screenshot of the drives, the element address of the drive slot (called “Address” in the GUI) is listed in the last column.
When you take a close look at this column, you should see that this list is also consecutive, but always starts at Address 16. You should also notice that this list hops between paths. When I sort on the “Port” column, all the even-numbered Addresses are listed together, as are the odd-numbered Addresses. This tells me that the programmers at Diligent were at least consistent in their design and naming schemes.

This is apparently also what causes the problem for Galaxy.
During SCSI discovery, normal operation is to scan one bus at a time, resulting in a list of tape drives (LUNs) that is consecutive by port. During the discovery of the library device, the inventory of that library also tells us, or Galaxy in this case, that the library holds n tape slots, n drive slots, n drives and so on.

The drive slots are consecutive according to Galaxy, because the information Galaxy goes by is provided by the library device, and not by SCSI inquiry commands. When you use the Galaxy auto-config feature, Galaxy matches the drives to the drive slots in the order they are detected.

Thus,

  • \\.\tape0 (LUN 1 port 0) is matched to Drive Slot 1 (Address 16)
  • \\.\tape1 (LUN 2 port 0) is matched to Drive Slot 2 (Address 17)
  • \\.\tape2 (LUN 3 port 0) is matched to Drive Slot 3 (Address 18)
  • \\.\tape3 (LUN 0 port 1) is matched to Drive Slot 4 (Address 19)
  • \\.\tape4 (LUN 1 port 1) is matched to Drive Slot 5 (Address 20)
  • \\.\tape5 (LUN 2 port 1) is matched to Drive Slot 6 (Address 21)

But according to ProtecTIER, the drives and slots (Addresses) alternate between the ports.
So this is what it should look like:

  • \\.\tape0 (LUN 1 port 0) is matched to Drive Slot 1 (Address 16)
  • \\.\tape1 (LUN 2 port 0) is matched to Drive Slot 3 (Address 18)
  • \\.\tape2 (LUN 3 port 0) is matched to Drive Slot 5 (Address 20)
  • \\.\tape3 (LUN 0 port 1) is matched to Drive Slot 2 (Address 17)
  • \\.\tape4 (LUN 1 port 1) is matched to Drive Slot 4 (Address 19)
  • \\.\tape5 (LUN 2 port 1) is matched to Drive Slot 6 (Address 21)
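
The difference between the two lists can be worked out with a small Python sketch. It rests on assumptions: Galaxy's auto-config is modelled only as "n-th detected drive goes into the n-th slot", the ProtecTIER side as "slot follows the drive number, starting at Address 16", and the names `scan_order` and `slot_report` are hypothetical. It is not Galaxy or ProtecTIER code.

```python
FIRST_DRIVE_ADDRESS = 16  # first drive element address shown in the ProtecTIER GUI


def scan_order(num_drives, num_ports=2, library_port=0):
    """Return (port, lun, drive_number) tuples in bus-scan order: port first, then LUN."""
    # The library device takes LUN 0 on its own port, so drives there start at LUN 1.
    next_lun = [1 if port == library_port else 0 for port in range(num_ports)]
    found = []
    for drive in range(num_drives):
        port = drive % num_ports  # drives alternate between the ports
        found.append((port, next_lun[port], drive))
        next_lun[port] += 1
    return sorted(found)


def slot_report(num_drives):
    """Compare the slot picked by detection order with the slot the library reports."""
    rows = []
    for index, (port, lun, drive) in enumerate(scan_order(num_drives)):
        auto_slot = index + 1  # auto-config: n-th detected drive -> n-th slot
        real_slot = drive + 1  # library view: slot follows the drive number
        rows.append((f"tape{index}", port, lun, auto_slot, real_slot))
    return rows


for name, port, lun, auto_slot, real_slot in slot_report(6):
    address = FIRST_DRIVE_ADDRESS + real_slot - 1
    verdict = "ok" if auto_slot == real_slot else f"move to Drive Slot {real_slot}"
    print(f"{name} (LUN {lun} port {port}): auto Drive Slot {auto_slot}, "
          f"actual Drive Slot {real_slot} (Address {address}) -> {verdict}")
```

For the 6-drive example, only tape0 and tape5 land in the right slot by coincidence; the four drives in between all need to be moved.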

Below is a Galaxy screenshot of a partially configured tape library. Here you can also notice the odd/even numbering. In this screenshot I renamed the drive aliases to match the ProtecTIER naming of the drives. Elm is short for element, and indicates the Address column in the ProtecTIER GUI.

So, you need to match the drives and drive slots manually (by moving them in the Galaxy Library and Drive configuration tool).
Use the ProtecTIER GUI (Drives tree within the Library) to trace back the port number, LUN number and Drive Slot (Address).
Once you’ve tackled this problem, everything should work like a charm.

This post isn’t intended to be used as a manual, but merely as a reference for possible problems you might run into when trying to configure or troubleshoot the Galaxy configuration in combination with Diligent’s VTF/ProtecTIER. As most virtual libraries use the most common devices and methods, this post could possibly apply to other virtual libraries in combination with Galaxy.

I hope it is of some use to you.

About The Author

Ilja Coolen

31

01 2007

11 Comments

  1. Storagezilla

    Question.
    Why are you drive sharing a single library? Doesn’t the fact that you’re using a VTL allow you to create multiple libraries? And if you lost port 0, where the library device is located, won’t that make port 1 redundant, as the backup app won’t have any access to the “virtual robotics” for media moves?

    Yes you could just mask the devices on the dead path down the remaining path but aren’t you risking 100% failure of all your backup jobs until you’ve done that and rescanned your devices?

    In order to spread load wouldn’t it be a better idea to create an autochanger on each path and have your backup app manage which client streams get written to different media in each autochanger?

    That way, should a single path die on you, you’d only have to rerun a maximum of 50% of your backups if you’ve balanced across both autochangers?

    I’m just trying to understand why you’ve things set up the way you have.
    Licensing maybe?

  2. Storagezilla

    And my apologies, what you’re doing is Library Sharing down two paths to the same host, not drive sharing. I noticed that error as I was reading it back after it had been submitted. -rolls eyes-

  3.

    My apologies too.

    I should have explained the layout. But I am afraid it’s not drive sharing or library sharing. Neither Galaxy nor the Diligent appliance supports path failover of the drives. It’s all single-pathed, and a path failure could indeed result in loss of drives and library.

    The Diligent appliance supports clustering, but this is limited to the appliance itself. I can’t recall the exact specs, but the ProtecTIER cluster doesn’t protect us from path failures. Someone please correct me if I am wrong.

    By spreading drives across paths, I literally mean spreading. Drive 0 on path 0, drive 1 on path 1, drive 2 on path 0, drive 3 on path 1 and so on.
    If it were a path-failover design, the problems I described above would most probably not occur. Licensing does not play any part in this, as it is merely based on the number of FC ports you intend to use.

    I agree that this setup/design of virtual tape configurations isn’t the best with regard to availability. But to my (perhaps limited) knowledge, most virtual tape appliances available today don’t support drive path failover.
    Besides the appliance, the host OS/backup software also has to support multipathing of tape devices. I know of the AIX Atape drivers supporting 3590, LTO (and some other) drives using alternate pathing, because we use it, but I am not familiar with other implementations, although I am certain they exist.

    The Diligent VTL virtualizes DLT7000 drives, which by design don’t offer multipath attachments, as far as I know.

  4.

    PS: I am sorry about the messed up (too wide) screenshot above. I will try to make a new, smaller but still readable one.

  5.

    Re the wide screenshot – maybe leave out the “loaded” column

  6. tim

    IIRC, by the very design of tape there’s no way to have tape failover. Is there someone who actually does so? The only way I can see it happening is with a kludge of client-side and storage-side apps.

  7.

    Well Tim, we do multipathing to our IBM3590 drives. IBM's AIX Atape drivers support alternate pathing and failover and even loadbalancing.
    We don't use loadbalancing though.

    For 3590 drives: Read chapters 6, 19, 20 for multipathing and loadbalancing in this manual.

    ftp://ftp.software.ibm.com/storage/devdrvr/Doc/IBM_TotalStorage_tape_IUG.pdf
    Chapter 44 goes into the Windows multipathing.

    For LTO (Ultrium) drives: Check chapter 7 in this manual for some details.
    ftp://ftp.software.ibm.com/storage/devdrvr/Doc/IBM_ultrium_tape_IUG.pdf
    Chapter 43 contains info on path failover for Windows.

    These examples apply to the IBM drives and drivers only, but I am sure the industry's competition has, or is working on, similar solutions.

  8. Storagezilla

    Aha, it’s clearer when I read it without having stared at a display for 16 hours beforehand. :)

    I’m still wondering why you chose to go with just one virtual autochanger?

  9.

    We use only one autochanger, because we share one large virtual library across two TSM servers and some storage agents. This works very well for us. We made a second library for our Galaxy environment. Those two don’t work well together in one library. At the moment, dual changers in one (virtual) Diligent library are not an option.
    In my opinion, the Diligent solution is not the one to put into an environment where the highest availability is required. Our TSM tape backend doesn’t require this level of availability, so it works for us.

    When looking at path failures, the most common cause is SFP or cable failure. I believe second place goes to human error (e.g. zoning).
    These kinds of failures are quite easy to overcome: recable, rezone, or replace the failing SFP. When a switch fails, you should be able to plug the link into a free port of a remaining (surviving) switch, which one should always have. Reassigning drives and changers to another port is also fairly easy (in Diligent’s case).

    I guess this post triggered the issue of multipathing on (virtual) tape drives, which is a very interesting subject.

  10. Storagezilla

    It did and it’s something I’ve long wondered as to why we haven’t seen more of it. A lot of people look to negate single path failures by going D2D2T and multipathing the disk target where the data first lands before it’s moved off to tape. Virtual or otherwise.

    Being a TSM shop you’re probably using disk pools anyway to deal with TSM’s lack of multiplexing. Though if TSM used multiplexing reclamation & restores would probably take months. ;)

  11.

    My experience with TSM and conversations with other TSM users have taught me that disk pools in TSM are used so that you don’t need something like 600 tape drives to cope with the backup load.
    The disk pools were not designed to cope with the problem that clients did not have the capacity to stream data to tape.

    We run about 500 backup sessions at any given moment during peak hours. Even when using multiplexing, we’d need a “shed” load of tape drives. Even in the case of virtual drives, this would exceed many shops’ budgets.

    Storagezilla: I wasn’t able to figure out why you think that TSM would have so much trouble if it were using multiplexing. I’d like you to explain that to me.