Poisoned ARP Entries

Posted on August 20th, 2012 in Uncategorized

I ran into an interesting issue while at a customer site last week that I had never run into before. The setup:

  • Unified VNX5500 array with dual control stations and no IP aliasing set up.
  • Proxy ARP was configured to advertise the IP addresses of the storage processors.
  • The customer had assigned storage processor B (SPB) an IP address that was later given to a vCenter VM instance, which made SPB unreachable over the network.
  • The SPB IP was reachable from the control station and from a directly attached host, but not over the network.
  • Running /nas/sbin/clariion_mgmt -recover did not resolve the issue.

Proxy What?

For integrated Unified VNX arrays, EMC uses proxy ARP to allocate IP addresses to the storage processors even though they are not directly connected to the customer network. Instead, they are cabled to the management controllers on the Data Mover blades as shown below.

VNX5500-MGMTCables

In turn, the management controllers are connected to the control stations which then have connectivity to the customer network through the MGMT connections.

VNX5500-MGMTCablesToCS

Cabling the array in this manner means that in a single control station configuration only a single Ethernet cable needs to be run to the system for management instead of three.

Solution

It seemed that a poisoned ARP entry was to blame, but we had no direct access to the switches, so we could not confirm this from the network side. To test the theory, we forced the control station to send a gratuitous ARP for the IP of the unreachable storage processor, and that resolved the issue!

To issue a gratuitous ARP from the control station use the following commands:

echo 1 > /proc/sys/net/ipv4/ip_nonlocal_bind
arping -c 5 -A -I eth3 SP_IP_ADDR
echo 0 > /proc/sys/net/ipv4/ip_nonlocal_bind
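For readers curious what that arping call puts on the wire, here is a minimal Python sketch that builds a gratuitous ARP reply frame by hand. It only constructs the bytes and does not send anything. The function name is hypothetical, the MAC and IP are example values, and the choice to fill the target hardware address with the sender's own MAC is an assumption (implementations differ on that field).

```python
import socket
import struct

def gratuitous_arp_reply(sender_mac: str, ip_addr: str) -> bytes:
    """Build an Ethernet frame carrying a gratuitous ARP reply for ip_addr."""
    mac = bytes.fromhex(sender_mac.replace(":", "").replace("-", ""))
    ip = socket.inet_aton(ip_addr)
    broadcast = b"\xff" * 6
    # Ethernet header: broadcast destination, our source MAC, EtherType 0x0806 (ARP)
    eth = broadcast + mac + struct.pack("!H", 0x0806)
    # ARP header: hardware type Ethernet (1), protocol IPv4 (0x0800),
    # hardware length 6, protocol length 4, opcode 2 (reply)
    arp = struct.pack("!HHBBH", 1, 0x0800, 6, 4, 2)
    # Gratuitous ARP: sender and target protocol address are the same IP.
    # Target hardware address is set to our own MAC here (an assumption).
    arp += mac + ip + mac + ip
    return eth + arp

frame = gratuitous_arp_reply("00:1b:21:d2:8e:72", "172.16.1.42")
print(len(frame), frame.hex())
```

Any switch or host on the segment that sees a frame like this refreshes its mapping for that IP, which is why it clears a stale entry.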

Since our clients were being routed to the network with the VNX, we could not check the ARP table of the hosts we were using. In a normal setup you should see all three addresses (control station, SPA and SPB; four with a dual CS setup) resolve to the SAME MAC address, provided you are observing from the same L2 segment with no router between you and the array.

C:\>arp -a

Interface: 172.16.1.10 --- 0xc
  Internet Address      Physical Address      Type
  172.16.1.1            44-2b-03-57-d8-42     dynamic
  172.16.1.30           00-50-56-9c-22-4f     dynamic
  172.16.1.31           00-50-56-ae-00-31     dynamic
  172.16.1.32           00-50-56-ae-00-32     dynamic
  172.16.1.35           00-60-16-36-bb-ff     dynamic
  172.16.1.40           00-1b-21-d2-8e-72     dynamic
  172.16.1.41           00-1b-21-d2-8e-72     dynamic
  172.16.1.42           00-1b-21-d2-8e-72     dynamic

Note: This should work the same even if IP aliasing is used since the same MAC address is assigned to both logical interfaces.
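If you would rather not eyeball the table, a small Python sketch along these lines can group the entries by MAC address and show which IPs share one. The function name is hypothetical and the parsing assumes the three-column Windows `arp -a` layout shown above.

```python
from collections import defaultdict

def ips_by_mac(arp_output: str) -> dict:
    """Group IP addresses by MAC from Windows-style 'arp -a' output."""
    groups = defaultdict(list)
    for line in arp_output.splitlines():
        fields = line.split()
        # Entry rows look like: <ip> <mac with five dashes> <type>
        if len(fields) == 3 and fields[1].count("-") == 5:
            groups[fields[1].lower()].append(fields[0])
    return dict(groups)

sample = """\
  Internet Address      Physical Address      Type
  172.16.1.1            44-2b-03-57-d8-42     dynamic
  172.16.1.35           00-60-16-36-bb-ff     dynamic
  172.16.1.40           00-1b-21-d2-8e-72     dynamic
  172.16.1.41           00-1b-21-d2-8e-72     dynamic
  172.16.1.42           00-1b-21-d2-8e-72     dynamic
"""

# Keep only MACs that answer for more than one IP
shared = {mac: ips for mac, ips in ips_by_mac(sample).items() if len(ips) > 1}
print(shared)
```

With a healthy proxy ARP setup the array's management addresses should land in a single group.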

~JediMT

Flash in the VNX – Part 1 – Introduction

Posted on August 11th, 2012 in Flash 1st

Without a doubt, the topic that interests me the most and one that generates many calls to my phone is the use of flash technologies in the VNX in the form of FAST VP and FAST Cache. Everyone who comes calling knows that these technologies can help them, but they have a hard time quantifying how and why they are helpful.

What’s the problem here?

From a practical standpoint, there are two issues that EMC is attacking with flash in the array.

  1. Total cost of ownership (TCO)
  2. Performance

To address these two issues, EMC implemented flash technologies in such a way that they can be used both as a user addressable storage device and as a type of extendable cache for the array.

Flash as a TCO Tool

When addressing the TCO problem, Flash is implemented in the form of FAST VP, which is a form of dynamic storage tiering that allows pools of storage to be created from a population of Flash, SAS and NL-SAS devices. In effect this allows the array to automatically place frequently accessed data in the pool on Flash drives and less frequently accessed data on the SAS and NL-SAS devices. It does this by moving 1GB “slices” from one tier to another.
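To make the idea concrete, here is a toy Python sketch of activity-based tiering: rank the slices by how busy they are and fill the fastest tier first. This is only an illustration of the concept, not EMC's actual relocation algorithm, and the function name, slice IDs and tier sizes are all made up.

```python
def place_slices(slice_activity, tier_capacity):
    """Assign slices to tiers, busiest slices to the fastest tier first.

    slice_activity: {slice_id: io_count}
    tier_capacity:  list of (tier_name, slices_that_fit), fastest tier first
    """
    ranked = sorted(slice_activity, key=slice_activity.get, reverse=True)
    placement = {}
    start = 0
    for tier, capacity in tier_capacity:
        for slice_id in ranked[start:start + capacity]:
            placement[slice_id] = tier
        start += capacity
    return placement

# Hypothetical I/O counts for five 1GB slices
activity = {"s0": 900, "s1": 15, "s2": 400, "s3": 2, "s4": 60}
tiers = [("flash", 1), ("sas", 2), ("nl-sas", 2)]
print(place_slices(activity, tiers))
```

The busiest slice ends up on flash and the coldest ones sink to NL-SAS, which is the behavior described above.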

FAST-DataActivity

A common example of where FAST VP helps in this case is with relational databases like SQL and Oracle. Prior to FAST VP, to get ‘good’ performance for a large database the storage administrator had to allocate a relatively small area from a large number of drives of a single type to spread the database out over many disks. This practice is generally referred to as ‘short stroking’.

For example, a 2TB database that requires 15,000 IOPS might have to be carved out of somewhere around 210 15K drives of 300GB each, assuming a 50:50 read:write mix and RAID 5 protection. [I’m excluding the math for the intro, but will get into it in the detail sections in follow on articles]

That is a lot of disks! The greater issue is that the database demands 100% of the I/O ability of those 210 drives but only about 4% of their capacity (2000 GB out of the 46,704 GB allocated to the database).
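For the impatient, a rough Python sketch of how a drive count in that neighborhood can be reached. It assumes a RAID 5 write penalty of 4 back end I/Os per host write and roughly 180 IOPS per 15K drive. Both are common rules of thumb and are assumptions here, not figures taken from this article, and the function name is hypothetical.

```python
import math

def drives_for_iops(host_iops, read_fraction, write_penalty, iops_per_drive):
    """Return (back end IOPS, drives needed) for a given host workload."""
    reads = host_iops * read_fraction
    writes = host_iops * (1 - read_fraction)
    # Each host write costs several back end I/Os on parity RAID
    backend_iops = reads + writes * write_penalty
    return backend_iops, math.ceil(backend_iops / iops_per_drive)

# 15,000 host IOPS, 50:50 read:write, RAID 5 (penalty 4), ~180 IOPS per drive (assumed)
backend, drives = drives_for_iops(15000, 0.5, 4, 180)
print(backend, drives)

# Capacity actually used by the 2TB database out of the space allocated
print(round(2000 / 46704 * 100, 1), "% of capacity used")
```

Under those assumptions the workload needs 37,500 back end IOPS and about 209 drives, which lines up with the "somewhere around 210" figure.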

An interesting property of a lot of databases is that the commonly accessed tables collectively consume a relatively small amount of the total capacity of the database. For instance, that 2TB database may only access 5% of its table space on a regular basis. Another way of looking at that is the database may do 95% of its I/Os in only 5% of its capacity. EMC calls this relationship between capacity and I/O density ‘skew’.
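A quick Python sketch of what that skew means for the example database. The function name is hypothetical; the inputs are the 2TB, 15,000 IOPS, 95% and 5% figures from the text.

```python
def skew_split(capacity_gb, host_iops, hot_capacity_pct, hot_io_pct):
    """Split a workload into hot and cold regions and report I/O density."""
    hot_gb = capacity_gb * hot_capacity_pct / 100
    hot_iops = host_iops * hot_io_pct / 100
    cold_gb = capacity_gb - hot_gb
    cold_iops = host_iops - hot_iops
    return {
        "hot_gb": hot_gb,
        "hot_iops": hot_iops,
        "cold_gb": cold_gb,
        "cold_iops": cold_iops,
        "hot_iops_per_gb": hot_iops / hot_gb,
        "cold_iops_per_gb": cold_iops / cold_gb,
    }

split = skew_split(2000, 15000, hot_capacity_pct=5, hot_io_pct=95)
for name, value in split.items():
    print(name, round(value, 2))
```

The hot 100 GB carries 14,250 IOPS while the remaining 1,900 GB sees only 750, so the two regions want very different kinds of drives.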

FAST VP helps take workloads with high skew rates and distribute the high I/O requirements of the workload to faster (flash) drives and the high capacity requirements to larger drives (SAS and NL-SAS). That same 2TB database with a 95% skew rate may then be placed on only 15 small flash drives and 25 SAS drives and achieve better performance and capacity utilization!

In that example FAST VP lowers TCO on the CAPEX side by cutting the disk count from 210 to 40 and eliminating the cost of the additional DAEs. It also helps OPEX by trimming the array’s footprint in the datacenter and reducing the power and cooling required. When leveraged correctly, FAST VP can dramatically reduce cost by making sure that the right data lives on the right type of storage over time.

Flash as a Performance Booster

If FAST VP asks the question “How can I best allocate a workload over time to minimize cost and improve performance?” then FAST Cache asks “What can I do to improve performance of the entire array in real time?”.

The goals of FAST Cache are:

  • To extend the functionality of the DRAM cache by mapping frequently accessed data to Flash drives, which are an order of magnitude faster than HDDs.
  • To provide a much larger, scalable cache by virtue of using Flash drives that can provide data capacities up to 200GB per device.
  • To improve the benefits of write hits, write coalescing, and write ordering by deferring host writes destined for the HDDs as long as possible.
  • To decrease the response time of HDDs to read cache misses by managing workloads through buffering in cache.

FAST Cache is configured as an array wide resource that can be enabled on both traditional RAID groups as well as storage pools. When enabled, any 64KB ‘chunk’ of data that is accessed more than two times in a short period is asynchronously copied from the source SAS or NL-SAS drives to the Flash drives that make up the FAST Cache. Subsequent reads or writes to these promoted chunks are serviced from the FAST Cache flash drives, not the source drives. [Note: If using FAST Cache in conjunction with FAST VP, data that exists on a flash tier will not be promoted to FAST Cache]
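The promotion rule can be sketched in a few lines of Python. This is a simplified model and not how the array implements it: the class name is hypothetical, the "short period" time window is ignored, and the copy to flash is treated as complete by the next access.

```python
from collections import Counter

CHUNK_SIZE = 64 * 1024   # FAST Cache tracks 64KB chunks
PROMOTE_AFTER = 2        # promote once a chunk is accessed more than two times

class PromotionTracker:
    def __init__(self):
        self.hits = Counter()   # access counts for chunks not yet promoted
        self.promoted = set()   # chunk numbers currently in FAST Cache

    def access(self, byte_offset):
        """Record one I/O; return True if it was served from a promoted chunk."""
        chunk = byte_offset // CHUNK_SIZE
        if chunk in self.promoted:
            return True
        self.hits[chunk] += 1
        if self.hits[chunk] > PROMOTE_AFTER:
            # The triggering I/O is still served from the source drives
            self.promoted.add(chunk)
            del self.hits[chunk]
        return False

tracker = PromotionTracker()
print([tracker.access(4096) for _ in range(4)])
```

The first three accesses to the chunk come from the source drives and the fourth is served from flash.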

In a nutshell, FAST Cache allows the use of very fast Flash drives as a type of ‘extended’ cache for the array. For each array the supported FAST Cache sizes are listed below.

FASTCache-MaxSize

This allows us to create cache structures that are many times larger than what the embedded array DRAM cache can provide. In the case of the VNX7500 the max FAST Cache is 2.1TB, compared to a max DRAM write cache of just over 14 GB with no enhanced data services such as FAST VP or compression installed. That much cache allows the array to be much more intelligent about how it orders data before flushing it to disk, which makes the entire array perform better.

The impressive thing about FAST Cache is that it can be added non-disruptively (or disabled or resized) to an existing configuration and provide immediate performance improvements. The one caveat I would offer is that when the FAST Cache is created the write cache is temporarily disabled and resized to accommodate the memory overhead of the FAST Cache feature. This is discussed in more detail here. This will impact performance of the array while the FAST Cache is being created. After the FAST Cache has been created check to make sure the read cache isn’t zeroed out.

Richard Anderson created an excellent write up of how effective FAST Cache and FAST VP can be with real world data from a customer in his blog at storagesavvy.com. Check it out here: http://storagesavvy.com/2011/03/26/real-world-emc-fastvp-and-fastcache-results/

Another good example can be found here: http://sudrsn.wordpress.com/2011/03/19/storage-efficiency-with-awesome-fast-cache/

Wrap Up

With that brief introduction behind us the next couple of articles are going to get into the meat of how to really leverage these technologies to their fullest.

~JediMT

Setting your read and write cache sizes

Posted on July 16th, 2012 in Best Practice

When configuring an array for optimal performance, one of the most fundamental things to get right is the cache configuration. Every I/O in the VNX system flows through the DRAM cache, so misconfiguring it can have negative effects on the performance of the workload running on top of it.

The DRAM cache on the VNX can be divided up into a read cache and a write cache. The read cache is configured independently for each SP and is not shared between them. The write cache is configured for both SPs together and is mirrored between the two SPs as shown in the graphic below.

VNX_cache

Configuring the read cache

The recommended initial read cache size is 100 MB for the block only VNX5100, and for the unified systems it ranges from 400 MB to 1024 MB as shown in the table below.

Array Model                           VNX5100   VNX5300   VNX5500   VNX5700   VNX7500
Recommended Initial Read Cache Size   100 MB    400 MB    700 MB    1024 MB   1024 MB
 
The recommended read cache sizes are proposed as starting points and can be adjusted up or down depending on the type of workload that will be serviced by the array. The minimum recommended read cache size for any unified VNX system is 256 MB per storage processor. [Note: This minimum read cache is being changed to 200 MB in the upcoming best practices guide from EMC but is not yet published]
 
SPMemory
 
The read cache is most effective when the majority of front end I/Os are reads and are sequential in nature. If the workload is 50:50 and there are sequential read streams in the dataset, the initial recommendation may work well. In virtualized environments with many hosts doing I/O to the same file system, the workload tends to be much more random, and the read cache becomes less effective and therefore less relevant.
 
To see how effective the read cache is for a workload, look at the “SP Cache Read Hit Ratio” for any traditional LUN in Analyzer. If the counter is consistently above 80%, the read cache is being very effective and there may be gains from making a small increase in the read cache size to satisfy more read requests from read cache.
 
However, if the number is very small then the workload for that LUN may not be read cache friendly. To help confirm, look at the “used prefetches %” counter for the same LUN. If this is also low, the array is not getting read requests for the data it is proactively prefetching from the LUN into the read cache when it thinks it has detected a sequential read stream. In this case, read cache is not helping much and the prefetching activity is only resulting in wasted read activity on the LUN.
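The two checks above boil down to a simple decision, sketched here in Python. The 80% figure comes from the text, but the cutoff for "very small" is not given anywhere, so the 20% default below is purely an assumption, as are the function name and the return labels.

```python
def assess_read_cache(read_hit_ratio_pct, used_prefetches_pct, low_threshold_pct=20.0):
    """Classify how useful read cache is for one LUN from two Analyzer counters.

    low_threshold_pct is an assumed stand-in for "very small"; tune to taste.
    """
    if read_hit_ratio_pct > 80.0:
        return "effective"
    if read_hit_ratio_pct < low_threshold_pct and used_prefetches_pct < low_threshold_pct:
        return "not read cache friendly"
    return "inconclusive"

print(assess_read_cache(88.0, 75.0))
print(assess_read_cache(6.0, 9.0))
print(assess_read_cache(45.0, 60.0))
```

Feed it averages over a representative window rather than a single sample, since the text talks about the counter being consistently high or low.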
 
Note: If you don’t see the options for “used prefetches %” or “SP Cache Read Hit Ratio” in Analyzer then go to the “customize charts” option in Analyzer and check the “Advanced” checkbox in the General tab. The counters should now show up for LUNs created from RAID Groups. LUNs created from pools do not have these counters because of how the stats are compiled against private RAID groups which are not visible through Analyzer.
 
Analyzer-Chart
 
If the majority of the LUNs in the system show similar behavior and the read cache is comparatively large, it may make sense to reallocate some memory to the write cache. Even though it’s named a write cache, reads can be serviced out of it, so if the workload tends to write data and then read it back again quickly, most reads are probably being serviced from the write cache, not the read cache!

 

Configuring write cache settings

The write cache has more bearing on array wide performance than any other single feature and its configuration and continued care and feeding are essential to maximizing the performance potential of the system. In general, the larger the size of the write cache, the better the potential performance of the array is. There are exceptions to this, but exploring them is beyond the scope of this article.

The size of the write cache increases with the “size” of the array, so a VNX7500 would have a larger write cache size than a VNX5300. However, you may run into systems where smaller arrays have a larger write cache size than a larger array. “How can this be” you ask?

The table below shows how this can be true in some circumstances. When some features are enabled on the array the amount of DRAM reserved for the storage processors increases. As the storage processor reserved DRAM pool gets larger, the rest of the system cache, of which write cache is a component, must get smaller. So by looking at the table below you could see how a fully featured VNX7500 could have a smaller write cache than a VNX5700 with no advanced features enabled.

VNX write cache sizes

The features that require additional DRAM to operate are FAST, FAST Cache, thin provisioning and compression. Installing any of them lowers the available write cache in the array, so they should only be installed if there are definite plans to use them. SnapView and MirrorView also require some system memory, but EMC is moving away from both of those technologies in favor of Advanced Snaps in release 32 (Inyo) and RecoverPoint.

The net takeaway is that the write cache should be configured for as large a value as possible after allocating DRAM for the read cache and installed features. The other settings that require some attention are the watermark settings.

Write cache watermarks

The cache watermark settings exist to help the system manage write cache flushing. The goal is to minimize write cache forced flushing and maximize write cache hits by controlling how aggressively and how often the array flushes data from the write cache. Aggressive flushing allows the array to respond well to very bursty I/O patterns at the cost of cache re-hits which can make the cache somewhat less effective. Lazy flushing allows for higher cache re-hits which can make the cache more effective, but also allows less headroom for bursty I/O and can lead to forced flushing.

So what exactly do the watermarks control?

System_properties_watermarks_cache

When the percentage of dirty pages is below the low water mark (LWM), the array is not flushing data out of the write cache. Typically, you will not see prolonged periods where dirty pages < LWM. Usually at this stage we are building up the cache and this happens very quickly in production work load scenarios, but may take a bit of time in limited benchmark activity.

When a LUN is idle (no I/O for two seconds) the system will commit dirty pages in the write cache to the LUN. This is referred to as idle flushing and runs as a normal background activity on the array.

During steady state load the percentage of dirty pages should float between the high water mark (HWM) and LWM. When the cache fills up to the HWM the array starts flushing pages to disk till it reaches the LWM. This is a more aggressive flush than idle flushing.

The margin above the HWM exists to absorb bursts of I/O to the array. Setting the HWM lower in effect gives the array more “reserve” memory for these bursts. Once the percentage of dirty pages exceeds the HWM, the array starts aggressively flushing dirty pages down to the back end disks. During this period the SP performance is minimally affected.

When a write request is received and cache is already full, a forced flush is triggered to write the pages of the destination LUN receiving the current request to disk to free up cleared pages in the write cache to receive the I/O. Even well designed systems can have some forced flushes from time to time, but sustained forced flushing activity will hamper system performance. Any LUN showing a sustained rate of more than 30 forced flushes a second should be analyzed.
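The behavior described above can be modeled as a small state machine with hysteresis between the two watermarks. The Python sketch below is a simplification for illustration only: the function name and mode labels are made up, the dirty page samples are hypothetical, and idle flushing is treated as always running in the background.

```python
def flush_modes(samples, lwm=60, hwm=80):
    """Label each dirty-page sample (in percent) with the flushing mode in effect."""
    modes = []
    flushing = False  # True once the HWM was hit and the LWM not yet reached
    for dirty_pct in samples:
        if dirty_pct >= 100:
            mode, flushing = "forced", True        # cache full
        elif dirty_pct >= hwm:
            mode, flushing = "watermark", True     # start flushing toward LWM
        elif dirty_pct <= lwm:
            mode, flushing = "idle-only", False    # back at or below the LWM
        else:
            # Between the watermarks: keep doing whatever we were doing
            mode = "watermark" if flushing else "idle-only"
        modes.append(mode)
    return modes

trace = [50, 70, 82, 75, 61, 59, 70, 100]
print(list(zip(trace, flush_modes(trace))))
```

Note how 70% dirty pages means different things depending on whether the cache is filling up or draining back down.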

Setting the watermarks

The default watermark values in VNX OE 31 and later are 80 for the HWM and 60 for the LWM. Prior to release 31 the defaults were 70/90 (LWM/HWM) for unified systems. To break this down further, look at the graphic below, which shows the “SP Cache Dirty Pages %” counters per SP. I’ve added three lines to the graph to show the LWM (green line – 70%), HWM (orange line – 90%) and cache full (red line – 100%).

watermark

In a “well behaved” environment where the I/O is more or less constant the objective is to keep the LWM and HWM set relatively high. This slows the rate at which the array flushes dirty pages to disk. By allowing data to live in the cache longer the array is able to maximize the chances that it can coalesce smaller front end I/Os into larger back end I/Os, thus making the backend more efficient. This is particularly important for parity RAID types like RAID 5 and RAID 6. Full stripe writes FTW! … another time and another article.

For the vast majority of the time slices in the above graph, the SP dirty pages are hovering between the LWM and HWM, but there are excursions above the HWM to about 95% dirty pages. This is usually indicative of either a bursty workload or a situation where we are driving the back end disks a little too aggressively. In this particular example, it would be wise to figure out which is the case before adding additional load to this array; lowering the watermarks to 60/80 would also be advisable.
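If you export the dirty page counter from Analyzer, a few lines of Python can put numbers on what the graph shows. The sample values below are hypothetical and are not the data behind the graph, and the function name is made up.

```python
def watermark_summary(samples, lwm, hwm):
    """Summarize where dirty-page samples (in percent) sit relative to the watermarks."""
    total = len(samples)
    within = sum(1 for s in samples if lwm <= s <= hwm)
    above = sum(1 for s in samples if s > hwm)
    return {
        "pct_within": 100 * within / total,
        "pct_above_hwm": 100 * above / total,
        "peak": max(samples),
    }

# Hypothetical "SP Cache Dirty Pages %" samples against 70/90 watermarks
samples = [72, 78, 85, 88, 95, 83, 76, 91, 87, 80]
print(watermark_summary(samples, lwm=70, hwm=90))
```

A small share of samples above the HWM is normal; a large or growing share is the cue to look at burstiness and back end load.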

In general, a safe convention is to leave the watermarks at the default of 60/80 unless the array is servicing a very bursty workload, in which case the watermarks can be lowered to 50/70, or to 40/60 in some extreme cases.

~JediMT

System drives in the VNX/Clariion

Posted on July 14th, 2012 in Storage Architecture

A couple days ago, I had a partner ask about the system drives in the VNX; specifically about how they should be used, if at all. We had a spirited conversation and I’d like to share some of what I shared with him.

CX4-Private-Space-Vault
Private space layout for Clariion CX4 system drives

As you can see from the graphic above, the system drives (vault drives from here on out) have quite a bit going on. The most important items on the vault drives are the boot partitions for the storage processors, the persistent storage manager (PSM) LUN, the FLARE database set and the cache vault.

The private space for the VNX systems is similar except it uses 4 drives instead of the 5 drives on the CX systems. If you happen to have a unified VNX system the control LUNs for the File OE also live on the vault drives by default.

Considerations for using the vault drives

When configuring the vault drives here are a couple of things to consider.

  1. Configure a RAID 5 4+1 RAID group across the vault drives and bind a LUN to fill up the unused space, even if no data will ever reside on them. This LUN doesn’t have to be presented to any host. The reasoning here is that the “SNiFFER” background verify only runs on disks allocated in a LUN. If the drives are left unbound then SNiFFER will not run a background verify against the drives and media errors on the vault drives could be left undetected until an actual drive failure occurs. [emc224729]
  2. If user data will be on the drives, configure a RAID 5 4+1 RAID group, carve up two LUNs and balance them across the storage processors. Additionally, make sure that the vault drives are doing no more than 100 IOPS each when preparing for an NDU operation, and under steady state load less than 150 IOPS each if using 15K SAS/FC drives or less than 120 IOPS each if using 10K SAS/FC drives. [emc79630]
  3. Do not bind RAID groups or LUNs across the vault drives and other drives in the system. Also, in general, it’s a good idea not to bind LUNs or FAST/FASTCache objects that cross DAE0 and any other DAE. If there is a total system outage (power failure) any LUN bound across DAE0 may be temporarily inaccessible when the system comes back online.
  4. If you are a VMware customer, this is a good spot to drop ISO image files you may use from time to time for deploying and patching VMs.
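The IOPS guidance in item 2 is easy to turn into a quick check. The Python sketch below uses the limits quoted in that item; the function and constant names are hypothetical, and the measured per-drive IOPS would come from your own Analyzer data.

```python
STEADY_STATE_LIMIT = {"15K": 150, "10K": 120}  # stay below these under steady load
NDU_LIMIT = 100                                # no more than this before an NDU

def vault_drive_ok(iops_per_drive, drive_type, preparing_for_ndu=False):
    """Return True if a vault drive's load is within the guidance above."""
    if preparing_for_ndu:
        return iops_per_drive <= NDU_LIMIT
    return iops_per_drive < STEADY_STATE_LIMIT[drive_type]

print(vault_drive_ok(130, "15K"))                          # steady state, 15K drive
print(vault_drive_ok(130, "10K"))                          # steady state, 10K drive
print(vault_drive_ok(130, "15K", preparing_for_ndu=True))  # ahead of an NDU
```

The same load can be fine day to day and still be too high to start an NDU, so check both before scheduling an upgrade.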

Hope y’all find this useful when planning a storage configuration. Courteous comments welcome.

~JediMT