All posts by Nicolas Michel

Recover a RAID5 Array on Linux with healthy disks

Intel Atom failures

I know the title sounds a bit weird, and you may ask why you would need to recover a RAID5 array when all your disks are healthy, right?

To understand what is going on: my DS1515+ has an Intel Atom C2538 (source: Synology CPU / NAS Type). This CPU recently caused a lot of issues in the IT industry (remember the Cisco clock issue? 🙂).

Erratum AVR54 in the C2000 specification update clearly states the following: “system may experience inability to boot or may cease operation”. My NAS started rebooting regularly and then crashed completely before I could back up the last delta of data.

In the first instance, Synology denied any abnormal failure rate on this specific hardware while admitting a flaw (!). Synology then extended the warranty of all the NAS platforms affected by this hardware flaw.

 

Recovering the data using the GUI. (fail)

I immediately opened a case with Synology, who sent me another DS1515+ pretty quickly (I still had to pay for express shipping).

After I inserted my disks into the newly received NAS, I noticed that the new NAS was beeping and was trying to recover my RAID5 array without any luck. DSM told me that the RAID 5 array was down but that all disks were healthy.


 

I waited until the parity check was performed to verify if the Synology was silently trying to recover the volume. Unfortunately after 10 hours, nothing appeared in my volume list.

I decided to dig and found plenty of useful information provided by Linux experts in the Synology community (shout-out to them).

Here is what I have done:

Recovering the volume using the CLI. (Semi-success)

First I wanted to check my raid information:
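The command output from this step is not reproduced here. As a rough illustration of what such a check looks at, here is a minimal Python sketch that summarizes the text of /proc/mdstat, the kernel's software RAID status file. The sample text is hypothetical (it only mimics the usual layout, it is not my real output), the function name is my own, and the parser only handles the common "active" layout.

```python
import re

# Hypothetical sample in the usual /proc/mdstat layout (not real output from this NAS).
SAMPLE_MDSTAT = """\
Personalities : [raid1] [raid6] [raid5] [raid4]
md2 : active raid5 sda3[0] sde3[4] sdd3[3] sdc3[2] sdb3[1]
      11701741824 blocks super 1.2 level 5, 64k chunk, algorithm 2 [5/5] [UUUUU]
md1 : active raid1 sda2[0] sdb2[1] sdc2[2] sdd2[3] sde2[4]
      2097088 blocks [5/5] [UUUUU]
unused devices: <none>
"""


def summarize_mdstat(text):
    """Return {array: {"state", "level", "members", "status"}} from mdstat text."""
    arrays = {}
    current = None
    for line in text.splitlines():
        if re.match(r"^md\d+ : ", line):
            name, rest = line.split(" : ", 1)
            tokens = rest.split()
            # Member devices carry a "[n]" suffix; the RAID level does not.
            level = tokens[1] if len(tokens) > 1 and "[" not in tokens[1] else None
            members = sorted(t.split("[")[0] for t in tokens[1:] if "[" in t)
            current = arrays[name] = {
                "state": tokens[0],
                "level": level,
                "members": members,
                "status": None,
            }
            continue
        # The detail line ends with one letter per member: U = up, _ = missing.
        status = re.search(r"\[([U_]+)\]", line)
        if status and current is not None:
            current["status"] = status.group(1)
    return arrays


for name, info in summarize_mdstat(SAMPLE_MDSTAT).items():
    print(name, info["state"], info["level"], info["members"], info["status"])
```

A status string made only of "U" characters means every member is up, which is what made this case confusing: the array looked clean even though the volume would not mount.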

I knew md0 and md1 (system + swap) were fine but md2 (actual data) was not behaving properly even though my disks were “fine”. The RAID 5 state was clean and the number of disks matched what I have (/dev/sd[abcde]3). Let’s look at the state of the RAID 5 array in more detail:

The partitions were good when I ran an fdisk -l so I tried to stop and reassemble the RAID 5 array.

When I tried to mount my partition as mentioned in the above link from the Synology forums, I had the following issue:

So I stopped the RAID array, reassembled it again, and checked the status of my array with dmesg:

Ah! The journal has an issue when loading, so let’s try to mount it and load the journal. (Do not plan on doing this for long-term use of your NAS; my immediate concern was data recovery.)

The mount seemed to have worked, so let’s check whether I could see something in the /recovery folder:

I could see my folders (some names are changed for obvious privacy reasons), but I was wondering how to retrieve my data now. The GUI couldn’t see a volume, and I couldn’t install any package on the volume in the GUI (because it didn’t see any), so I couldn’t use FTP at all.

So I had another empty box with Linux running on it and I decided to do some Rsync backup from the old NAS (failed volume) to another NAS.
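The exact command I ran is not reproduced here. As a hedged sketch of the idea, the small Python helper below only builds an rsync command line (archive mode, verbose, with progress) and prints it; the helper name, the destination host and the paths are hypothetical placeholders, and nothing is executed.

```python
import shlex


def build_rsync_command(source_dir, destination, dry_run=False):
    """Build an rsync argument list that copies source_dir's contents to destination."""
    command = ["rsync", "-a", "-v", "--progress"]
    if dry_run:
        # --dry-run lists what would be transferred without copying anything.
        command.append("--dry-run")
    # Trailing slash: copy the directory's contents rather than the directory itself.
    command.append(source_dir.rstrip("/") + "/")
    command.append(destination)
    return command


# Hypothetical destination; replace with your own backup target.
cmd = build_rsync_command("/recovery", "backup@other-nas.example:/volume1/rescue", dry_run=True)
print(shlex.join(cmd))
```

Starting with a dry run is a cheap way to confirm that the source path and the destination are the ones you expect before copying a few TB.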

I am so happy I retrieved the delta of data and learned from my mistakes. I need to automate the backup more frequently and to more than one device. Now I just have to wait until a few TB of data are copied.

I am far from being a Linux guru but I know my way around bash. Network engineers, you need to understand how Linux works. I had a very interesting conversation with Pete Lumbis a year ago at the Software-Defined Enterprise Conference & Expo about how to learn Linux. Pete and I had the same observation:

Most of the Linux courses I tried did not meet my expectations; I was quickly bored even though I had to keep watching/reading to understand everything. In this case I preferred to dig into the technology by reading articles and getting my hands very dirty.

Nic

Sorting list in Python

During my Python studies, I came across something that didn’t make much sense to me so I had to learn and investigate (with the help of experts).

What you can usually do in Python is to modify a variable and assign the result to the same variable. Because a piece of code is usually worth much more than an explanation:
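The original snippet is not reproduced here; a minimal sketch of the idea, using a made-up hostname string, could look like this:

```python
# Hypothetical value, for illustration only.
hostname = "  core-sw01  "

# str.strip() and str.upper() return new strings,
# so the result is assigned back to the same name.
hostname = hostname.strip()
hostname = hostname.upper()

print(hostname)  # CORE-SW01
```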

When you want to sort a list, that behavior is a bit different:

Let’s pretend I have a list of ARP entries from my switch:

If I want to sort it and reassign the value of it to the previously used variable I would use this code (Let’s pretend arp_entries is my variable that contains all these entries):
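My original entries are not reproduced here, so the sketch below uses made-up (IP address, MAC address) pairs to show what that reassignment does:

```python
# Hypothetical ARP entries (IP address, MAC address), for illustration only.
arp_entries = [
    ("10.0.0.30", "0025.b5a0.0003"),
    ("10.0.0.10", "0025.b5a0.0001"),
    ("10.0.0.20", "0025.b5a0.0002"),
]

# list.sort() returns None, so this assignment throws the list away.
arp_entries = arp_entries.sort()

print(arp_entries)  # None
```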

According to the official Python documentation, Python lists have a built-in list.sort() method that modifies the list in place. Let’s verify this:
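A quick way to verify it, again with made-up entries, is to keep the return value of list.sort() in a separate name and look at both:

```python
# Hypothetical ARP entries (IP address, MAC address), for illustration only.
arp_entries = [
    ("10.0.0.30", "0025.b5a0.0003"),
    ("10.0.0.10", "0025.b5a0.0001"),
    ("10.0.0.20", "0025.b5a0.0002"),
]

result = arp_entries.sort()  # sorts the existing list object in place

print(result)             # None
print(arp_entries[0][0])  # 10.0.0.10
```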

There is also a built-in sorted() function, which returns a new sorted list and does the job if you want to keep the original list intact:
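A minimal sketch with made-up entries: sorted() leaves the original list alone. The key function in the second call is an extra of mine, not part of the original example; it sorts by numeric IP value through the standard ipaddress module, since plain string sorting puts 10.0.0.10 before 10.0.0.2.

```python
import ipaddress

# Hypothetical ARP entries (IP address, MAC address), for illustration only.
arp_entries = [
    ("10.0.0.9", "0025.b5a0.0009"),
    ("10.0.0.10", "0025.b5a0.000a"),
    ("10.0.0.2", "0025.b5a0.0002"),
]

# sorted() builds a new list; arp_entries keeps its original order.
by_string = sorted(arp_entries)

# Optional: compare addresses numerically instead of as text.
by_address = sorted(arp_entries, key=lambda entry: ipaddress.ip_address(entry[0]))

print([ip for ip, mac in by_string])    # ['10.0.0.10', '10.0.0.2', '10.0.0.9']
print([ip for ip, mac in by_address])   # ['10.0.0.2', '10.0.0.9', '10.0.0.10']
print([ip for ip, mac in arp_entries])  # ['10.0.0.9', '10.0.0.10', '10.0.0.2']
```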

I was testing this because I am currently working on the free Python class that is run by Kirk Byers at https://pynet.twb-tech.com/ . I strongly recommend that course if you have very little programming experience. I will talk about that in a future blog post but in the meantime, have a look at Kirk’s website. It’s awesome!

Thanks to Kirk, Nicholas Russo and Greg Mueller for the hints and help provided on Slack (Network to Code, run by Jason Edelman).

Nic

Hyper-converged infrastructure – Part 2 : Planning a Cisco HyperFlex deployment

I recently got the chance to deploy a Cisco HyperFlex solution that is composed of 3 Cisco HX nodes in my home lab. As a result, I wanted to share my experience with that new technology (for me). If you do not really know what all this “Hyperconverged Infrastructure hype” is all about, you can read an introduction here.

Cisco eased our job by releasing a pre-installation spreadsheet, and it is very important to read that document with great attention. It will allow you to prepare the baseline of your HC infrastructure. The installation is very straightforward once all the requirements are met. The HX infrastructure has an important peculiarity: it is very very very (did I say very?) sensitive. If one single requirement is not met, the installation will stall and you will be in a delicate situation, because you may have to wipe the servers and restart the process. As a result, you could lose precious hours.

Cisco has a way to automate the deployment and to manage your HX cluster. The HX installer will interact with the Cisco UCSM, the vCenter, and the Cisco HX servers.

It is especially relevant to note that the Cisco HX servers are tightly integrated with all the components described in the picture below:

HyperFlex Software versions.

As usual with this kind of deployment, you have to make sure that every version running in your environment is supported.  We will run the 2.1(1b) version in our lab and will upgrade to 2.5 at a later time. We need to make sure that our FI UCS Manager is running 3.1(2g).

In addition, the dedicated vCenter that we will use is running the release 6.0 U3 with Enterprise plus licenses.

Nodes requirements.

You cannot install fewer than 3 nodes in a Cisco HyperFlex cluster. Because the HX solution is very sensitive, it is mandatory to have some consistency across the nodes regarding the following parameters:

  • VLAN IDs
  • Credentials 
  • SSH must be enabled
  • DNS and NTP
  • VMware vSphere installed.

Network requirements.

First of all, the HyperFlex solutions require several subnets to manage and operate the cluster.

We will segment these different types of traffic using 4 VLANs:

  • Management Traffic subnet: This dedicated subnet will be used in order for the vCenter to contact the ESXi server. It will also be used to manage the storage cluster.
    • VLAN 210: 10.22.210.0/24
  • Data Traffic subnet: This subnet is used to transport the storage data and HX Data Platform replication
    • VLAN 212: 10.22.212.0/24
  • vMotion Network: Self-explanatory
    • VLAN 213: 10.22.213.0/24
  • VM Network: Self-explanatory
    • VLAN 211: 10.22.211.0/24
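Since the installer is so sensitive, it can be worth sanity-checking the addressing plan before starting. The sketch below is only an illustration with Python's standard ipaddress module: it takes the four lab subnets listed above and reports any pair that overlaps. The function name is my own.

```python
import ipaddress
from itertools import combinations

# VLAN plan from the list above.
VLAN_SUBNETS = {
    210: "10.22.210.0/24",  # management
    211: "10.22.211.0/24",  # VM network
    212: "10.22.212.0/24",  # storage data
    213: "10.22.213.0/24",  # vMotion
}


def find_overlaps(vlan_subnets):
    """Return the pairs of VLAN IDs whose subnets overlap."""
    networks = {vlan: ipaddress.ip_network(cidr) for vlan, cidr in vlan_subnets.items()}
    return [
        (a, b)
        for a, b in combinations(sorted(networks), 2)
        if networks[a].overlaps(networks[b])
    ]


print(find_overlaps(VLAN_SUBNETS))  # []
```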

Here is how we will assign IP addresses to our cluster:

UCSM Requirements.

We also need to assign IP addresses for the UCS Manager Fabric Interconnect that will be connected to our Nexus 5548:

  • Cluster IP Address: 
    • 10.22.210.9
  • FI-A IP Address:
    • 10.22.210.10
  • FI-B IP Address:
    • 10.22.210.11
  • A pool of IP for KVM:
    • 10.22.210.15-20
  • MAC Pool Prefix:
    • 00:25:B5:A0

 

DNS Requirements.

It is a best practice to use DNS entries in your network to manage your ESXi servers. Here we will use one DNS A record per node to manage the ESXi server. The vCenter, Fabric Interconnect and HX Installer will also have one.

The list below will show all the DNS entries I have used for this lab:

  • srv-hx-fi
    • 10.22.210.9
  • srv-hx-fi-a
    • 10.22.210.10
  • srv-hx-fi-b
    • 10.22.210.11
  • srv-hx-esxi-01
    • 10.22.210.30
  • srv-hx-esxi-02
    • 10.22.210.31
  • srv-hx-esxi-03
    • 10.22.210.32
  • srv-hx-installer
    • 10.22.210.211
  • srv-hx-vc
    • 10.22.210.210

This sounds very basic, but it is CRITICAL that these steps are performed PRIOR to any deployment; otherwise you will waste a lot of time trying to recover (at some point you would have to wipe your servers and reinstall a custom ESXi image on each one).
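One way to double-check the DNS plan before deploying is a short script. This is a sketch under an assumption of mine: that, as in this lab, every record should sit in the management subnet (VLAN 210) and no address should be used twice. The names and addresses come from the list above; the function name is hypothetical, and the script only checks the plan, it does not query a DNS server.

```python
import ipaddress

MANAGEMENT_SUBNET = ipaddress.ip_network("10.22.210.0/24")  # VLAN 210

# DNS A records planned for this lab.
DNS_PLAN = {
    "srv-hx-fi": "10.22.210.9",
    "srv-hx-fi-a": "10.22.210.10",
    "srv-hx-fi-b": "10.22.210.11",
    "srv-hx-esxi-01": "10.22.210.30",
    "srv-hx-esxi-02": "10.22.210.31",
    "srv-hx-esxi-03": "10.22.210.32",
    "srv-hx-installer": "10.22.210.211",
    "srv-hx-vc": "10.22.210.210",
}


def check_dns_plan(plan, subnet):
    """Return a list of problems: addresses outside the subnet or used twice."""
    problems = []
    seen = {}
    for name, address in plan.items():
        ip = ipaddress.ip_address(address)
        if ip not in subnet:
            problems.append(f"{name}: {ip} is outside {subnet}")
        if ip in seen:
            problems.append(f"{name}: {ip} is already used by {seen[ip]}")
        seen.setdefault(ip, name)
    return problems


print(check_dns_plan(DNS_PLAN, MANAGEMENT_SUBNET))  # []
```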

In the next blog post, I will show how to install the vCenter, the Fabric Interconnect and the HX installer needed for the HyperFlex deployment.

In conclusion, do not hesitate to leave a comment to let me know if you encountered any issue while planning your deployment.

Thanks for reading!  

Hyper-converged infrastructure – Part 1 : Is it a real thing ?

Recently I was lucky enough to play with Cisco HyperFlex in a lab and since it was fun to play with, I decided to write a basic blog post about the hyper-converged infrastructure concept (experts, you can move on and read something else 🙂). It has really piqued my interest. I know I may be late to the game but better late than never, right? 🙂

Legacy IT Infrastructure

Back in the day, you had to have separate silos to maintain a complete infrastructure (it is still true by the way, but it is more and more common for networks, servers, and storage to progressively form a single IT platform …. sorry I meant “cloud”):

  • Compute(System and Virtualization)
  • Storage
  • Network (Network and Security)
  • Application

You had to install and maintain multiple sub infrastructures in order to run the IT services in your company. 

If  you wanted to deploy a greenfield infrastructure for your data center, here is a brief summary of what you needed:

  • Physical servers (Owners: System team)
  • Hypervisors (Owners: System team)
  • Operating system (Owners: System team) 
  • Network infrastructure (Owners: Network team)
    • Routing – Switching
    • Security (VPN, Cybersecurity)
    • Load Balancers
  • Storage arrays (Owners: Storage team)
  • Applications for the business to run. (Owners: IT applications team)

Each silo has its own experts and language (LUN + FLOGI vs GPO + AD vs OSPF, BGP and TLS). As you can guess, provisioning new applications and services for any business was complicated and slow (even in a brownfield IT environment). Once everything was running, the IT team was in charge of maintaining the infrastructure, and one of the drawbacks was dealing with several manufacturers (and potentially partners) to do so.

Converged Infrastructure and simplification

In the late 2000s, famous manufacturers saw an opportunity to simplify the complexity of the complete data center stack and converged infrastructure was born.

With the emergence of cloud applications, EMC and Cisco created a joint venture, Acadia, which was later renamed VCE (for VMware, Cisco, EMC). The purpose of that company was to sell converged infrastructure products. Vblock was the flagship product. You could buy an already provisioned rack that was customized according to your preferences. The Vblock was composed of the following individual products:

  • Storage Array: EMC VNX/VMAX 
  • Storage Networking: Cisco Nexus, Cisco MDS
  • Servers: Cisco UCS C or UCS B
  • Networking: Cisco Nexus
  • Virtualization: vSphere

VCE was in charge of configuring (or customizing, I should say) the Vblock according to your needs and preferences.

Once the network was delivered, you “just” had to plug it in your data center networking infrastructure and everything should be connected. Servers were ready to be deployed.

Going that way, you could save time and trouble. Agility is also a big selling point for these kinds of architectures. 

As you can see, the footprint for these products was still considerable. In this case, you had to deal with a single manufacturer, but the main drawback was product flexibility. You could not install just any software version on your Cisco Nexus because VCE was very strict about the supported versions.

Hyper-converged Infrastructure and  horizontal scaling

Hyper-converged is a term that has been around since 2012. The main difference between converged and hyper-converged infrastructure is definitely the storage:

  • Converged infrastructure:
    • Centralized array accessible using a traditional storage network (FC with FSPF or iSCSI/NFS)
  • Hyper-converged infrastructure:
    • Distributed drives in each server forming a centralized file system.

A hyper-converged system is adaptable. It scales horizontally while reducing the footprint by a significant amount. If you just want to try it, set up a few hosts, and if the solution works for you, add nodes to the cluster horizontally to increase your performance and redundancy. This way, you can consolidate your compute and storage infrastructure.

Horizontal scaling is a familiar concept for many network engineers (Clos Fabrics anyone?)

In my opinion, it is a natural evolution of the Data Center compute and storage infrastructure.

There are several “Hyper-converged” manufacturers on the market:

My next post will be about deploying a Cisco Hyperflex infrastructure.

Thanks for reading !

 

From Network Engineer v1.0 to v2.0

I recently relocated to the US from France/Switzerland and I have been so busy the past 2 years working on that process. Yes, it is that long!

I have been asked about career advice twice this week and I wanted to share my thoughts about it.

Networking in 2008

I think we all agree on the fact that the networking field has been very static for the past 15 years. One of the ways to provide a better network experience to the users/applications was to add more bandwidth (or invest in WAN optimization). OSPF/BGP/EIGRP/MPLS and spanning tree haven’t changed much since 2002, right?

 
The networking manufacturers’ paradigm was all about releasing new hardware that could provide more bandwidth and availability. As engineers, we had to know networking protocols but we also had to understand the specifics of networking hardware. It was very useful to understand how the 6500 crossbar was switching packets internally. Another example was the StackWise technology: who remembers that the 3750 v2 could not locally switch without sending packets on the ring?

Every device had a specific function in the network (which is still true to some extent). Engineers were doing what vendors told them to do and they had to standardize their deployment (Access – Distribution – Core). It was a safe bet to design a network using the 3-tier architecture mentioned previously.

 

Some networking engineers are self-educated up to a certain point, and one of the ways to learn networking back in the day was to read a Cisco Press book, buy some hardware (2950 – 3600) on eBay and do some labs on your own or through a third-party training company. For these engineers, the way to get a job was to climb the traditional certification pyramid (CCENT – CCNA – CCNP – CCIE). While this is still kinda relevant, the CCIE does not automatically open the door to every job anymore. Matt Oswalt published a quote that makes total sense: “vendor certs are basically a way of putting the vendor in control of your career. On the other hand, fundamental knowledge puts YOU in control”.

I have a dual CCIE and studied very hard to get where I am today but the journey is far from being over (hopefully). I need to be a little less focused on proprietary certification and get some open source knowledge as well. (Damn CCDE you are tempting but I need to resist !)

Linux/Python skills were definitely not mandatory in any of the job descriptions back in the day. But as you can guess, they are becoming more and more of a requirement nowadays.

I’ve been invited to a very interesting dinner with CIOs of Fortune 100 companies recently. They are all aware of the ongoing networking transition. They admitted it was not an easy plan to embrace this evolution but they are already preparing their teams for that.  

Speaking of technologies, which technologies are we talking about? Do we need to know everything in IT? The answer is obviously “No”, but it is valuable to at least understand how all the systems interconnect with each other.

Here is what a job description looked like back in the days (2008):

 

The need for evolution

I am writing this blog post because our field is changing and our skills need to evolve with the networking trends. Engineers are the core of the networking industry. We all have a critical function in every organization that is willing to undertake its “business digital” transformation. We need to prepare to evolve with the upcoming technologies.
I am planning to write a blog post series on how to tackle your own networking evolution. Please do not get me wrong, we still need to understand the bits and bytes of all the networking protocols in order to provide connectivity. This statement will never go away (hopefully), and there is no working overlay if the underlay has not been designed carefully. What needs to evolve is the way we provision services for our customers/users/applications. When was the last time you heard that the networking team was taking too long to provide connectivity between A and B?

 

Networking in 2014+

Short story long, network engineers have to stay relevant throughout the years. 

Today it would be a bit different: you are definitely expected to know everything that is above, right? (except maybe Cisco Works and CatOS 🙂)

Himawan Nugroho made a great Cisco Live presentation that I attended in Milan: BRKSDN-4005 – CCIE Skill transformation to SDN kungfu. The most interesting slide for me is the following one: 

 

He confirmed what I was explaining above. You still have to be an expert at traditional routing/switching but also have a broader knowledge of the following technologies:  Linux and Operating Systems, Scripting, Overlays (proprietary and standards) and network virtualization. 

Some new protocols and ways to provide network connectivity have recently emerged. Some of them are already dead (TRILL anyone?) and others are being used worldwide in different flavors (VXLAN anyone?).

We see plenty of blog posts related to the eternal question: should we learn how to script/code?

My take on this is that you should be able to automate your network and most of your tasks. You should not consider going too deep (for now). We are not required to become full-time developers.

Some of the items you will find on this list are not necessarily new, but they are things that network engineers can no longer avoid being aware of. This is by no means an exhaustive list but it gives you an indication of what the current trends are in our industry. Feel free to drop a comment if you think something valuable should be added.


Acquiring all of these skills does not happen overnight, so I will publish quite a few blog posts about how I am preparing my own evolution. Let me know in the comments below what you liked or disliked, or if you have any questions.

Nic