$ServerName.active

Wednesday, December 11, 2024

FinOps, The next step in "operational" development

In several of my recent posts, I’ve discussed using Lambda scripting to identify and clean up unused resources in AWS environments. While these tasks traditionally fell under DevOps, they are now part of a broader discipline known as FinOps. Short for Financial Operations, FinOps merges financial management with operational efficiencies to maximize the value organizations derive from cloud computing.

Although the FinOps Foundation formally established the concept in 2019, its principles date back to the early 2010s. During this time, businesses began focusing on managing cloud costs as the shift from capital expenditure (CapEx) to operational expenditure (OpEx) models made cost efficiency a priority.

The Importance of Tagging

Tagging in cloud environments, particularly in development settings, is a foundational practice that can transform the way organizations manage and optimize their resources. Beyond simple organization, tagging serves as a critical tool for financial operations, resource accountability, and operational efficiency. By implementing a robust tagging strategy, teams can address common challenges in cloud resource management, such as uncontrolled costs, unclear ownership, and untracked manual processes.
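A minimal sketch of what a tagging check might look like: it compares a resource's tags against a required set and reports what is missing. The required keys (`Owner`, `Environment`, `CostCenter`) are hypothetical placeholders, and the tag list is assumed to use the `[{"Key": ..., "Value": ...}]` shape that boto3 returns for EC2 resources.

```python
# Hypothetical required tag keys; replace with your own tagging policy.
REQUIRED_TAGS = {"Owner", "Environment", "CostCenter"}


def missing_tags(tags, required=REQUIRED_TAGS):
    """Return the required tag keys that are absent or have an empty value.

    `tags` uses the EC2-style shape: [{"Key": "Owner", "Value": "team-a"}, ...].
    """
    present = {t["Key"] for t in (tags or []) if t.get("Value", "").strip()}
    return sorted(set(required) - present)


if __name__ == "__main__":
    example = [
        {"Key": "Owner", "Value": "team-a"},
        {"Key": "Environment", "Value": ""},  # empty values count as missing
    ]
    print(missing_tags(example))
```

A check like this could run inside the same kind of scheduled Lambda function described later in this post, flagging untagged resources before they become unowned costs.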

Managing EBS Snapshots with Lambda functions

Recently, we faced a situation where we found an account with over 25 TB of EBS snapshots, some of which dated back to 2017. These old snapshots had been piling up, creating substantial, unnecessary costs. We realized that without cleanup, costs would only increase, especially in our dev environment, where frequent changes to snapshots were generating excess storage overhead. This Lambda function was developed as a solution to automate the cleanup of outdated snapshots and refine our volume snapshot policy, allowing us to regain control over storage costs effectively.
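The function itself is not reproduced here, but the core selection step can be sketched as follows. This is an illustration, not the original code: the 90-day threshold and the helper names are assumptions, and the snapshot dictionaries are assumed to match the entries in the `Snapshots` list returned by boto3's `describe_snapshots`.

```python
from datetime import datetime, timedelta, timezone


def old_snapshots(snapshots, max_age_days=90, now=None):
    """Return snapshots whose StartTime is older than max_age_days.

    The 90-day default is an assumed retention period, not a recommendation.
    """
    now = now or datetime.now(timezone.utc)
    cutoff = now - timedelta(days=max_age_days)
    return [s for s in snapshots if s["StartTime"] < cutoff]


def total_size_gib(snapshots):
    """Sum the source-volume sizes (GiB) reported for the snapshots.

    VolumeSize is the size of the source volume, so this is an upper bound
    on the storage actually billed for incremental snapshots.
    """
    return sum(s.get("VolumeSize", 0) for s in snapshots)


def fetch_owned_snapshots():
    """Fetch all snapshots owned by this account (needs boto3 and credentials)."""
    import boto3  # imported lazily so the helpers above work without it

    ec2 = boto3.client("ec2")
    paginator = ec2.get_paginator("describe_snapshots")
    snapshots = []
    for page in paginator.paginate(OwnerIds=["self"]):
        snapshots.extend(page["Snapshots"])
    return snapshots
```

Separating the filter from the AWS call makes it easy to log the candidates first and only pass them to a delete step once the list has been reviewed.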

Schedule Lambda Functions Using AWS EventBridge

Cleaning up stopped EC2 instances is a crucial maintenance task for AWS accounts: a stopped instance no longer accrues compute charges, but its attached EBS volumes and any associated Elastic IP addresses (EIPs) continue to cost money. To streamline this process, I’ve set up two versions of a Lambda function to automate the identification and deletion of stopped instances.

Each week, one version of the Lambda function runs on Thursday to inspect stopped instances and log them for review, while another version runs on Saturday to delete the identified instances. This two-phase approach allows time to verify which instances are flagged for deletion before executing the cleanup.

Step 1: Create a Scheduled Rule in Amazon EventBridge

To run the Lambda function on a specific day, you can use Amazon EventBridge (formerly CloudWatch Events). EventBridge allows you to create a scheduled rule that triggers the Lambda function at a specific time.

Navigate to EventBridge in the AWS Management Console.
Create a Rule:
- Set the rule type to Schedule.
- Define the cron expression or rate expression for the desired schedule. For example, to run at 7 AM every Thursday: cron(0 7 ? * 5 *)
- This cron expression means: "At 07:00 UTC every Thursday." In EventBridge cron syntax the day-of-week field runs from 1 (Sunday) to 7 (Saturday), so 5 is Thursday.
Set Target to your Lambda function.
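The same console steps can be scripted. The sketch below is one possible way to do it with boto3, assuming the clients are created elsewhere with `boto3.client("events")` and `boto3.client("lambda")`; the rule name, statement ID, and helper names are hypothetical. Unlike the console, the API does not grant EventBridge permission to invoke the function automatically, so the sketch adds it explicitly.

```python
import json


def schedule_rule_kwargs(name, cron):
    """Build the arguments for events.put_rule() for a scheduled rule."""
    return {"Name": name, "ScheduleExpression": f"cron({cron})", "State": "ENABLED"}


def lambda_target_kwargs(rule_name, function_arn, payload=None):
    """Build the arguments for events.put_targets() pointing at a Lambda ARN.

    `payload`, if given, is sent to the function as a constant JSON event.
    """
    target = {"Id": "1", "Arn": function_arn}
    if payload is not None:
        target["Input"] = json.dumps(payload)
    return {"Rule": rule_name, "Targets": [target]}


def create_schedule(events_client, lambda_client, name, cron, function_arn):
    """Create the rule, allow EventBridge to invoke the function, attach the target."""
    rule = events_client.put_rule(**schedule_rule_kwargs(name, cron))
    lambda_client.add_permission(
        FunctionName=function_arn,
        StatementId=f"{name}-invoke",  # hypothetical statement ID
        Action="lambda:InvokeFunction",
        Principal="events.amazonaws.com",
        SourceArn=rule["RuleArn"],
    )
    events_client.put_targets(**lambda_target_kwargs(name, function_arn))
    return rule["RuleArn"]
```

For the Thursday run this would be called with `cron="0 7 ? * 5 *"`; a Saturday run would use `7` in the day-of-week field instead.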

Step 2: Pass a Specific Set of Environment Variables

To use specific environment variables for a particular run:

Use AWS Lambda Versions and Aliases:
- You can create different versions of the Lambda function, each with its own set of environment variables.
- For example, you can create a version with inspection variables (DELETE_QUEUES=False) and another with deletion variables (DELETE_QUEUES=True).
- Assign an alias to each version (e.g., inspection and deletion).
EventBridge Rule Target Configuration:
- In the target configuration of the EventBridge rule, specify the alias for the Lambda version you want to run.
- This allows you to run different versions of the Lambda function based on the schedule.
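As a rough sketch of the version-and-alias workflow above, the helper below sets the environment variables on the unpublished function, publishes a version (which freezes those variables), and points an alias at it. The helper name is hypothetical and the client is assumed to come from `boto3.client("lambda")`. Note that `Environment` replaces the whole variable set, and `create_alias` fails if the alias already exists, in which case `update_alias` would be the call to use.

```python
def publish_variant(lambda_client, function_name, alias, variables):
    """Set env vars on $LATEST, publish a version, and point an alias at it.

    Returns the published version number as a string.
    """
    lambda_client.update_function_configuration(
        FunctionName=function_name,
        Environment={"Variables": variables},  # replaces all existing variables
    )
    # Wait for the configuration update to finish before publishing.
    lambda_client.get_waiter("function_updated").wait(FunctionName=function_name)
    version = lambda_client.publish_version(FunctionName=function_name)["Version"]
    lambda_client.create_alias(
        FunctionName=function_name, Name=alias, FunctionVersion=version
    )
    return version
```

Calling it once with `{"DELETE_QUEUES": "False"}` and the alias `inspection`, then again with `{"DELETE_QUEUES": "True"}` and the alias `deletion`, would produce the two targets that the EventBridge rules point to.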

Step 3: Use Code Variables

If you need to dynamically set environment variables for each run:

Update Environment Variables in Code:
- Modify the Lambda function code to accept environment variable overrides via the event payload.
import os

def lambda_handler(event, context):
    # Override environment variables if provided in the event.
    # str() guards against JSON booleans (true/false) arriving in the payload.
    delete_queues = str(event.get('DELETE_QUEUES', os.getenv('DELETE_QUEUES', 'True'))).lower() == 'true'
    send_slack_message = str(event.get('SEND_SLACK_MESSAGE', os.getenv('SEND_SLACK_MESSAGE', 'True'))).lower() == 'true'
    # Your logic here...
Create EventBridge rules that target the different versions of the code and variables:

Assuming you already have a Lambda function that checks for non-running EC2 instances and deletes them if required, you’ll need to create two separate versions:

Version 1: For inspection (running every Thursday, without deleting).
Version 2: For deletion (running every Saturday, with deletion enabled).

Version 1 of my code ONLY sends a Slack notification.

Version 2 of my code sends Slack notifications AND deletes the instances.
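The difference between the two versions can be reduced to a single flag. The sketch below is an illustration rather than my production code: `cleanup` and `stopped_instance_ids` are hypothetical names, `ec2` is assumed to be a `boto3.client("ec2")`, and `notify` stands in for whatever sends the Slack message. Pagination of `describe_instances` is omitted for brevity.

```python
def stopped_instance_ids(describe_response):
    """Extract IDs of stopped instances from a describe_instances response."""
    return [
        inst["InstanceId"]
        for res in describe_response.get("Reservations", [])
        for inst in res.get("Instances", [])
        if inst["State"]["Name"] == "stopped"
    ]


def cleanup(ec2, delete, notify=print):
    """Report stopped instances; terminate them only when `delete` is True."""
    response = ec2.describe_instances(
        Filters=[{"Name": "instance-state-name", "Values": ["stopped"]}]
    )
    ids = stopped_instance_ids(response)
    notify(f"Stopped instances found: {ids}")
    if delete and ids:
        ec2.terminate_instances(InstanceIds=ids)
        notify(f"Terminated: {ids}")
    return ids
```

With `delete=False` this behaves like the Thursday inspection run; with `delete=True` it behaves like the Saturday deletion run.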

Step 4: Use Environment Variables

These settings can also be changed through the environment variables of the Lambda function itself (in the Lambda function, go to Configuration -> Environment variables).

By managing environment variables at the Lambda function level, you maintain a clear separation between inspection and deletion tasks, making it easy to configure and schedule them appropriately.

Monday, October 14, 2024

Stuck ArgoCD Deployment: How to Fix Finalizers Causing Stuck Pods

When using ArgoCD to automate your application deployments, you might encounter an issue where a deployment gets stuck in the Progressing state. This often happens due to Kubernetes finalizers not being released properly, preventing the related pods and resources from being destroyed.

Finalizers are keys listed in an object's metadata that tell Kubernetes to wait until certain clean-up tasks are completed before the object is fully deleted. If the controller responsible for a finalizer malfunctions or never removes it, the object stays in a terminating state and your deployment can hang indefinitely.

I’ll walk you through a step-by-step process to resolve this issue by manually removing the finalizers and then successfully deleting the stuck ArgoCD app.

Step-by-Step Guide

Step 1: Identify the Problem

When an ArgoCD app gets stuck in the "Progressing" state, you can confirm the issue by inspecting the status of the app and looking for finalizers that are preventing deletion.

You can do this by running the following command:

kubectl get app APP_NAME -o yaml

Look for the metadata.finalizers field. If you see finalizers listed but the app cannot progress to completion, that’s the cause of the problem.
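If you prefer to check this programmatically, the sketch below inspects the JSON form of the object (for example, the output of the same command with `-o json`). The function name is hypothetical. An object that has a `deletionTimestamp` and a non-empty `finalizers` list has been marked for deletion and is waiting on those finalizers.

```python
import json


def deletion_blockers(obj):
    """Summarize whether an object is marked for deletion and what blocks it."""
    meta = obj.get("metadata", {})
    finalizers = meta.get("finalizers") or []
    return {
        "name": meta.get("name"),
        "deleting": "deletionTimestamp" in meta,
        "finalizers": list(finalizers),
    }


# Sample input with made-up values, shaped like `kubectl get app APP_NAME -o json`.
sample = json.loads("""
{
  "metadata": {
    "name": "my-app",
    "deletionTimestamp": "2024-10-14T10:00:00Z",
    "finalizers": ["resources-finalizer.argocd.argoproj.io"]
  }
}
""")
print(deletion_blockers(sample))
```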

Step 2: Patch the App to Remove Finalizers

To resolve the stuck state, you need to remove the finalizers from the ArgoCD app. You can do this by running the following kubectl command:

kubectl patch app APP_NAME -p '{"metadata": {"finalizers": null}}' --type merge

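This command clears the finalizers list, so Kubernetes no longer waits for the finalizer logic and can proceed with the app deletion. It also skips whatever clean-up the finalizer was meant to perform, so resources managed by the app may be left behind and need to be deleted manually.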

Step 3: Patch the CRD (If Necessary)

In some cases, the Custom Resource Definition (CRD) associated with the app may also have finalizers that are causing the stuck state. To remove the finalizers from the CRD, use the following command:

kubectl patch crd CRD_NAME -p '{"metadata": {"finalizers": null}}' --type merge

This ensures that any finalizer present in the CRD is also removed, allowing for complete deletion of the resources.
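For repeated use, the two patch commands can be generated from one helper. This is a small sketch, assuming `kubectl` is on the PATH and already pointed at the right cluster; the helper names are hypothetical, and the dry-run default only prints the command so it can be reviewed first.

```python
import json
import shlex
import subprocess


def remove_finalizers_cmd(kind, name, namespace=None):
    """Build the kubectl argv that clears metadata.finalizers via a merge patch."""
    patch = json.dumps({"metadata": {"finalizers": None}})
    cmd = ["kubectl", "patch", kind, name, "--type", "merge", "-p", patch]
    if namespace:  # CRDs are cluster-scoped, so leave this unset for them
        cmd += ["-n", namespace]
    return cmd


def run(cmd, dry_run=True):
    """Print the command in dry-run mode; otherwise execute it and return stdout."""
    if dry_run:
        print(shlex.join(cmd))
        return None
    return subprocess.run(cmd, check=True, capture_output=True, text=True).stdout


if __name__ == "__main__":
    run(remove_finalizers_cmd("app", "APP_NAME"))
    run(remove_finalizers_cmd("crd", "CRD_NAME"))
```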

Step 4: Delete the Stuck App

After patching the finalizers, you can safely delete the stuck app by using the following command:

kubectl delete app APP_NAME

Step 5: Delete the CRD

If needed, you can also delete the CRD after patching its finalizers. Keep in mind that deleting a CRD also deletes every custom resource of that type in the cluster:

kubectl delete crd CRD_NAME

This should fully remove the app and any related resources, resolving the stuck progressing state.

Deployments getting stuck in the Progressing state in ArgoCD can often be traced back to finalizers not being properly removed, which blocks resource deletion. By manually patching the app and any related CRDs to remove finalizers, you can resolve this issue and successfully delete the resources.

If you’re facing this issue frequently, consider reviewing the finalizer behavior in your environment and ArgoCD configurations to ensure that finalizers are being handled correctly during normal operations. Properly configured finalizers will help avoid these kinds of issues in the future.