
How To Approach Intermittent DNS Issue In EKS


In the world of Kubernetes, intermittent issues are notoriously tricky. Unlike persistent failures, intermittent issues come and go, which makes them hard to track down and pinpoint. In this article, we will dive into the complexities of diagnosing and approaching intermittent DNS issues.

The term ‘intermittent’ evokes the following considerations:

  • Limitations.

  • Abrupt vanishing of resources that have not undergone safe eviction processes.

  • Missing/Incorrect configuration.

  • Bugs.

These considerations apply to any intermittent issue, not only DNS issues.

Let’s go through all of these considerations one by one.

1. Limitations

Networking resources are not unlimited, so you need to check the following.

A.) Check if DNS queries reached the limit of 1024 DNS query packets per second (PPS)

Some intermittent DNS resolution failures are caused by hitting the limit of 1024 packets per second (PPS) when sending DNS queries to the Amazon Route 53 Resolver.

Use tcpdump (Linux only)

a.) Use the following command to take rotating packet captures on your EC2 instance. It captures the first 350 bytes of each packet and keeps 20 files of 100 MB each, overwriting the oldest captures.

sudo tcpdump -i eth0 -s 350 -C 100 -W 20 -w /var/tmp/$(curl http://169.254.169.254/latest/meta-data/instance-id).$(date +%Y-%m-%d:%H:%M:%S).pcap

b.) Run the following Linux command to determine the number of DNS queries sent.

tcpdump -r <file_name.pcap> -nn dst port 53 | awk -F " " '{ print $1 }' | cut -d"." -f1 | uniq -c

c.) If the number of DNS queries is greater than or equal to 1024 per second, any additional queries are throttled.
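To spot the seconds that crossed the limit, the per-second counts produced by the pipeline above can be run through a small awk filter. This is a sketch; the sample counts piped in below are illustrative, and in practice you would pipe in the `uniq -c` output instead.

```shell
# Flag any second in which the captured DNS query count reached the
# 1024 PPS limit. The printf lines stand in for real "uniq -c" output
# of the form "<count> <timestamp>".
printf '%s\n' ' 987 13:05:01' '1024 13:05:02' '1101 13:05:03' |
awk '$1 >= 1024 { print "throttling likely at", $2, "(" $1 " queries)" }'
```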

For additional confirmation, filter CloudTrail for the error code "errorCode": "Client.RequestLimitExceeded".

Resolution

Implement NodeLocal DNSCache.

By implementing it, pods reach out to the DNS caching agent running on the same node, thereby avoiding iptables DNAT rules and connection tracking.

This improves DNS performance by keeping queries local instead of hopping to the kube-dns service (iptables) and being forwarded to a pod that may not be on the same host. For more details, check the motivations behind NodeLocal DNSCache [2].
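Once NodeLocal DNSCache is deployed, the kubelet on each node is typically pointed at the cache's link-local address so pods query the local agent first. A minimal sketch of the relevant kubelet configuration, assuming the upstream default listen address 169.254.20.10; adjust if your deployment uses a different one:

```yaml
# Kubelet configuration fragment (sketch): direct pod DNS traffic to
# the node-local cache instead of the cluster kube-dns service IP.
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
clusterDNS:
  - 169.254.20.10
```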

B.) Having a network-intensive application with heavy east-west traffic

Underestimating the system's networking requirements will cause packet drops and increase the conntrack_allowance_exceeded counter.

When a query runs from a pod, the search domains are appended automatically. So if the packet carrying the correct name is dropped, none of the subsequent candidates will resolve it, and you end up with an unresolved query.

For example, let's say you have the search domains below in your pod's /etc/resolv.conf, and a query from the pod is trying to resolve "test-svc".

search default.svc.cluster.local svc.cluster.local cluster.local ec2.internal

The resolver appends the first search domain to the query, producing "test-svc.default.svc.cluster.local". This is the right query and should be resolved, but unfortunately it is dropped because the conntrack table is full (it cannot track any more connections). Once the right query is dropped, the remaining queries built with the other search domains return "NXDOMAIN", which leads in the end to an unresolved query.
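The expansion above can be simulated with a short loop. This is a sketch; it just prints the candidate names the resolver would try, in order, for a single-label name under the default ndots=5:

```shell
# Print the queries a resolver would attempt for "test-svc", given the
# search list above. With ndots=5, each search domain is appended in
# order before the bare name is tried as an absolute query.
name="test-svc"
for domain in default.svc.cluster.local svc.cluster.local cluster.local ec2.internal; do
  echo "$name.$domain"
done
echo "$name."
```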

You can check this article [1] for more information about exceeding network limits.

To check the counters, run the command below on your instance, but you need to make sure that ethtool is installed.

ethtool -S eth0 | egrep "linklocal_allowance_exceeded|bw_in_allowance_exceeded|bw_out_allowance_exceeded|conntrack_allowance_exceeded|pps_allowance_exceeded"
bw_in_allowance_exceeded: 0
bw_out_allowance_exceeded: 0
pps_allowance_exceeded: 0
conntrack_allowance_exceeded: 0
linklocal_allowance_exceeded: 0
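When scanning many nodes, it helps to print only the counters that are nonzero. This is a sketch; the sample variable below stands in for real `ethtool -S eth0` output, since actual values depend on the instance:

```shell
# Print only the allowance counters that have been exceeded at least
# once. The sample variable stands in for real "ethtool -S eth0" output.
sample='bw_in_allowance_exceeded: 0
conntrack_allowance_exceeded: 42
pps_allowance_exceeded: 0'
printf '%s\n' "$sample" | awk -F': *' '$2 > 0 { print $1, "=", $2 }'
```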

Resolution

  • Choose an instance type that fits your networking requirements; also consider network-optimized instances.

  • Implement NodeLocal DNSCache as mentioned above, and check the motivations behind NodeLocal DNSCache here [3].

  • Implement a DNS retry mechanism in the application.

For more information about DNS tuning, check this article [3].
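The retry mechanism from the list above can be as simple as a wrapper around the lookup call. A minimal shell sketch; the function name and the retry/backoff values are illustrative, and `getent` is used so it works without extra tools:

```shell
# Retry a DNS lookup a few times with a linear backoff before giving up.
# The function name and retry/backoff values are illustrative.
resolve_with_retry() {
  name="$1"
  for attempt in 1 2 3; do
    if getent hosts "$name" > /dev/null 2>&1; then
      echo "resolved $name on attempt $attempt"
      return 0
    fi
    sleep "$attempt"
  done
  echo "failed to resolve $name after 3 attempts" >&2
  return 1
}

resolve_with_retry localhost
```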

2. Abrupt vanishing of resources

Abrupt vanishing of resources that haven't undergone safe eviction processes can lead to intermittent DNS issues. For example, when a pod terminates, the API server receives an update from a controller, the kubelet, or a kubectl client to terminate the pod → the endpoint (EP) controller removes the pod from the endpoints first → then kube-proxy removes the iptables rules for this particular pod → then the kubelet deletes the pod.

So if the node gets deleted suddenly, before the EP controller removes the pod's endpoints, you get intermittent DNS failures: the service still points to a pod endpoint that no longer exists, and the EP controller has no clue that the pod is already gone.

  • In the case of spot instances, the instance is forcibly interrupted after the interruption notice, without safely evicting its pods. Also, based on this documentation [4], CoreDNS shouldn't be run on spot instances.

  • In the case of self-managed nodes, you have to create your own EC2_INSTANCE_LAUNCHING and EC2_INSTANCE_TERMINATING lifecycle hooks; by default, managed node groups come without these lifecycle hooks.

To make sure that the intermittent DNS issue happened due to a spot interruption, filter CloudTrail by the event name "BidEvictedEvent".
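If you export the matching CloudTrail events to a file, counting the evictions is a one-liner. A sketch; the file path and the truncated records written below are made-up stand-ins for a real export:

```shell
# Illustrative only: two truncated, made-up CloudTrail records stand in
# for a real export. Count how many spot eviction events the file holds.
cat > /tmp/events.json <<'EOF'
{"eventName": "BidEvictedEvent", "eventSource": "ec2.amazonaws.com"}
{"eventName": "TerminateInstances", "eventSource": "ec2.amazonaws.com"}
EOF
grep -c '"eventName": "BidEvictedEvent"' /tmp/events.json
```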

Resolution

  • In case you are using spot instances, make sure to deploy the Node Termination Handler (NTH) to respond to spot interruptions. For more information about how to deploy it, please visit this link [5].

  • Also make sure that you have ASG lifecycle hooks so that your nodes are terminated safely.

  • Check if any nodes got terminated and find out why: look at the "TerminateInstances" event in CloudTrail and your Auto Scaling group activity history to see whether the node was terminated manually or due to an EC2 instance status check failure.

3. Missing/Incorrect configuration

A.) Possibility of intermittent DNS when using multiple node groups (managed or self-managed) with different security groups and NACLs

If you have a different security group in each node group, make sure that all the node groups are able to communicate with each other. For example, if CoreDNS and your pods are distributed across multiple node groups and those node groups are not connected to each other, the nodes within the same node group will be able to communicate with each other, but not with the other node groups.

B.) Possibility of intermittent DNS when using security groups for pods

The VPC resource controller creates branch interfaces with a separate security group for the pod. Make sure that the pod is able to reach all the CoreDNS pods across the cluster, whether they are deployed on a single node group or spread over multiple.

C.) Possibility of intermittent DNS when setting the ndots value inappropriately

A lower ndots value can significantly reduce the number of DNS queries, but it can also cause DNS resolution issues if not used wisely. For example, if ndots=1, requests to test-app.namespace will not be resolved because CoreDNS thinks this is a fully qualified domain name.

So you need to use ndots wisely, depending on your use case. If your application needs internal communication with other internal services, you will have to use the expanded internal domain syntax (app.namespace.svc.cluster.local.), and the default ndots (ndots=5) is needed.
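ndots can also be tuned per workload rather than cluster-wide, via the pod's dnsConfig. A sketch of a pod spec fragment; the value shown is illustrative and only suits workloads that mostly resolve external FQDNs:

```yaml
# Pod spec fragment (sketch): lower ndots for a pod that mostly
# queries external names. Use with care, as discussed above.
spec:
  dnsConfig:
    options:
      - name: ndots
        value: "2"
```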

D.) Possibility of an intermittent DNS issue when using drain and terminate on spot interruptions as well as drain and terminate on spot rebalance recommendations

These two components do not share information with each other, meaning that if you have drain-and-terminate functionality enabled on NTH, NTH may remove a node for a spot rebalance recommendation. Karpenter will replace the node to fulfill the pod capacity that was being served by the old node; however, Karpenter won't be aware of the reason that node was terminated. This means that Karpenter may launch the same instance type that was just deprovisioned, causing a spot rebalance recommendation to be sent again. This can result in very short-lived instances, where NTH continually removes nodes and Karpenter re-launches the same instance type over and over again.

This potentially causes more node churn in the cluster than interruptions alone, which is why Karpenter doesn't recommend reacting to spot rebalance recommendations when running Karpenter with spot nodes.
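When running Karpenter with spot nodes, the rebalance handling can typically be switched off on the NTH side. A sketch of Helm values, assuming the aws-node-termination-handler chart exposes these flags (verify the names against your chart version):

```yaml
# Helm values fragment (sketch) for aws-node-termination-handler:
# keep draining on spot interruptions, but ignore rebalance
# recommendations so Karpenter owns replacement decisions.
enableSpotInterruptionDraining: true
enableRebalanceMonitoring: false
enableRebalanceDraining: false
```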

4. Bugs

An intermittent DNS issue may occur due to a bug in the software that you are using, so you first need to check whether there is a known intermittent DNS issue for that software. Here is an example [6].

SRE Tribe