Customer: a pioneer of electric vehicles and related technologies in India.
All of the Customer's vehicles are equipped with Internet of Things (IoT) sensors, and the data they collect is used to track and improve performance. The majority of connected car services require bi-directional communication between the car and the cloud. Cars send data to the cloud, enabling applications such as predictive maintenance and assisted driving. Similarly, the car needs to be able to receive messages from the cloud in order to respond to remote commands, such as charging the battery, remotely locking/unlocking the doors, and remotely activating the horn or lights. Scalable web technology such as TCP/IP can be used for car-to-cloud communication, but implementing cloud-to-car communication would require a static IP address for each car in the system. This is not possible, since cars move through cellular networks where there is no single IP address per device. Other technical challenges for connected car services include unreliable connectivity, network latency and security.
MQTT addresses many of the challenges of building scalable and reliable connected car services by enabling a persistent, always-on connection between the car and the cloud. When a network connection is available, a vehicle publishes data to the MQTT broker and receives subscribed data from the same broker in near real time. If a network connection is not available, the vehicle waits until the network is available before attempting to transmit data. While the vehicle is offline, the broker buffers data, and as soon as the vehicle is back online, the broker immediately delivers the buffered data. MQTT's advanced message retention policies and offline message queuing are essential for accommodating network latency and unreliable mobile networks. MQTT brokers can be deployed as cluster nodes running on private or public cloud infrastructure, which allows the broker to scale up and down depending on the number of vehicles trying to connect. MQTT is also secure: each car is responsible for establishing a secure, persistent TCP connection with the MQTT broker in the cloud using TLS. This means no public Internet endpoint is exposed on the car, so no one can connect to the car directly, making it virtually impossible for a car to be attacked directly by a hacker on the Internet.
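To make this connection model concrete, below is a minimal vehicle-side sketch using the Python paho-mqtt library (1.x-style callbacks): the client keeps a persistent TLS connection, subscribes to a command topic, and publishes telemetry with QoS 1. The broker hostname, topic names, and payload fields are illustrative assumptions, not the customer's actual configuration.

```python
# Minimal vehicle-side MQTT client sketch (paho-mqtt, 1.x-style callbacks).
# Broker host, topic names, and payload fields are hypothetical.
import json
import paho.mqtt.client as mqtt

VEHICLE_ID = "VIN-0001"                       # placeholder vehicle identifier
BROKER_HOST = "broker.example-iot.cloud"      # assumed cloud broker endpoint
BROKER_PORT = 8883                            # standard MQTT-over-TLS port

def on_connect(client, userdata, flags, rc):
    # Cloud-to-car commands: remote lock/unlock, horn, lights, charging.
    client.subscribe(f"vehicles/{VEHICLE_ID}/commands", qos=1)

def on_message(client, userdata, msg):
    command = json.loads(msg.payload)
    print("Command received from cloud:", command)   # e.g. {"action": "lock_doors"}

client = mqtt.Client(client_id=VEHICLE_ID, clean_session=False)  # persistent session
client.tls_set()                                                 # TLS-secured connection
client.on_connect = on_connect
client.on_message = on_message
client.connect(BROKER_HOST, BROKER_PORT, keepalive=60)

# Car-to-cloud telemetry with QoS 1, so delivery is retried if the network drops.
telemetry = {"vehicle_id": VEHICLE_ID, "battery_pct": 18, "door_open": False}
client.publish(f"vehicles/{VEHICLE_ID}/telemetry", json.dumps(telemetry), qos=1)

client.loop_forever()
```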
Almost 65% of the current Customer fleet operates on legacy platforms which send sensor data over TCP/IP instead of MQTT. The major challenge with the current architecture is that critical notifications, such as low battery or door open, take ~10 minutes to reach the customer. The customer wanted to reduce this turnaround time (TAT) to near real time. Going forward, all new cars are expected to support MQTT and WebSockets. All new sensors support updates over the air (OTA) or via SMS, and Secure File Transfer Protocol (SFTP) is also supported for downloading updates.
The following points explain the current architecture and the migration approach in detail:
- Sensor data size over TCP is ~360 bytes and over MQTT is ~440 bytes.
- Azure public IP addresses are whitelisted in the IoT sensors during the manufacturing and assembly stage for authentication.
- Currently, Azure hosts a TCP/IP gateway and an MQTT gateway server with parsers running, which push all the IoT time-series data to the Cassandra database. The gateway and parser applications are Java-based, while the rest of the application is service-based and written in NodeJS.
- 21 services are currently running, of which only 8 are containerized; the rest run as Node applications on plain VMs. Multiple services run on the same VM on Azure. Docker Swarm is used for container orchestration.
- Consul is used for service discovery and for storing the key-value pairs required by the OAuth service.
- An API Gateway service connects to two secured Kong containers on the backend.
- The customer uses Redis to store user sessions; the key-expire event is used to trigger scheduled notifications (see the sketch after this list). RabbitMQ is used to store messages, and ELK is used for log management and creating custom reports. All of these run as Docker containers.
- Two databases are used: a 3-node Cassandra cluster and PostgreSQL. Cassandra mostly stores the time-series data from the IoT sensors. The PostgreSQL database contains customer profile and vehicle data and is mostly used by the payment microservice. All transactional data is stored in PostgreSQL and accessed by the services. The total database size is ~120 GB for Cassandra and ~150 MB for PostgreSQL.
- All application microservices and MQTT/TCP IoT brokers will be containerized and deployed on AWS Fargate.
- Data from all the latest IoT sensors will be sent to the AWS environment. IoT sensor data will be pushed to a Kinesis stream, and a Lambda function will query the stream to find critical data (low battery, door open, etc.) and call the notification microservice.
- Data from the old sensors will initially be sent to the Azure environment because of the existing public IP whitelisting. An MQTT bridge and TCP port forwarding will proxy these requests from Azure to AWS. Once the old sensors are updated, traffic will be fully cut over to AWS.
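As an illustration of the Redis key-expire trigger mentioned in the list above, the sketch below uses redis-py to enable keyspace notifications and listen for expired keys; the key naming convention and the notify() stub are assumptions for illustration.

```python
# Sketch: scheduled notifications driven by Redis key-expire events (redis-py).
# Key names and the notify() stub are hypothetical.
import redis

r = redis.Redis(host="localhost", port=6379, db=0)

# Enable keyspace notifications for expired-key events ("E" = keyevent, "x" = expired).
r.config_set("notify-keyspace-events", "Ex")

# Schedule a notification by writing a key whose TTL matches the desired delay.
r.set("notify:VIN-0001:charge_reminder", "1", ex=3600)   # fires in one hour

def notify(expired_key: bytes) -> None:
    print("Trigger notification for:", expired_key.decode())

# Listen on the keyevent channel for db 0 expirations and react to each one.
p = r.pubsub()
p.psubscribe("__keyevent@0__:expired")
for message in p.listen():
    if message["type"] == "pmessage":
        notify(message["data"])   # the payload is the name of the expired key
```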
The important steps in the architecture are explained below:
- IAM roles will be created to access the different AWS services.
- The network will be set up using the VPC service, with an appropriate CIDR range, subnets, route tables, etc.
- A NAT Gateway will be set up to enable internet access for servers in the private subnet.
- All Docker images will be stored in the Elastic Container Registry (ECR).
- AWS ECS Fargate will be used to run the Docker containers. An ECS task definition will be configured for each container to be run.
- AWS ECS Fargate will be used to deploy all the container images on the worker nodes. In Fargate, the control plane and worker nodes are managed by AWS; scaling, high availability (HA), and patching are handled by AWS. An Application Load Balancer (ALB) will be deployed as the front end to all the application microservices. The ALB will forward requests to the Kong API Gateway, which in turn will route them to the microservices.
- Service-level scaling will be configured in Fargate so that more containers spin up based on load (see the scaling sketch after this list).
- The ElastiCache service with the Redis engine will be deployed across multiple Availability Zones (AZs) for HA. ElastiCache is a managed AWS service, so HA, patching, updates, etc. are handled by AWS.
- Aurora PostgreSQL will be used to host the PostgreSQL database. A SQL dump will be taken from the Azure PostgreSQL VM and then restored on Aurora.
- A 3-node Cassandra cluster will be set up across multiple AZs in AWS for HA: two nodes will run in one AZ and the third node in a second AZ.
- A 3-node Elasticsearch cluster will also be set up using the managed Amazon Elasticsearch Service, in which all nodes are managed by AWS.
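One way the service-level scaling mentioned in the list above could be wired up is through the Application Auto Scaling API, sketched here with boto3; the cluster name, service name, capacities, and thresholds are placeholders rather than the customer's actual values.

```python
# Sketch: target-tracking auto scaling for an ECS Fargate service (boto3).
# Cluster/service names, capacities, and thresholds are placeholders.
import boto3

autoscaling = boto3.client("application-autoscaling")
resource_id = "service/iot-cluster/notification-service"   # format: service/<cluster>/<service>

# Register the service's desired task count as a scalable target.
autoscaling.register_scalable_target(
    ServiceNamespace="ecs",
    ResourceId=resource_id,
    ScalableDimension="ecs:service:DesiredCount",
    MinCapacity=2,
    MaxCapacity=10,
)

# Add/remove tasks to keep average CPU utilisation near 60%.
autoscaling.put_scaling_policy(
    PolicyName="cpu-target-tracking",
    ServiceNamespace="ecs",
    ResourceId=resource_id,
    ScalableDimension="ecs:service:DesiredCount",
    PolicyType="TargetTrackingScaling",
    TargetTrackingScalingPolicyConfiguration={
        "TargetValue": 60.0,
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "ECSServiceAverageCPUUtilization"
        },
        "ScaleOutCooldown": 60,
        "ScaleInCooldown": 120,
    },
)
```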
The bi-directional notification workflow is explained below:
- The TCP and MQTT gateways will run on EC2 instances, with the parser application on a separate EC2 instance.
- AWS public IP addresses will be whitelisted on the IoT sensors during manufacturing so that the devices can securely connect to AWS.
- The gateway server will push the raw data coming from the sensors to a Kinesis stream (see the producer sketch after this list).
- The parser server will push the converted/processed data to the same or another Kinesis stream.
- A Lambda function will query the data in the Kinesis stream to find fault/notification-type data and will invoke the notification microservice or SNS to notify the customer (see the Lambda sketch after this list). This reduces the current notification time from 6-8 minutes to near real time.
- Kinesis Data Firehose will be used as a consumer reading from the Kinesis streams, pushing processed data to a separate S3 bucket.
- Another Firehose delivery stream will push the processed data to the Cassandra database and to a different S3 bucket.
- AWS Glue will be used for the data aggregation previously done with Spark jobs, and it will push the aggregated data to a separate S3 bucket.
- Athena will be used to query the S3 buckets; standard SQL queries work with Athena (see the query sketch after this list). Dashboards will be created using Tableau.
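For the gateway-to-Kinesis step in the workflow above, a producer call might look like the sketch below (boto3); the stream name and payload fields are illustrative assumptions.

```python
# Sketch: gateway server pushing a raw sensor payload to a Kinesis stream (boto3).
# Stream name and payload fields are hypothetical.
import json
import boto3

kinesis = boto3.client("kinesis")

def push_raw_reading(reading: dict) -> None:
    kinesis.put_record(
        StreamName="iot-raw-sensor-data",        # assumed stream name
        Data=json.dumps(reading).encode("utf-8"),
        PartitionKey=reading["vehicle_id"],      # keeps one vehicle's records on the same shard
    )

push_raw_reading({"vehicle_id": "VIN-0001", "battery_pct": 18, "door_open": True})
```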
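The Lambda step that filters critical data and notifies the customer could look roughly like this sketch: the handler decodes Kinesis records, checks for low-battery and door-open conditions, and publishes to an SNS topic. The field names, thresholds, and topic ARN are assumptions for illustration.

```python
# Sketch: Lambda triggered by a Kinesis stream, filtering critical events
# (low battery, door open) and publishing alerts to SNS.
# Field names, thresholds, and the topic ARN are hypothetical.
import base64
import json
import boto3

sns = boto3.client("sns")
ALERT_TOPIC_ARN = "arn:aws:sns:ap-south-1:123456789012:vehicle-alerts"   # placeholder

def lambda_handler(event, context):
    alerts_sent = 0
    for record in event["Records"]:
        # Kinesis record payloads arrive base64-encoded in the Lambda event.
        payload = json.loads(base64.b64decode(record["kinesis"]["data"]))

        alerts = []
        if payload.get("battery_pct", 100) < 20:
            alerts.append(f"Low battery: {payload['battery_pct']}%")
        if payload.get("door_open"):
            alerts.append("Door open")

        for alert in alerts:
            sns.publish(
                TopicArn=ALERT_TOPIC_ARN,
                Subject="Vehicle alert",
                Message=json.dumps({"vehicle_id": payload.get("vehicle_id"), "alert": alert}),
            )
            alerts_sent += 1
    return {"records": len(event["Records"]), "alerts": alerts_sent}
```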
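Running a standard SQL query against the S3 data from Athena could look like the sketch below (boto3); the database, table, and result-bucket names are placeholders.

```python
# Sketch: querying the S3 data lake with a standard SQL statement via Athena (boto3).
# Database, table, and output-location names are placeholders.
import boto3

athena = boto3.client("athena")

response = athena.start_query_execution(
    QueryString="""
        SELECT vehicle_id, AVG(battery_pct) AS avg_battery
        FROM sensor_data
        GROUP BY vehicle_id
        ORDER BY avg_battery ASC
        LIMIT 100
    """,
    QueryExecutionContext={"Database": "iot_telemetry"},
    ResultConfiguration={"OutputLocation": "s3://example-athena-results/"},
)
print("Athena query started:", response["QueryExecutionId"])
```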
Key technologies used: Cassandra, Amazon Kinesis, Amazon Redshift, Amazon Athena, Tableau.
The customer's vehicles are now able to send and receive notifications in near real time. Using AWS, the applications can scale on a secure, fault-tolerant, low-latency global cloud. With the implementation of a Continuous Integration/Continuous Delivery (CI/CD) pipeline, the customer team no longer spends its valuable time on mundane administrative tasks. Powerup helped the customer achieve its goal of securing data while lowering cloud bills and simplifying compliance.