
Stream IoT data to S3 - the simple way

First, a short introduction to infinimesh, an Internet of Things (IoT) platform that runs completely in Kubernetes.
infinimesh enables the seamless integration of the entire IoT ecosystem, independent of any cloud technology or provider. infinimesh easily manages millions of devices in a compliant, secure, scalable and cost-efficient way, without vendor lock-in.
Over the last weeks we released a set of plugins - a task that had been on our roadmap for a while. Here is what we have so far:

  • Elastic
    Connects infinimesh IoT seamlessly into Elastic.

  • Timeseries
    Redis-timeseries with Grafana for time-series analysis and rapid prototyping; it can be used in production when configured as a Redis cluster and is ready to be hosted via Redis-Cloud.

  • SAP Hana
    All code to connect the infinimesh IoT Platform to any SAP Hana instance.

  • Snowflake
    All code to connect the infinimesh IoT Platform to any Snowflake instance.

  • Cloud Connect
    All code to connect the infinimesh IoT Platform to the public cloud providers AWS, GCP and Azure. This plugin enables customers to use their own cloud infrastructure and extend infinimesh to other services, like Scalytics, using their own cloud-native data pipelines and integration tools.
We have chosen Docker as the main technology because it enables our customers to run their own plugins in their own space, in an environment they control. And since our plugins consume very few resources, they fit perfectly into the AWS EC2 free tier - which is what I use in this blog post.
The plugin repository is structured with developer friendliness in mind. All code is written in Go, and configuration is done in the docker-compose files. Since you need to put credentials into these files, we strongly advise running the containers in a controlled and secure environment.
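
Purely as an illustration of what this looks like from the plugin's side: a container typically picks up such credentials as environment variables set in the compose file. The variable names below (S3_ENDPOINT and the standard AWS ones) are assumptions for this sketch, not the plugin's actual configuration keys:

    // Illustrative sketch only: reading credentials passed into the container
    // as environment variables via docker-compose. Variable names are assumptions.
    package main

    import (
        "fmt"
        "os"
    )

    func main() {
        endpoint := os.Getenv("S3_ENDPOINT")            // e.g. s3.amazonaws.com or a MinIO host
        accessKey := os.Getenv("AWS_ACCESS_KEY_ID")     // standard AWS credential variables
        secretKey := os.Getenv("AWS_SECRET_ACCESS_KEY")

        if accessKey == "" || secretKey == "" {
            fmt.Fprintln(os.Stderr, "credentials not set - check your docker-compose.yml")
            os.Exit(1)
        }
        fmt.Println("S3 endpoint configured:", endpoint)
    }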
infinimesh UI

Stream IoT data to S3

Here I would like to show how easy it is to combine IoT with infrastructure already running in public clouds. The most common task we have seen is streaming data to S3; most of our customers use S3 either directly on AWS, or by running their own object storage that speaks the S3 protocol, like MinIO - which is also Kubernetes native.
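
Because MinIO speaks the S3 protocol, the same client code works against both targets. A minimal, hypothetical Go sketch (the endpoint URL, region and use of the AWS SDK for Go here are placeholders for illustration, not part of the plugin):

    // Hedged sketch: pointing the AWS SDK for Go at a self-hosted MinIO instance
    // instead of AWS S3 only requires overriding the endpoint.
    package main

    import (
        "fmt"

        "github.com/aws/aws-sdk-go/aws"
        "github.com/aws/aws-sdk-go/aws/session"
        "github.com/aws/aws-sdk-go/service/s3"
    )

    func main() {
        sess := session.Must(session.NewSession(&aws.Config{
            Endpoint:         aws.String("http://minio.example.local:9000"), // placeholder MinIO endpoint
            Region:           aws.String("us-east-1"),
            S3ForcePathStyle: aws.Bool(true), // MinIO expects path-style bucket addressing
        }))

        svc := s3.New(sess)
        out, err := svc.ListBuckets(&s3.ListBucketsInput{})
        if err != nil {
            fmt.Println("list buckets failed:", err)
            return
        }
        for _, b := range out.Buckets {
            fmt.Println(*b.Name)
        }
    }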

Of course, you need either a private installation of infinimesh or accounts on infinimesh.cloud and AWS, if you use the cloud version of both. Here is a screenshot from the SMA device I used while writing this post:

Preparation

  1. Spin up a Linux EC2 instance in the free tier; a t2.micro instance should cover most needs
  2. Log into the VM and install Docker as described in the AWS documentation: Docker basics for Amazon ECS - Amazon Elastic Container Service
  3. Install docker-compose and git:

    sudo curl -L \
      "https://github.com/docker/compose/releases/latest/download/docker-compose-$(uname -s)-$(uname -m)" \
      -o /usr/local/bin/docker-compose \
      && sudo chmod +x /usr/local/bin/docker-compose \
      && sudo yum install git -y

That’s all we need as preparation. Now log off and log back in so the permissions set during the Docker installation (the docker group membership) take effect.

Setup and Run

  1. Clone the plugin repo:
    git clone https://github.com/infinimesh/plugins.git
  2. Edit the CloudConnect/docker-compose.yml and replace CHANGEME with your credentials
  3. Compose and start the connector (-d detaches from the console and lets the containers run in the background):
    docker-compose -f CloudConnect/docker-compose.yml --project-directory . up --build -d
  4. Check the container logs:
    docker logs plugins_csvwriter_1 -f
We used Go as the development language, so the resource consumption is low:


After about a minute the first CSV file should arrive in S3. That’s all - easy and straightforward.



Some developer internals

We have built some magic around the plugins to make them as easy as possible for customers to use and, at the same time, easy for developers to adapt.
  

How it works:

First we iterate over /objects, find all endpoints marked with [device], call the API for each device, and store the data as a sliding window in a local Redis store to buffer network latency. Every few seconds we send the captured data as CSV to the desired endpoints. In our tests we transported data from up to 2 million IoT devices through this plugin, with each device sending ten key:value pairs as JSON every 15 seconds.
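
A minimal sketch of that buffer-and-flush loop, assuming go-redis and the AWS SDK for Go (the Redis key, bucket name and 30-second flush interval are illustrative assumptions, and the polling side that pushes device readings into the buffer is omitted); this is not the actual plugin code:

    // Simplified sketch of the flush side: drain the Redis buffer, render CSV,
    // upload one file per interval to S3. Names and interval are assumptions.
    package main

    import (
        "bytes"
        "context"
        "encoding/csv"
        "fmt"
        "log"
        "time"

        "github.com/aws/aws-sdk-go/aws"
        "github.com/aws/aws-sdk-go/aws/session"
        "github.com/aws/aws-sdk-go/service/s3/s3manager"
        "github.com/go-redis/redis/v8"
    )

    func main() {
        ctx := context.Background()

        // Local Redis acts as the sliding-window buffer between device polls and S3 uploads.
        rdb := redis.NewClient(&redis.Options{Addr: "localhost:6379"})

        // S3 uploader; credentials are taken from the environment (set via docker-compose).
        uploader := s3manager.NewUploader(session.Must(session.NewSession()))

        for range time.Tick(30 * time.Second) {
            // Drain everything buffered since the last flush.
            rows, err := rdb.LRange(ctx, "devices:buffer", 0, -1).Result()
            if err != nil || len(rows) == 0 {
                continue
            }
            rdb.Del(ctx, "devices:buffer")

            // Render the buffered records as CSV lines.
            var buf bytes.Buffer
            w := csv.NewWriter(&buf)
            for _, r := range rows {
                w.Write([]string{time.Now().UTC().Format(time.RFC3339), r})
            }
            w.Flush()

            // Ship one CSV object per flush interval to S3.
            key := fmt.Sprintf("iot/%d.csv", time.Now().Unix())
            _, err = uploader.Upload(&s3manager.UploadInput{
                Bucket: aws.String("my-iot-bucket"), // placeholder bucket name
                Key:    aws.String(key),
                Body:   &buf,
            })
            if err != nil {
                log.Println("upload failed:", err)
            }
        }
    }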
