tag:blogger.com,1999:blog-46542829470031452602024-03-17T08:33:39.455+01:002pk03 over AI, ML, BigData and data processingHere I blog about my tech stuff, what I do and what my thoughts are. This blog is my private one, not related to my ventures. Alexander Altenhttp://www.blogger.com/profile/17445753775154988351noreply@blogger.comBlogger90125tag:blogger.com,1999:blog-4654282947003145260.post-39367244047260191422023-07-15T00:00:00.018+02:002023-07-15T19:33:24.946+02:00Life hacks for your startup with OpenAI and Bard prompts<div style="text-align: left;"><span style="font-weight: normal;"><a href="https://chat.openai.com/" target="_blank">OpenAI</a> and <a href="https://bard.google.com/" target="_blank">Bard</a> </span>are the most used GenAI tools today; the first one has a massive Microsoft investment, and the other one is an experiment from Google. But did you know that you can also use them to optimize and hack your startup? Even creating pitch scripts, sales emails, and elevator pitches with one (or both) of them helps you not only save time but also validate your marketing and wording. Curious? Here are a few prompt hacks for startups to create, improve, and validate buyer personas, your startup's mission and vision statements, and USP definitions.</div><div style="text-align: left;"><br /></div><h2 style="text-align: left;"><span style="font-weight: normal;">Introduce yourself and your startup</span></h2><div><div>Introduce yourself to the chatbot: your startup, your website, your idea, your position, and, in a few words, what you are doing:<br /><br /></div><div><b>Prompt</b>: <i>I'm NAME and our startup NAME, with website URL, is doing WHATEVER. With PRODUCT NAME, we aim to change or disrupt INDUSTRY.</i><br /><br />Bard is able to pull information from your website; I'm not sure if ChatGPT can do that. Nevertheless, you have now laid a great foundation for the model to understand you and what you are doing.</div><div><br /><h2 style="text-align: left;"><span style="font-weight: normal;">Mission statement</span></h2></div><div><b>Prompt</b>: <i>Our current mission/vision is: YOUR VISION / MISSION. Help me enhance it by making it clearer and more inspiring.</i><br /><br /></div><div>If you don't have a mission yet, use this <b>prompt</b>: </div><div><i>I'm thinking about a compelling mission and vision statement; help me draft some.</i></div><div><i><br /></i><h2 style="text-align: left;"><span style="font-weight: normal;">Buyer persona</span></h2></div><div><b>Prompt</b>: <i>Help me create a buyer persona for my product PRODUCT NAME that we should approach; this persona needs to have the power to buy products for $500k with one check.</i><br /><br /><b>Prompt:</b> <i>Help me create a buyer-supporting persona for PRODUCT NAME that we should approach; those who are primarily in INDUSTRY AREA and are familiar with WHAT YOU WANT TO DISRUPT and related activities.</i><br /><br /><h2 style="text-align: left;"><span style="font-weight: normal;">USP</span></h2></div><div><b>Prompt</b>: <i>We are developing a product called PRODUCT NAME. It is a PRODUCT IDEA that offers WHATEVER YOU PROVIDE. 
I need you to assist me in crafting a compelling and concise description that highlights its unique selling proposition.</i><br /></div><div><i><br /></i></div><h2 style="text-align: left;"><span style="font-weight: normal;">Elevator Pitch</span></h2><div><b>Prompt: </b><i>Help me draft a convincing elevator pitch for PRODUCT NAME with a professional (or convincing) tone.</i><br /><br /><h2 style="text-align: left;"><span style="font-weight: normal;">Market size, SOM and SAM</span></h2><div><b>Prompt</b>: <i>How big is the market size for PRODUCT DESCRIPTION / PRODUCT NAME, and what could be our serviceable and obtainable market size for the COUNTRY or region?</i></div><div><br /></div><div>This is quite an interesting prompt; the first answer will rarely fit, so you have to refine it by defining more parameters. As an example: <i>I wanted to know how big the market size for our product is; we target the US Educational market</i>.</div><div><br /></div><h2 style="text-align: left;"><span style="font-weight: normal;">To wrap it up</span></h2><div>Now, I think you've got it. Also note that when you start to chat with one of the tools, they mostly know nothing about you, the company, or anything else. That means you have to introduce yourself, explain, and iterate on the answers, like:</div></div><div><i><br /></i></div><div><i>That was not what I wanted to know, let me rephrase. </i></div><div><i><br /></i></div><div>Now rephrase your question: the extra context steers the model toward better answers, and it can often pull more and better information. Every rephrase and added piece of information helps nail down the best response for you. I hope this helps a bit in improving your sales and marketing efforts for your startup.</div></div>Alexander Altenhttp://www.blogger.com/profile/17445753775154988351noreply@blogger.com0Miami, FL, USA25.7616798 -80.1917902-2.5485540361788459 -115.3480402 54.071913636178849 -45.0355402tag:blogger.com,1999:blog-4654282947003145260.post-34417068038264024482023-06-14T15:47:00.008+02:002023-06-14T15:49:52.825+02:00Can Generative AI profit from using Federated Data Processing?<br />OpenAI and Bard are all over the place, and I asked Bard: <div><br /></div><div><blockquote>Can Generative AI profit from using Federated Data Processing?</blockquote></div><div><br /> Generative AI can profit from federated data processing by accessing and processing data without having to transfer it to a centralized location. This can be beneficial for a number of reasons, including: </div><div><ul style="text-align: left;"><li>Privacy: Federated data processing can help to protect user privacy by keeping data on the devices where it was generated. This can be important for users who are concerned about their privacy, or for applications that require sensitive data, such as medical records.</li><li>Security: Federated data processing can help to improve security by making it more difficult for attackers to access data. This is because the data is never stored in a centralized location, where it could be more vulnerable to attack.</li><li>Scalability: Federated data processing can be scaled more easily than traditional centralized data processing. 
This is because the data does not have to be transferred to a central server, which can be a bottleneck in traditional data processing systems.</li></ul></div><div><br />As a result of these benefits, federated data processing is becoming increasingly popular for a variety of applications, including:</div><div><ul style="text-align: left;"><li>Fraud detection: Federated data processing can be used to detect fraud by analyzing data from multiple sources without having to transfer the data to a centralized location. This can help to improve fraud detection accuracy and reduce the risk of false positives.</li><li>Medical research: Federated data processing can be used to conduct medical research by analyzing data from multiple sources, such as electronic health records and clinical trials. This can help to improve the speed and accuracy of medical research and make it easier to identify new treatments and cures for diseases.</li><li>Personalized recommendations: Federated data processing can be used to generate personalized recommendations for products, services, and content by analyzing data from multiple sources, such as purchase history, browsing behavior, and social media activity. This can help businesses to improve customer satisfaction and increase sales.</li></ul><br />Overall, federated <a href="https://www.databloom.ai/" target="_blank">data processing</a> can offer a number of benefits for generative AI, including privacy, security, scalability, and improved accuracy. As a result, federated data processing is becoming increasingly popular for a variety of applications, including fraud detection, medical research, and personalized recommendations.</div><div><br /></div><a href="https://www.databloom.ai/blossom-sky" target="_blank">Blossom Sky</a> can help generative AI profit by providing a secure, scalable, and efficient way to process data. This can help to improve the accuracy, cost-effectiveness, and privacy of generative AI models.<div><br /></div>Alexander Altenhttp://www.blogger.com/profile/17445753775154988351noreply@blogger.com0tag:blogger.com,1999:blog-4654282947003145260.post-19124629646355721462023-03-12T10:40:00.002+01:002023-03-12T10:40:56.172+01:00How recasta.io is Disrupting the IoT Industry with its Cloud Native Platform recasta.cloud<p>The Internet of Things (IoT) is a term that refers to the network of physical devices, sensors, and machines that can communicate and exchange data over the internet. IoT has many applications in various domains such as smart homes, smart cities, healthcare, agriculture, manufacturing, and more.</p><p>However, building and managing IoT solutions is not an easy task. It requires connecting different types of hardware and software components, ensuring security and privacy of data and devices, collecting and analyzing large volumes of data, and integrating with existing business systems and web services.</p><p>This is where IoT platforms come in handy. An IoT platform is a software solution that helps developers and businesses to create, deploy, and manage IoT applications. 
An IoT platform typically provides features such as:</p><p></p><ul style="text-align: left;"><li>Device management: To register, monitor, control, update, and troubleshoot devices remotely</li><li>Connectivity: To support different communication protocols such as MQTT, CoAP, HTTP etc. (a short MQTT sketch follows below)</li><li>Security: To authenticate devices and users and encrypt data transmissions</li><li>Data management: To store, process, visualize, and analyze data from devices</li><li>Integration: To connect with other web services or business systems via APIs or SDKs</li></ul><p></p><p>There are many IoT platforms available in the market today. Some of them are offered by tech giants such as IBM Watson IoT, Microsoft Azure IoT, Google Cloud IoT, Amazon Web Services IoT etc. Others are open-source solutions such as ThingsBoard, Eclipse Kapua, Kaa etc. And there is one platform that stands out from the rest with its unique features and capabilities. That platform is <a href="https://concole.recasta.cloud">recasta.cloud</a>, built and offered by the Maltese startup <a href="https://recasta.io" target="_blank">recasta</a>.</p><p>Recasta is a cloud native IoT platform that aims to disrupt the IoT industry with its innovative approach. Recasta.io claims to be the first full cloud native stack for IoT applications. Here's a screenshot from the namespace I used to test the platform. All in all, it took me 5 minutes to get it up and running. </p><p></p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/a/AVvXsEhR59nCdH0hEebI2aAb1F91fnZUvA-QLPgIWqt371TIWs1Qse4v0YWspsCPq35NPfVRYsHeA6_Ppx8QtZ80oEZ0FavzXLPJDAew5726woYqpK4msdagosozBubBLUWUaaVy2zsMjdbBegw3LWQ1YdEAlGVxknfVnxJmvXfrtnG9WKPShKHPlu43Lvn22w" style="margin-left: 1em; margin-right: 1em;"><img alt="" data-original-height="821" data-original-width="1689" height="312" src="https://blogger.googleusercontent.com/img/a/AVvXsEhR59nCdH0hEebI2aAb1F91fnZUvA-QLPgIWqt371TIWs1Qse4v0YWspsCPq35NPfVRYsHeA6_Ppx8QtZ80oEZ0FavzXLPJDAew5726woYqpK4msdagosozBubBLUWUaaVy2zsMjdbBegw3LWQ1YdEAlGVxknfVnxJmvXfrtnG9WKPShKHPlu43Lvn22w=w640-h312" width="640" /></a></div><br /><br /><p></p><h2 style="text-align: left;"><b>What does cloud native mean?</b> </h2><p>Let's take a short dive into the term Cloud Native: <br />According to the Cloud Native Computing Foundation (<a href="https://www.cncf.io" target="_blank">CNCF</a>), cloud native technologies are those that empower organizations to build and run scalable applications in modern dynamic environments such as public cloud or private cloud.</p><p>Cloud native technologies typically include:</p><p></p><ul style="text-align: left;"><li>Containers: Lightweight packages of software that can run isolated on any operating system</li><li>Microservices: Small independent services that communicate via APIs</li><li>Orchestration: Tools that automate the deployment, scaling, and management of containers</li><li>Serverless: A model where code execution is triggered by events rather than servers</li></ul><p></p><p>And here recasta.io comes in again: the cloud they offer leverages these technologies and provides a fast, flexible, and scalable platform for building IoT applications. 
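<br /><br />To make the MQTT connectivity from the feature list above concrete: a device publishing telemetry to any MQTT-speaking platform looks roughly like this sketch (broker host, topic layout, and credentials are generic placeholders here, not recasta specifics):<br /><span style="font-family: courier;">import json<br />import paho.mqtt.client as mqtt  # pip install paho-mqtt<br /><br /># hypothetical broker and topic - check the recasta docs for the real endpoints<br />client = mqtt.Client(client_id="device-001")<br />client.username_pw_set("DEVICE_USER", "DEVICE_TOKEN")<br />client.connect("broker.example.cloud", 1883)<br />client.publish("devices/device-001/telemetry", json.dumps({"temperature": 21.4}))<br />client.disconnect()</span><br />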
This comes with a lot of benefits for IoT-related projects:</p><p></p><ul style="text-align: left;"><li>No vendor lock-in: Recasta.io runs on any public cloud provider such as AWS, Azure, Google Cloud etc.</li><li>No infrastructure management: Recasta.io handles all the aspects of provisioning, monitoring, updating, and scaling resources for your application</li><li>No coding required: recasta provides a graphical user interface (GUI) where you can drag-and-drop components to create your application logic without writing any code</li><li>No limits on scalability: The recasta stack can handle millions of devices and billions of messages per day without compromising performance or reliability</li></ul><p></p><p>Recasta.io also integrates with Dark Sky, a popular weather service powered by Apple's weather service, to provide real-time weather data for your application. You can use this data to trigger actions based on temperature, humidity, wind speed etc. As an example, you can create an application that automatically adjusts your smart thermostat based on weather conditions, using recasta.cloud as a backend. </p><p>To get started with recasta and build your own disruptive IoT application, you can sign up for a free account at <a href="https://www.recasta.io" target="_blank">https://www.recasta.io/ </a>. You can also check out their documentation at their GitHub via <a href="https://github.com/recasta/cloud-core/wiki">recasta/cloud-core Wiki</a>.</p><h3 style="text-align: left;"><b>Summary</b></h3>Recasta offers an innovative and exciting new platform that aims to transform the way we build and manage IoT applications in a stagnant market. Whether you need a simple or a complex solution for your next IoT project, recasta.io has you covered!Alexander Altenhttp://www.blogger.com/profile/17445753775154988351noreply@blogger.com0tag:blogger.com,1999:blog-4654282947003145260.post-37472146563463405062023-01-18T14:57:00.002+01:002023-01-18T14:59:12.674+01:00Some fun with Apache Wayang and Spark / TensorflowApache Wayang is an open-source Federated Learning (FL) framework developed by the Apache Software Foundation. It provides a platform for distributed machine learning, with a focus on ease of use and flexibility. It supports multiple FL scenarios and provides a variety of tools and components for building FL systems. It also includes support for various communication protocols and data formats, as well as integration with other Apache projects such as Apache Kafka and Apache Pulsar for data streaming. The project aims to make it easier to develop and deploy machine learning models in decentralized environments.<p style="text-align: left;">It's important to note that these are just examples and may not be the way your project should interact with Apache Wayang; check the documentation of the Apache Wayang project (<a href="https://wayang.apache.org" target="_blank">https://wayang.apache.org</a>) to see how to interact with it. 
I just want to point out how easy it is to use different languages to let Wayang and Spark interact.<br /><br />Also, you need to make sure that you have the correct permissions and credentials to interact with the Wayang API and make changes to the Spark cluster.</p><h3 style="text-align: left;">Wayang - Scala - Spark:</h3><div><span style="font-family: courier;">import org.apache.wayang.{Wayang, WayangClient}<br /><br />class SparkScaler(wayangUrl: String) {<br />  val wayang = new WayangClient(wayangUrl)<br /><br />  def scaleUp(numWorkers: Int): Unit = {<br />    wayang.addWorkers(numWorkers)<br />  }<br /><br />  def scaleDown(numWorkers: Int): Unit = {<br />    wayang.removeWorkers(numWorkers)<br />  }<br />}</span><br /><br />The <span style="font-family: courier;">SparkScaler</span> class takes a single parameter, the URL of the Wayang API endpoint, when it is initialized. The <span style="font-family: courier;">scaleUp()</span> method can be called to add a specified number of workers to the Spark cluster, and the <span style="font-family: courier;">scaleDown()</span> method can be called to remove a specified number of workers.<br /></div><h3 style="text-align: left;">Wayang - Python - Spark</h3><div><span style="font-family: courier;">from apache_wayang import Wayang<br /><br />class SparkScaler:<br />    def __init__(self, wayang_url):<br />        self.wayang = Wayang(wayang_url)<br /><br />    def scale_up(self, num_workers):<br />        self.wayang.add_workers(num_workers)<br /><br />    def scale_down(self, num_workers):<br />        self.wayang.remove_workers(num_workers)</span><br /><br />The <span style="font-family: courier;">SparkScaler</span> class takes a single parameter, the URL of the Wayang API endpoint, when it is initialized. 
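Using it is a one-liner per scaling action; a hypothetical session (the endpoint URL is made up) could look like this:<br /><span style="font-family: courier;">scaler = SparkScaler("http://wayang-host:8080/api")<br />scaler.scale_up(4)   # add four Spark workers before a heavy job<br />scaler.scale_down(4) # release them again afterwards</span><br />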
The <span style="font-family: courier;">scale_up()</span> method can be called to add a specified number of workers to the Spark cluster, and the <span style="font-family: courier;">scale_down()</span> method can be called to remove a specified number of workers.<br /><h3 style="text-align: left;"><span style="font-family: inherit;">Wayang - Java Streams - Spark</span></h3></div><div><div><span style="font-family: courier;">import org.apache.wayang.WayangClient;</span></div><div><span style="font-family: courier;">import java.util.stream.IntStream;</span></div><div><span style="font-family: courier;"><br /></span></div><div><span style="font-family: courier;">public class SparkScaler {</span></div><div><span style="font-family: courier;"> private WayangClient wayang;</span></div><div><span style="font-family: courier;"><br /></span></div><div><span style="font-family: courier;"> public SparkScaler(String wayangUrl) {</span></div><div><span style="font-family: courier;"> wayang = new WayangClient(wayangUrl);</span></div><div><span style="font-family: courier;"> }</span></div><div><span style="font-family: courier;"><br /></span></div><div><span style="font-family: courier;"> public void scaleUp(int numWorkers) {</span></div><div><span style="font-family: courier;"> IntStream.range(0, numWorkers).forEach(i -> wayang.addWorker());</span></div><div><span style="font-family: courier;"> }</span></div><div><span style="font-family: courier;"><br /></span></div><div><span style="font-family: courier;"> public void scaleDown(int numWorkers) {</span></div><div><span style="font-family: courier;"> IntStream.range(0, numWorkers).forEach(i -> wayang.removeWorker());</span></div><div><span style="font-family: courier;"> }</span></div><div><span style="font-family: courier;">}</span></div></div><div><br /></div>The <span style="font-family: courier;">SparkScaler</span> class takes a single parameter, the URL of the Wayang API endpoint, when it is initialized. 
The <span style="font-family: courier;">scaleUp() </span>method can be called to add a specified number of workers to the Spark cluster, and the <span style="font-family: courier;">scaleDown()</span> method can be called to remove a specified number of workers.<h3 style="text-align: left;">Iterate the K-Means clustering algorithm from Apache Wayang to TensorFlow</h3><div><span style="font-family: courier;">import org.apache.wayang.WayangClient;</span></div><div><div><span style="font-family: courier;">import org.tensorflow.Graph;</span></div><div><span style="font-family: courier;">import org.tensorflow.Session;</span></div><div><span style="font-family: courier;">import org.tensorflow.Tensor;</span></div><div><span style="font-family: courier;"><br /></span></div><div><span style="font-family: courier;">public class KMeansIteration {</span></div><div><span style="font-family: courier;"> private WayangClient wayang;</span></div><div><span style="font-family: courier;"> private Graph graph;</span></div><div><span style="font-family: courier;"> private Session session;</span></div><div><span style="font-family: courier;"><br /></span></div><div><span style="font-family: courier;"> public KMeansIteration(String wayangUrl, String modelPath) {</span></div><div><span style="font-family: courier;"> wayang = new WayangClient(wayangUrl);</span></div><div><span style="font-family: courier;"> graph = new Graph();</span></div><div><span style="font-family: courier;"> graph.importGraphDef(modelPath);</span></div><div><span style="font-family: courier;"> session = new Session(graph);</span></div><div><span style="font-family: courier;"> }</span></div><div><span style="font-family: courier;"><br /></span></div><div><span style="font-family: courier;"> public void iterate(Tensor input) {</span></div><div><span style="font-family: courier;"> Tensor wayangOutput = wayang.runKMeans(input);</span></div><div><span style="font-family: courier;"> Tensor tfOutput = session.runner().feed("input", wayangOutput).fetch("output").run().get(0);</span></div><div><span style="font-family: courier;"> // Perform further processing on tfOutput</span></div><div><span style="font-family: courier;"> }</span></div><div><span style="font-family: courier;">}</span></div></div><div><br /></div>The <span style="font-family: courier;">KMeansIteration </span>class takes two parameters, the URL of the Wayang API endpoint and the path of the TensorFlow model, when it is initialized. The <span style="font-family: courier;">iterate() </span>method can be called with an input Tensor, it will pass it to the Wayang's K-Means clustering algorithm, it will receive the output, and then will pass it to the TensorFlow's model as an input.<div><br /></div><div>That's are only examples to show how easy it can be to get started with FL and also get involved into Wayang as a developer. Also consider to contribute to the project, check the project under <a href="http://wayang.apache.org">wayang.apache.org</a> </div>Alexander Altenhttp://www.blogger.com/profile/17445753775154988351noreply@blogger.com1tag:blogger.com,1999:blog-4654282947003145260.post-27393325122591780672022-12-14T21:07:00.004+01:002022-12-14T21:12:55.657+01:00The A3 Home Decor<p>One of my new angel investments is an e-commerce retailer, so nothing new. But the founder has some real drive; she wants to deliver sustainable and reusable tableware, or more table couture. 
Yes, there are a few out there, but the typical ones mostly use plastic, paper, and other non-sustainable decor or tableware.</p><p>The A3 - <a href="https://the-a3.com" target="_blank"><b>https://the-a3.com</b></a> - has a better idea, as Alexandra, founder and CEO, explains: </p><p></p><blockquote>A3 is an eco lifestyle home decor and tablecloth brand where design and quality are always top of mind. With A3, you make your home truly a home. </blockquote><p></p><p>Okay - that was the pitch. So why now, and how do you get customers? How do you build a chain of trust in the crowded home interior and decor market? Here's some brain food:</p><p>In 2021, the size of the world market for home decor was estimated to be USD 665 billion. By 2030, it is anticipated to reach USD 940 billion, expanding at a CAGR of 3.9% (2022-2030). Home décor products like furniture and textiles are installed in various environments, including spas, offices, clean rooms, restaurants, campsites, bedrooms, the outdoors, libraries, and stores. Products for home decor range from furniture to floor coverings and domestic linens. Furniture designs can be altered using both machine-based and handcrafted techniques, depending on the intended application. Currently, the worldwide home decor market is expanding due to the growing real estate sector. </p><p>Additionally, the increase in compact dwellings has promoted the use of products with extra storage features and improved the visual appeal of residences. A crucial factor in the expansion of the global market is the growing customer preference for eco-friendly home design items, which is a result of growing environmental concerns.</p><p>You see, we talk about a market of almost <b>ONE TRILLION</b> dollars in 2030. This market moves toward more sustainable, eco-friendly products like hemp napkins or placemats made from vegan leather (not PVC or silicate-infused hemp) or felt. Table runners, tableware, candle holders, carafes, glassware - you name it. And all of those products are small buys; none of them carries a high price tag, which means they get bought pretty easily. Now, according to market research and customer journeys, there is no big retailer who could deliver the dream of every coolly designed home. The big ones have long-standing supplier relationships, and adding new products at scale fails when the supplier market can't deliver. And here's the niche: grow with the audience and market, and deliver sustainable bowls for vegan cooking, crafted dinner table settings, and ethically produced eco-friendly home decor. Pairing this with excellent blogging, tips, and guides, I see a strong demand in the next few years - a demand that The A3 can uphold, master, and even define.</p><p></p><blockquote>Our manufacturing partners hold some of the highest standards in social and environmental responsibility with the certifications to back them up. We regularly review our partners and shipping warehouses to make sure that we uphold our values.</blockquote><p></p><blockquote>We present to you carefully selected products from across the world. We chose an exquisite range of handcrafted and unique products to honor the craft and the craftspeople behind it! We are passionate about our responsibility towards our planet and are committed to do our part as best as we can. 
</blockquote><p></p><blockquote> Alexandra Alten, Founder and CEO of The A3</blockquote><p></p><p><br /></p><p><br /></p>Alexander Altenhttp://www.blogger.com/profile/17445753775154988351noreply@blogger.com0tag:blogger.com,1999:blog-4654282947003145260.post-79707069260111128452022-09-22T14:48:00.006+02:002022-09-22T14:54:29.000+02:00Get Apache Wayang ready to test within 5 minutes <p>Hey followers, I often get asked how to get Apache Wayang (<a href="https://wayang.apache.org" target="_blank"><b>https://wayang.apache.org</b></a>) up and running without having a full big data processing system behind it. We heard you: we built a full-fledged Docker container, called BDE (Blossom Development Environment), which is basically Wayang. Here's the repo: <a href="https://github.com/databloom-ai/BDE">https://github.com/databloom-ai/BDE</a></p><p>I made a short screencast showing how to get it running with Docker on OSX, and we also made two hands-on videos to explain the first steps. Let's start with the basics - Docker. Get the whole platform with:</p><span style="font-family: courier;">docker pull ghcr.io/databloom-ai/bde:main</span><div><span style="font-family: courier;"><br /></span></div><div><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiZ5OXvkELwmVRqzz8-pxGNuEetApT22ZWcmEg48a8IYEv9fPZ9vU-B2077wyKAdx-hpknE_XyNk2C67wLS8rJGxYZ8SGQdYwccayEKIwMEDvMvHthJEYIAYtdZMFHrWN4l4SshbEf0FrPii1xrMFcdjnbvAT2j0iZudQu3AKfs7dm98xt9JjwyaGcQrA/s1197/bde.gif" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="804" data-original-width="1197" height="444" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiZ5OXvkELwmVRqzz8-pxGNuEetApT22ZWcmEg48a8IYEv9fPZ9vU-B2077wyKAdx-hpknE_XyNk2C67wLS8rJGxYZ8SGQdYwccayEKIwMEDvMvHthJEYIAYtdZMFHrWN4l4SshbEf0FrPii1xrMFcdjnbvAT2j0iZudQu3AKfs7dm98xt9JjwyaGcQrA/w660-h444/bde.gif" width="660" /></a></div><br />At the end, the Jupyter notebook address is shown; control-click on it (OS X) and the browser should open and log you in automatically:</div><div><br /></div><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjODDYn9NmoVnJAwaP9uWFRgEqR6AxTjGfI8RW1iAHzWcoTwVEJ_owxy8WNWsNXwvvnCDJuPUYe80TgEHk5xBlWadetu5hRCig_u2wE7BGH7QZsEgfuMc0xgvLlmdxsUxQHXEXg0Gx7i4827Ta58yDGH430-NTdrEbL6yhiEBNOInhO6pkEqH9EaM7e9A/s1882/wayang-jupyter.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="710" data-original-width="1882" height="242" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjODDYn9NmoVnJAwaP9uWFRgEqR6AxTjGfI8RW1iAHzWcoTwVEJ_owxy8WNWsNXwvvnCDJuPUYe80TgEHk5xBlWadetu5hRCig_u2wE7BGH7QZsEgfuMc0xgvLlmdxsUxQHXEXg0Gx7i4827Ta58yDGH430-NTdrEbL6yhiEBNOInhO6pkEqH9EaM7e9A/w640-h242/wayang-jupyter.png" width="640" /></a></div><br /><div><br />Voilà - done. You now have a fully working Wayang environment, and we prepared three notebooks to make it easier to dive in. Watch our development tutorial video (part 1) to get a better understanding of what Wayang can and cannot do. 
Click the video below: </div><div><br /></div><div class="separator" style="clear: both; text-align: center;"><iframe allowfullscreen='allowfullscreen' webkitallowfullscreen='webkitallowfullscreen' mozallowfullscreen='mozallowfullscreen' width='653' height='459' src='https://www.blogger.com/video.g?token=AD6v5dx6dqx5ncW3refR1XaPieiI9VaKfWGx87zwLfA6uL84_hGDyv0A9CEIX8Kw1Ja4tWTJGfGQlDysKeKgA0jXlA' class='b-hbp-video b-uploaded' frameborder='0'></iframe></div><br /><div><br /></div>Alexander Altenhttp://www.blogger.com/profile/17445753775154988351noreply@blogger.com0Miami, FL, USA25.7616798 -80.1917902-2.5485540361788459 -115.3480402 54.071913636178849 -45.0355402tag:blogger.com,1999:blog-4654282947003145260.post-5432351482049732582022-07-27T17:40:00.001+02:002022-09-22T17:09:25.654+02:00Combined Federated Data Services with Blossom and Flower<p>When it comes to Federated Learning frameworks, we typically find two leading open source projects: Apache Wayang [2] (maintained by <a href="http://databloom.ai" target="_blank">databloom</a>) and Flower [3] (maintained by <a href="https://adap.com/en" target="_blank">Adap</a>). At first view, both frameworks seem to do the same. But, as usual, the second view tells another story.</p><b>How does Flower differ from Wayang?<br /></b><br />Flower is a federated learning system written in Python that supports a large number of training and AI frameworks. The beauty of Flower is the strategy concept [4]: the data scientist can define which dedicated framework is used, and how. Flower delivers the model to the desired framework, watches the execution, gets the calculations back, and starts the next cycle. That makes Federated Learning in Python easy, but at the same time it limits the use to platforms supported by Python. <div>Flower has, as far as I could see, no data query optimizer; an optimizer understands the code and splits the model into smaller pieces to use multiple frameworks at the same time (<i>model parallelism</i>). </div><div><br /></div><div>And here we have the ideal touchpoint between Blossom and Flower.<br /><br /><b>Combine Blossom and Flower and build a Federated Data Science NLP Stack<br /></b><br />How do you build a chatbot system that serves multiple functions and customers across the world, like in a bank? A chatbot stack typically uses NLP combined with multiple data sources to provide natural communication between humans and machines. The demand for machine-human interaction and human-based communication has increased considerably, and Gartner's forecasts are a testament to it.</div><div><br /></div><div>"Natural language processing is a subfield of linguistics, computer science, and artificial intelligence concerned with the interactions between computers and human language, in particular how to program computers to process and analyze large amounts of natural language data" (<a href="https://en.wikipedia.org/wiki/Natural_language_processing" target="_blank">Wikipedia</a>).</div><div><br /></div><div>The typical infrastructure we have to take into account is like an overgrown forest: we have multiple data sources, typically ranging from data warehouses and RDBMS systems to pretty closed data sources like financial transaction stores, customer bank data, credit scores, etc. The sources are mostly not the most modern, and sometimes they don't even have connection points - like DWH systems, which typically run at 90+% utilization.<br /><br />Here comes Blossom into the game. 
With Blossom we can connect to each of those systems (if desired and needed), and we can use already available data processing frameworks and engines like Spark, Kafka or Flink (and their commercial counterparts) without blowing up the engineering team.<br /><br />Now the fun part with Flower: we plug Flower into Blossom, and voilà - problem solved! The architecture could look like this:<br /><br /><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh9pcOLe04qEfXJ4dSZg6JH2XC8kW-UzluoIPi4nPzChZCbysxtRj_uX2ApRbV6vcPRs8IJIglHQI4YNw4ePUwmbaWHoMithmh4gWG-ilzBb89GdaxCJiS6Ep2txD57cus9KSyLREhl3LWf5BXpVhdsNOYEB_0IRkB_VOyd6VOISInF8nPlAfJ10tltNw/s4037/fl1png.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="2259" data-original-width="4037" height="358" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh9pcOLe04qEfXJ4dSZg6JH2XC8kW-UzluoIPi4nPzChZCbysxtRj_uX2ApRbV6vcPRs8IJIglHQI4YNw4ePUwmbaWHoMithmh4gWG-ilzBb89GdaxCJiS6Ep2txD57cus9KSyLREhl3LWf5BXpVhdsNOYEB_0IRkB_VOyd6VOISInF8nPlAfJ10tltNw/w640-h358/fl1png.png" width="640" /></a></div><div><br /></div><div>To connect Blossom with Flower we just need a few lines of code:</div><div><br /><span style="font-family: courier;">import blossom as bls<br />import flwr as fl<br />import tensorflow as tf<br /><br />context = bls.context(env="federated")<br /><br />transactions = context.read("url to transaction") \<br />    .filter( transactionFilter )<br /><br />input_flower = context.read("url to customer table") \<br />    .filter( customerFilter ) \<br />    .join (transactions ) \<br />    .map ( convertToVector ) \<br />    .toNumpy()<br /><br />context.runFlower(<br />    input_flower,<br />    server=fl.server.start_server("0.0.0.0:8080", config={"num_rounds": 3}),<br />    client=fl.client.start_numpy_client("0.0.0.0:8080", client=FlowerImplementedClient()),<br />    flowerEngine=tf<br />)</span><br /><br /></div>We call this stack <b>Combined NLP Federated Data Services</b>. 
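<br /><br />The <span style="font-family: courier;">FlowerImplementedClient</span> used above is not shown in the snippet. A minimal sketch, following Flower's <span style="font-family: courier;">NumPyClient</span> pattern and assuming a Keras model plus local training data (exact method signatures vary slightly between flwr versions), could look like this:<br /><br /><span style="font-family: courier;">class FlowerImplementedClient(fl.client.NumPyClient):<br />    def __init__(self, model, x_train, y_train):<br />        self.model = model<br />        self.x_train, self.y_train = x_train, y_train<br /><br />    def get_parameters(self):  # current local weights<br />        return self.model.get_weights()<br /><br />    def fit(self, parameters, config):  # one round of local training<br />        self.model.set_weights(parameters)<br />        self.model.fit(self.x_train, self.y_train, epochs=1, batch_size=32)<br />        return self.model.get_weights(), len(self.x_train), {}<br /><br />    def evaluate(self, parameters, config):  # report metrics back to the server<br />        self.model.set_weights(parameters)<br />        loss, accuracy = self.model.evaluate(self.x_train, self.y_train)<br />        return loss, len(self.x_train), {"accuracy": accuracy}</span><br /><br />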
Flower takes care of the chatbot communication, the ML model, and the execution with TF (TensorFlow) or any other supported ML framework, and delivers the outcome to Blossom. Blossom then takes care of enriching the model with information from deeper backend systems and gives the output back to Flower, and Flower takes care of the next iteration with TensorFlow (TF). </div><div>This architecture is the backbone of an extensive NLP system using the best tools available for Federated Learning. This stack is also future-proof: both frameworks have been built with pluggable extension support from the beginning. That means whatever comes in the future, this stack can handle it. Even quantum-computing-based AI training could be adopted as a plugin.</div><div><br /><b>Conclusion:</b><br />Building cutting-edge AI and machine learning / NLP stacks is no longer an area only the biggest data companies in the world can handle. With this approach we guarantee data sustainability and unmatched data privacy, and we enable digital transformation on a completely new level.<span><br /><br />[1] <a href="https://cacm.acm.org/magazines/2020/12/248796-federated-learning-for-privacy-preserving-ai/fulltext">https://cacm.acm.org/magazines/2020/12/248796-federated-learning-for-privacy-preserving-ai/fulltext</a><br />[2] <a href="https://wayang.apache.org/documentation.html">https://wayang.apache.org/documentation.html</a><br />[3] <a href="https://github.com/adap/flower">https://github.com/adap/flower</a></span></div><div><span>[4] <a href="https://flower.dev/docs/implementing-strategies.html">https://flower.dev/docs/implementing-strategies.html</a></span></div></div><div><br /></div><div>*** This post was originally published in our databloom.ai <a href="https://engineering.databloom.ai/2022/01/combined-federated-data-services-with.html" target="_blank"><b>blog</b></a> ***</div>Alexander Altenhttp://www.blogger.com/profile/17445753775154988351noreply@blogger.com1tag:blogger.com,1999:blog-4654282947003145260.post-47130507215358761392021-09-29T15:08:00.054+02:002021-10-19T11:12:22.408+02:00Compile Apache Wayang on Mac M1 We are releasing <a href="https://wayang.apache.org/" target="_blank">Apache Wayang</a> v0.6.0 in the next few days, and during release testing I was wondering whether we could get Wayang running on the M1 (ARM). And yes, a few small changes - voilà!<br /><br />Install maven, scala, sqlite and groovy:<br /><span style="font-family: courier;"> brew install maven scala groovy sqlite</span><div><span style="font-family: courier;"><br /></span>Download openJDK 8 for M1: <br /><a href="https://www.azul.com/downloads/?version=java-8-lts&os=macos&architecture=arm-64-bit&package=jdk" target="_blank">https://www.azul.com/downloads/?version=java-8-lts&os=macos&architecture=arm-64-bit&package=jdk</a> and install the pkg. 
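<br /><br />To double-check that the ARM JDK landed where expected, macOS's <span style="font-family: courier;">java_home</span> helper can list it (the exact path and version string may differ):<br /><span style="font-family: courier;">/usr/libexec/java_home -v 1.8</span><br />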
<div><br /><div>Get Apache Wayang either from <a href="https://dist.apache.org/repos/dist/dev/wayang/" target="_blank">https://dist.apache.org/repos/dist/dev/wayang/</a>, or git-clone it directly:</div><div><span style="font-family: courier;"><br /></span></div><div><span style="font-family: courier;">git clone https://github.com/apache/incubator-wayang.git</span></div><div><br /></div><div>Start the build process:</div><div><br /></div><div><span style="font-family: courier;">cd incubator-wayang</span><br /><span style="font-family: courier;">export JAVA_HOME=/Library/Java/JavaVirtualMachines/zulu-8.jdk/Contents/Home</span></div><div><span style="font-family: courier;"><br />mvn clean install</span></div><div><span style="font-family: courier;"><br /></span></div>Ready to go:<div><br /><div><span style="font-family: courier;"><span><div>[INFO] Reactor Summary for Apache Wayang 0.6.0-SNAPSHOT:</div><div>...</div></span><div>[INFO] BUILD SUCCESS</div><div>[INFO] ------------------------------------------------------------------------</div><div>[INFO] Total time: 06:24 min</div></span></div></div></div><div><span style="font-family: courier;"><br /></span></div>After the build is done, the binaries are located in Maven's local repository:<br /><span style="font-family: courier;">~/.m2/repository/org/apache/wayang</span><div><span style="font-family: courier;"><br /></span></div><div>Edit: After we pushed <a href="https://github.com/apache/incubator-wayang/commit/b8afb0c8c32f1bd6bd9b9f29416a697826be4f79#diff-9c5fb3d1b7e3b0f54bc5c4182965c4fe1f9023d449017cece3005d3f90e8e4d8" target="_blank">https://github.com/apache/incubator-wayang/commit/b8afb0c8c32f1bd6bd9b9f29416a697826be4f79#diff-9c5fb3d1b7e3b0f54bc5c4182965c4fe1f9023d449017cece3005d3f90e8e4d8</a>, Apache Wayang works with JDK11 and Windows! Woot!</div>Alexander Altenhttp://www.blogger.com/profile/17445753775154988351noreply@blogger.com4tag:blogger.com,1999:blog-4654282947003145260.post-74734855947611116292020-12-27T18:04:00.004+01:002020-12-27T18:04:27.078+01:00Stream IoT data to S3 - the simple way First, a short introduction to <a href="https://github.com/infinimesh"><b>infinimesh</b></a>, an Internet of Things (IoT) platform which runs completely in <a href="https://kubernetes.io/"><b>Kubernetes</b></a>: <div>infinimesh enables the seamless integration of the entire IoT ecosystem independently from any cloud technology or provider. infinimesh easily manages millions of devices in a compliant, secure, scalable and cost-efficient way without vendor lock-ins.<br />We released some <b><a href="https://github.com/infinimesh/plugins#plugins-and-connectors">plugins</a> </b>over the last weeks - a task we had on our roadmap for a while. 
Here is what we have so far:</div><div><br /><ul style="text-align: left;"><li><a href="https://github.com/infinimesh/plugins/blob/main/Elastic"><b>Elastic</b></a><br />Connect infinimesh IoT seamlessly into <a href="https://elastic.co/"><b>Elastic</b></a>.<br /><br /></li><li><a href="https://github.com/infinimesh/plugins/blob/main/timeseries"><b>Timeseries</b></a><br /><a href="https://oss.redislabs.com/redistimeseries/"><b>Redis-timeseries</b></a> with <b><a href="https://grafana.com/">Grafana</a> </b>for time series analysis and rapid prototyping; it can be used in production when configured as a Redis cluster, and is ready to be hosted via <b><a href="https://redislabs.com/redis-enterprise-cloud/overview/">Redis-Cloud</a></b>.<br /><br /></li><li><a href="https://github.com/infinimesh/plugins/blob/main/SAPHana"><b>SAP Hana</b></a><br />All code to connect the infinimesh IoT Platform to any <a href="https://www.sap.com/products/hana.html"><b>SAP Hana</b></a> instance.<br /><br /></li><li><a href="https://github.com/infinimesh/plugins/blob/main/Snowflake"><b>Snowflake</b></a><br />All code to connect the infinimesh IoT Platform to any <b><a href="https://www.snowflake.com/">Snowflake</a> </b>instance.<br /><br /></li><li><a href="https://github.com/infinimesh/plugins/blob/main/CloudConnect"><b>Cloud Connect</b></a><br />All code to connect the infinimesh IoT Platform to the public cloud providers AWS, GCP and Azure. This plugin enables customers to use their own cloud infrastructure and extend infinimesh to other services, like <a href="https://www.scalytics.io/"><b>Scalytics</b></a>, using their own cloud native data pipelines and integration tools.</li></ul></div><div>We have chosen Docker as the main technology because it enables our customers to run their own plugins in their own space, in their controlled environment. And since our plugins don't consume many resources, they fit perfectly into the free tiers of AWS EC2 - I use them in this blog post. <br />The plugin repository was structured with developer friendliness in mind. All code is written in Go, and the configuration is done via the dockerfiles. Since you need to put credentials into them, we strongly advise running the containers in a controlled and secure environment. </div><div><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhiEAviEZVD_fNecREvRpY2PUm00AqYOwekYgcgB1Yw163KeGVzJvXzJKSyiZkTccCNz-0aSwrJ2gZssbe3WAeeApgBfdD4mr_E9dUxiDXQMB_ewxvjqI6QLP3xyPKZyYGzw6XB15aHxKTv/" style="clear: right; float: right; margin-bottom: 1em; margin-left: 1em;"><img alt="infinimesh UI" data-original-height="1007" data-original-width="1043" height="386" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhiEAviEZVD_fNecREvRpY2PUm00AqYOwekYgcgB1Yw163KeGVzJvXzJKSyiZkTccCNz-0aSwrJ2gZssbe3WAeeApgBfdD4mr_E9dUxiDXQMB_ewxvjqI6QLP3xyPKZyYGzw6XB15aHxKTv/w400-h386/image.png" title="infinimesh UI" width="400" /></a></div><h4 style="text-align: left;">Stream IoT data to S3</h4>Here I'd like to show how easy it is to combine IoT with already installed infrastructures in public clouds. 
The most common task, we figured, is streaming data to S3; most of our customers use S3 either directly on AWS, or by implementing their own object storage that speaks the S3 protocol, like <a href="https://min.io/"><b>MinIO</b> </a>- which is also Kubernetes native.<br /><br />Of course, a private installation of infinimesh or accounts on <a href="https://console.infinimesh.cloud/"><b>infinimesh.cloud</b></a> and AWS are needed if you use the cloud versions of both. Here is a screenshot from the SMA device I used to write this post:</div><div><h4 style="text-align: left;">Preparation</h4><ol style="text-align: left;"><li>Spin up an EC2 instance in the free tier with Linux; a t2.micro instance should fit most needs</li><li>Log into the VM and install Docker as described in the AWS documentation: <a href="https://docs.aws.amazon.com/AmazonECS/latest/developerguide/docker-basics.html"><b>Docker basics for Amazon ECS - Amazon Elastic Container Service</b></a></li><li>Install docker-compose and git:<br /><br /><span style="font-family: courier;">sudo curl -L \ <br />https://github.com/docker/compose/releases/latest/download/docker-compose-$(uname -s)-$(uname -m)\<br />-o /usr/local/bin/docker-compose \<br />&& sudo chmod +x /usr/local/bin/docker-compose \<br />&& sudo yum install git -y</span></li></ol></div><div><br />That's all we need as preparation; now log off and log back in to pick up the permissions we set earlier. </div><div><h4 style="text-align: left;">Setup and Run</h4><ol style="text-align: left;"><li>Clone the plugin repo:<br /><span style="font-family: courier;">git clone <a href="https://github.com/infinimesh/plugins.git"><b>https://github.com/infinimesh/plugins.git</b></a></span></li><li>Edit <span style="font-family: courier;">CloudConnect/docker-compose.yml</span> and replace <span style="font-family: courier;">CHANGEME</span> with your credentials</li><li>Compose and start the connector (-d detaches from the console and lets the containers run in the background):<br /><span style="font-family: courier;">docker-compose -f CloudConnect/docker-compose.yml --project-directory . up --build -d</span></li><li>Check the container logs:<br /><span style="font-family: courier;">docker logs plugins_csvwriter_1 -f</span></li></ol>We used Go as the development language, so the resource consumption is low:</div><div><br /></div><div><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh69_GSG6iSPXr8ICV_nDUpYDkTWLkLwXAeOuhwcGDjmPEHxXjh5Y6T9-hYBYK3RvDtRa0LtlvaNAQhZmk2aXm0cKZdMtUUJII8P1Jt7t9_J8e3D0TigfnJN3KN9lb0tod3pSUhyphenhyphenhJtJvt4/" style="margin-left: 1em; margin-right: 1em;"><img alt="" data-original-height="81" data-original-width="1510" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh69_GSG6iSPXr8ICV_nDUpYDkTWLkLwXAeOuhwcGDjmPEHxXjh5Y6T9-hYBYK3RvDtRa0LtlvaNAQhZmk2aXm0cKZdMtUUJII8P1Jt7t9_J8e3D0TigfnJN3KN9lb0tod3pSUhyphenhyphenhJtJvt4/s16000/image.png" /></a></div><br /></div><div>After one minute the first CSV file should be arriving in S3. 
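A quick way to verify this is to list the bucket with the AWS CLI (assuming the CLI is configured and <span style="font-family: courier;">YOUR-BUCKET</span> is the bucket set in the compose file):<br /><span style="font-family: courier;">aws s3 ls s3://YOUR-BUCKET --recursive | tail -5</span><br />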
That’s all - easy and straightforward.</div><div><br /></div><div><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiwVJB5KIoQoANUSUcZ0K-uMvbgLgwRqRTR7D4zvkHdhUGt9QT-Mvyx53QCbj62MGQlC9avGQykD7zKyaLz11UV4C-7g0ltY6c1dGNHUJgYh6ieLxFgmfGQyf3x1inugqDW246nO4PR4QWd/" style="margin-left: 1em; margin-right: 1em;"><img alt="" data-original-height="398" data-original-width="991" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiwVJB5KIoQoANUSUcZ0K-uMvbgLgwRqRTR7D4zvkHdhUGt9QT-Mvyx53QCbj62MGQlC9avGQykD7zKyaLz11UV4C-7g0ltY6c1dGNHUJgYh6ieLxFgmfGQyf3x1inugqDW246nO4PR4QWd/s16000/image.png" /></a></div><br /><br /><h4 style="text-align: left;"> Some developer internals</h4>We built some magic around it to make our plugins as easy as possible to use for customers and, at the same time, easy to adapt for developers.</div><div> <p style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;"><b>How it works:</b></p>First we iterate over /objects to find all endpoints marked with [<span style="font-family: courier;">device</span>], call the API for each device, and store the data as a sliding window in a local Redis store to buffer network latency. After a few seconds we send the captured data as CSV to the desired endpoints. In our tests we transported data from up to 2 million IoT devices over this plugin; each of those devices sends ten key:value pairs as JSON every 15 seconds. <br /></div>Alexander Altenhttp://www.blogger.com/profile/17445753775154988351noreply@blogger.com4tag:blogger.com,1999:blog-4654282947003145260.post-25465244187849607382020-04-16T08:44:00.000+02:002020-04-16T09:49:15.856+02:00Embedded Linux won't reboot - how to fix and repair I have a lot of embedded systems running in our lab or at home, all of them either Raspberry Pis or self-made PCBs running Yocto. Sometimes I can't reboot some of them; I think it's <span style="font-family: "courier new" , "courier" , monospace;">journald </span>causing issues with the SD cards. The error message usually is:<br />
<span style="font-family: "courier new" , "courier" , monospace;">Failed to open /dev/initctl</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"><br /></span>
Anyhow, if you have this issue - a reboot can be force-forced:<br />
<span style="font-family: "courier new" , "courier" , monospace;">systemctl --force --force reboot</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"><br /></span>
<span style="font-family: inherit;">Since a forced reboot does not sync the journal, the system typically comes up with a damaged FS. The remote fsck can be initiated by extending the command above with</span><br />
<span style="font-family: inherit;"><br /></span>
<span style="font-family: "courier new" , "courier" , monospace;">sudo tune2fs -i 1m /dev/DISK && touch /forcefsck && systemctl --force --force reboot</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"><br /></span><span style="font-family: inherit;">(assumed you have access to a shell, via SSH or local access). When all goes fine, the system comes up with a clean FS. All this fuss comes from the SSD r/w actions, a well designed IoT embedded devices should have a flash mem part for the running OS.</span>Alexander Altenhttp://www.blogger.com/profile/17445753775154988351noreply@blogger.com6tag:blogger.com,1999:blog-4654282947003145260.post-80914622498757908742020-03-30T16:24:00.001+02:002020-04-16T09:49:48.652+02:00How to fix "Problem with MergeList /var/lib/apt/lists/"<div>
Every now and then - and lately more often - an update on a Raspberry Pi fails with: "Problem with MergeList ..."</div>
<div>
<br /></div>
<span style="font-family: "courier new" , "courier" , monospace;">sudo mv /var/lib/apt/extended_states /var/lib/apt/extended_states_tmp && rm -rf /var/lib/apt/lists/* -vf && sudo apt-get update</span><br />
<div>
<br /></div>
<div>
solves the issue.<br />
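<br />The first <span style="font-family: "courier new" , "courier" , monospace;">mv</span> only parks apt's record of auto-installed packages out of the way. Once the update runs clean again, you can put it back (assuming that file itself was not the corrupted one):<br /><span style="font-family: "courier new" , "courier" , monospace;">sudo mv /var/lib/apt/extended_states_tmp /var/lib/apt/extended_states</span><br />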
<div>
<br /></div>
<div>
<br /></div>
</div>
Alexander Altenhttp://www.blogger.com/profile/17445753775154988351noreply@blogger.com0tag:blogger.com,1999:blog-4654282947003145260.post-60162326485286451212020-03-08T09:47:00.003+01:002023-06-14T15:43:29.192+02:00Kubernetes - delete all pods with a certain status <br /><br />"How to delete all pods with a status like Error" - that is a question I often get. The DevOps answer is a one-liner, filtering on status.reason for the status we want to match, for example:<br /><br /><span style="font-family: Courier New, Courier, monospace;">kubectl get po --all-namespaces -o json | jq '.items[] | select(.status.reason!=null) | select(.status.reason | contains("Evicted")) | "kubectl delete po \(.metadata.name) -n \(.metadata.namespace)"' | xargs -n 1 bash -c</span><div>
<br /></div>
<div>
In that case we delete all pods with the status "Evicted", but it works for any status-based filtering. </div>
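<div><br /></div><div>A shorter alternative if the pod phase is enough for you: evicted pods end up in phase "Failed", so a field selector does the job without jq (a sketch, assuming a reasonably recent kubectl):</div><div><br /></div><div><span style="font-family: Courier New, Courier, monospace;">kubectl delete pods --field-selector=status.phase=Failed --all-namespaces</span></div>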
Alexander Altenhttp://www.blogger.com/profile/17445753775154988351noreply@blogger.com1tag:blogger.com,1999:blog-4654282947003145260.post-51048176805370667202020-01-04T10:23:00.000+01:002020-01-04T10:25:01.471+01:00Why I ditched my MacBook and switched to Windows 10I have to admit, I'm an Apple guy. I had everything, from Pixelbooks to Pixel phones, Blackberry, OnePlus, Google Home, Amazon Fire, Alexa, Surface - but in the end I always ended up in Apple's universe: Apple TV, HomePod, iPhone, MacBook, Watch, earphones.<br />
Well, no MacBook anymore. That was my most annoying experience with Apple - a two-year-old 15-inch MBP in max configuration, worth 3.8k EUR, got the flicker bug. I wrote about it in a blog post before. Apple did not accept that as a warranty case, so I sold it on eBay and went for a Dell XPS. Long story short.<br />
<br />
Now, as a Windows hater and open source follower, I tinkered two weeks or so to get Linux running and ended up with Ubuntu plus Pop!_OS additions. A nice workhorse, but the battery life was no life at all - it was a fast-draining nightmare. Even in standby, lid closed, the battery was empty after one night. Apple does that much better, I have to admit. So I googled, read Reddit and HackerNews, and decided yesterday evening to ditch Linux and install Windows. Why? One thing - <a href="https://github.com/microsoft/terminal" target="_blank">Windows Terminal</a> (Preview), an iTerm2 for Windows.<br />
I have used Windows 10 often; again and again I switched and worked with it for a few months. The main things that put me off Win10 were the terminal emulator, the non-working mail and calendar apps, the useless browser, and the colorful but useless tiles in the Start Menu.<br />
That changed in the last week. Now, with Terminal, the on-board mail and calendar tools (not Outlook - I use Windows Mail and Calendar because they work with G Suite), Dark Theme, Visual Studio Code (my favorite code editor for a year or so, works on Linux as well as OSX), Edge Beta (the Chromium thing) and the seamless integration of my iPhone via Dell Mobile Connect, I have a machine which looks better than a MacBook, has much better integration and, at 1.5k EUR, is much cheaper than an MBP - and all the new stuff is Open Source!<br />
All my coding stuff works like a charm: I installed the GCP tooling and Go in my Linux subsystem, and tuned Terminal with an opaque background, Linux as the default shell, and my Linux home directory as the starting directory. My Terminal config is in my <a href="https://github.com/2pk03/shell/blob/master/terminal2-windows.json" target="_blank">github repo</a>.<br />
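The relevant profile settings look roughly like this - an illustrative excerpt, not my full config; the exact key names may differ between Terminal preview versions, and the distro name and user are placeholders:<br /><span style="font-family: "courier new" , "courier" , monospace;">"defaultProfile": "{guid-of-the-wsl-profile}",<br />"profiles": [{<br />"name": "Ubuntu",<br />"startingDirectory": "//wsl$/Ubuntu/home/USER",<br />"useAcrylic": true,<br />"acrylicOpacity": 0.85<br />}]</span><br />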
<br />
In the end, I think I will stick with that config for a while. The screen is brilliant, and Dell's hardware support is awesome.Alexander Altenhttp://www.blogger.com/profile/17445753775154988351noreply@blogger.com1tag:blogger.com,1999:blog-4654282947003145260.post-37874668605758449292019-11-30T13:28:00.000+01:002019-11-30T13:28:40.112+01:00Dell XPS 13 7390 late 2019 and LinuxSince my three-year-old 15-inch MacBook Pro got the <a href="https://bit.ly/2R5Jvo8" target="_blank">flickering</a> bug, I decided to buy a Dell laptop. Why? I have a really old Dell laptop from 2004 (!) which works perfectly as an OpenBSD firewall and internet router in my weekend house.<br />
<br />
After I got my machine, the first thing was to install PopOS. Or better: I <a href="https://bit.ly/2DwDuJk" target="_blank">tried</a>. It never worked; I think it's a bug in the loader section of PopOS, a <a href="https://github.com/pop-os/pop/issues/760" target="_blank">bugreport</a> is open. After some hours of hacking and fixing I got a bit bored and used an Ubuntu installation, which worked perfectly out of the box. Yes, I know I could have bought the XPS from Dell with Ubuntu on it. But I did not want to ;)<br />
<br />
Here are some tricks to make the XPS smooth to work with - right now I have the notebook attached to my curved 4K monitor, coding some infrastructure stuff for <a href="https://github.com/infinimesh/infinimesh" target="_blank">infinimesh</a>.<br />
<br />
1. Bluetooth mouse<br />
The first time a BT mouse is connected, it lags: moving the mouse, the pointer follows after seconds, slow and unresponsive. Fix it via <span style="font-family: "courier new" , "courier" , monospace;">sudo vi /var/lib/bluetooth/<MAC Adapter>/<MAC mouse></span> and add:<br />
<br />
<span style="font-family: "courier new" , "courier" , monospace;">[ConnectionParameters]<br />MinInterval=6<br />MaxInterval=9<br />Latency=44<br />Timeout=216</span><br />
<br />
<span style="font-family: inherit;">Reboot, and the mouse works like a charm.</span><br />
<br />
<span style="font-family: inherit;">2. Power Saving</span><br />
<span style="font-family: inherit;">I'm used to closing the laptop lid and having the system go to sleep; Apple perfected that. By default, closing the lid just switches the display off, which is not a power-saving mode. For proper power management, install <span style="font-family: "courier new" , "courier" , monospace;">tlp</span> via <span style="font-family: "courier new" , "courier" , monospace;">sudo apt-get install tlp</span>. The default rules are pretty sane, but can be tweaked via <span style="font-family: "courier new" , "courier" , monospace;">vi /etc/default/tlp</span>.</span><br />
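<br /><span style="font-family: inherit;">If a closed lid still doesn't suspend the machine, systemd-logind's lid handling is worth a look - a minimal sketch (HandleLidSwitch is a standard logind.conf key; picking suspend as the action is my assumption, and restarting logind may end your session, so a reboot works too):</span><br /><span style="font-family: "courier new" , "courier" , monospace;"># make a closed lid trigger suspend<br />sudo sed -i 's/^#\?HandleLidSwitch=.*/HandleLidSwitch=suspend/' /etc/systemd/logind.conf<br />sudo systemctl restart systemd-logind</span><br />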
<br />
3. UI<br />
Ya, what to say. I don't like Ubuntu's visual artwork. That's why I wanted to install PopOS. But my friends from <a href="https://system76.com/" target="_blank">System76</a>, the makers of PopOS, have a nice <a href="https://pop.system76.com/docs/install-pop-theme/" target="_blank">blog post</a> on how to get the most of PopOS into a vanilla Ubuntu.<br />
<br />
4. Software<br />
<div class="separator" style="clear: both; text-align: center;">
</div>
I tested a lot of mail clients, calendars and other collaboration tools and ended up, I'd say as usual, with Evolution. It works well with Google, Outlook and NextCloud calendars, and with mail systems like mail-in-a-box. To get a unified inbox in Evolution, just create a search folder, name it Unified Inbox and search for read and unread messages across all accounts. Voilà, unified inbox. <br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhOEDtt4WtgaXvY7JE_QbDiBvV7MGg44bwxBDprOwQuUefgsZPjdachjOg6ClIgeuylilkqtiblxBFDxh643jsUoPCCG791w9FmtIiCGo42T-W3M73sUeQH18_eBk71DsVyaDsh0Nduo9MW/s1600/Screenshot+from+2019-11-30+12-35-02.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="371" data-original-width="496" height="239" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhOEDtt4WtgaXvY7JE_QbDiBvV7MGg44bwxBDprOwQuUefgsZPjdachjOg6ClIgeuylilkqtiblxBFDxh643jsUoPCCG791w9FmtIiCGo42T-W3M73sUeQH18_eBk71DsVyaDsh0Nduo9MW/s320/Screenshot+from+2019-11-30+12-35-02.png" width="320" /></a></div>
<br />
I also use Visual Studio Code, Slack, WhatsApp Desktop, Solaar and oh-my-zsh, of course. For PDF signing I use Xournal - it's not as easy as with OSX's Preview.app, but it works. GNOME's Evince will get some support in the future, too.<br />
<br />
In the end, the XPS is a more capable MacBook than the original from Apple. With Linux and some tweaks the system runs more stably and smoothly than my 243% more expensive MBP, has a modern look and feel, and offers all the tools I'm used to. Alexander Altenhttp://www.blogger.com/profile/17445753775154988351noreply@blogger.com0tag:blogger.com,1999:blog-4654282947003145260.post-69034120288762999592019-11-07T12:11:00.004+01:002023-06-14T15:47:13.243+02:00Kubernetes - on which node runs what?Just had to find the node a dedicated pod runs on - this one-liner turned out useful:<br />
<br />
<span style="font-family: "courier new" , "courier" , monospace;">kubectl get pod -o=custom-columns=NAME:.metadata.name,STATUS:.status.phase,NODE:.spec.nodeName --all-namespaces</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"><br /></span>
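<span style="font-family: inherit;">If you already know the pod's name, a jsonpath query returns just the node for that single pod (pod and namespace names are placeholders):</span><br /><span style="font-family: "courier new" , "courier" , monospace;">kubectl get pod my-pod -n my-namespace -o jsonpath='{.spec.nodeName}'</span><br />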
<span style="font-family: "courier new" , "courier" , monospace;"><br /></span>Alexander Altenhttp://www.blogger.com/profile/17445753775154988351noreply@blogger.com1tag:blogger.com,1999:blog-4654282947003145260.post-21183161373337868862019-10-21T17:46:00.001+02:002019-10-21T18:49:57.240+02:00OSX - iCloud accounts broken after update to CatalinaYay - that happened to me. Magically I had two iCloud accounts, both not working - and no way to delete one of them via Settings -> Internet Accounts. After a few hours of googling and debugging I got it working again, and as usual I'd like to share the solution; I think I'm not the only one with that problem ;)<br />
Be aware, that's a bomb of a solution: it will reset all internet accounts - but it also unfreezes all the broken stuff:<br />
<br />
<span style="font-family: "courier new" , "courier" , monospace;">sudo -v ; killall -9 accountsd com.apple.iCloudHelper ; defaults delete MobileMeAccounts ; mkdir ~/Library/Accounts/Backup ; mv ~/Library/Accounts/*.sqlite* ~/Library/Accounts/Backup/ ; killall -9 accountsd com.apple.iCloudHelper ; sudo reboot</span><br />
<br />
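<div>The same chain, broken up with comments:</div><div><span style="font-family: "courier new" , "courier" , monospace;"># cache sudo credentials for the final reboot<br />sudo -v<br /># kill the account daemons so they release their state<br />killall -9 accountsd com.apple.iCloudHelper<br /># drop the cached iCloud account preferences<br />defaults delete MobileMeAccounts<br /># move the (possibly corrupted) sqlite account store aside instead of deleting it<br />mkdir ~/Library/Accounts/Backup<br />mv ~/Library/Accounts/*.sqlite* ~/Library/Accounts/Backup/<br /># kill the daemons once more and reboot<br />killall -9 accountsd com.apple.iCloudHelper<br />sudo reboot</span></div><div><br /></div>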
So, why does that happen? Apple uses SQLite as the backend for storing this information, and sometimes this single instance gets corrupted. Have fun!Alexander Altenhttp://www.blogger.com/profile/17445753775154988351noreply@blogger.com1tag:blogger.com,1999:blog-4654282947003145260.post-23433679544437440432019-04-29T19:31:00.000+02:002019-04-29T19:31:12.364+02:00Enabling IoT to establish a sustainable value chain<div>
I wrote an article for CIO Applications; here's an archive of it:</div>
<div>
<br /></div>
IoT devices are getting more and more intelligent and can now create meshed networks by themselves, switching from sensor to actor and transferring information only to their meshed neighbors. For example, a connected car could tell the future home that the homeowner will be home in 5 minutes and that the garage door and the front door need to be unlocked in time, the lights need to be switched on, and the grid operator needs to be informed that the wallbox now charges at 22 kW. In the near future this will happen over direct meshed information cells, operated by always-connected devices, wearables, sensors, actors, mobile devices - in short: everything. And all cloud providers offer dozens of solutions to master these challenges, in one way, another, or a completely different one. <br /><br />Self-organizing mesh networking and communication comes with a permanent flow of information - massive IoT data streams that even classic Big Data frameworks like Hadoop cannot handle in time anymore. Along with this kind of data, the need for data processing changes with the way data is created and ingested. Most analyses will be done on the edge and during the ingestion stream, before the data comes to rest. The data lake should be the central core to store data, but the data needs to be categorized and catalogued together with a proper, well-defined schema and data description. The gravity such data pools generate needs to be harnessed as the motor of data-driven innovation. <br /><br />Why? Batch processing helps to extract predictive value out of stored data, even by analyzing multiple other data points and storage facilities, but not to react in time. Timely information in IoT has business value only at the moment it occurs; for that job, stream processing frameworks like Spark or Kafka are more suitable. Combining both techniques brings unmatched value and impact to the business, driven by the right use of data. Stream processing during data transport closes the gap between rapid data and data at rest. Especially for the more costly IoT edge computing, MQTT-enabled stream processing engines deliver high throughput on all kinds of compute instances, be it in a local data center, hybrid clouds or public clouds. <br /><br />The same holds for available cloud technology. Every cloud provider has its own IoT solution zoo with its own lock-ins, but often they do not fit scaling plans - be it in complexity, missing or poorly implemented parts, or simply a price model that is not compatible with the margin of an IoT-based product. A combined approach of scalable cloud technology (where it fits best) and own development brings the most benefit at an affordable price tag, not to mention the intellectual property a business gains and keeps instead of handing it to providers and therefore competitors. Independent organisations like “Linux Foundation Edge” provide the most useful insight into Open Source projects and initiatives. <br /><br />Just dumping data somewhere without a vision behind it does not help to solve the problems companies face on their digital journey, especially when it comes to questions of revenue from IoT projects. Big Data needs a nearly perfect data management, data rights and data retention process behind it. 
Only this offers the possibility to take full advantage of any kind of data, to open new revenue and sales streams, and to finally see all data-driven activity not as a cost-saving project (as most agencies and vendors promise) but as a revenue-creation project. Using modern cloud technologies moves organizations into the data-centric world, focusing on business and not operations. <br /><br />Analyzing the data is the trickier part here - on the one hand every data point brings valuable input, but on the other hand the unlimited data store also brings vulnerabilities regarding customer insights. I am a bit concerned about 360-degree approaches. First, the value of data collections needs to be questioned: which data is system-relevant for support, maintenance or emergencies, and which is important to generate sustainable revenue. Streaming analysis gives valuable input at the point in time the information is needed to make decisions, but also offers the possibility to route data into different data stores. It is unquestionable that the value of customers is higher than that of the data gathered; implementing a state-of-the-art data ethics catalogue is one of the main tasks analytics needs to cover. <br /><br />We are moving quickly towards a so-called interconnected world; always-connected systems will dominate our future lives, introducing new business models by combining business areas which were previously not even in range of each other. The future CIO needs to know what implications the data has and what uncountable value this data can generate, but also to weigh what threats uncontrollable data collections can cause. Building new data-driven business will be the most exciting job of the future; things never done before are now possible. Embrace this. <div>
<br /></div>
The article can be read online: <div>
<a href="https://cloud.cioapplicationseurope.com/cxoinsights/enabling-iot-to-establish-a-sustainable-value-chain-nid-885.html">https://cloud.cioapplicationseurope.com/cxoinsights/enabling-iot-to-establish-a-sustainable-value-chain-nid-885.html</a></div>
Alexander Altenhttp://www.blogger.com/profile/17445753775154988351noreply@blogger.com4tag:blogger.com,1999:blog-4654282947003145260.post-42293754221466206352019-03-14T19:32:00.000+01:002019-04-29T20:00:58.477+02:00Infinimesh IoT / IIoT platform is starting up!Today is a day we will never forget - infinimesh (<a href="https://www.infinimesh.io/">https://www.infinimesh.io/</a>) is starting and lifting off! Our Kubernetes, Apache Kafka® and graph-based Industrial IoT platform is entering the alpha stage! We have been working like maniacs over the past 14 months to bring a fully flexible, independent, patent-free and vendor-lock-in-free IoT platform to you! Soon it’s your chance to test and try it out: our closed alpha will be open to the public on March 30, 2019 - mark this date in your calendar!<div>
<br /><h3>
An incredible platform comes to life</h3>
We believe smart and connected devices move our society forward. Smart technology uses resources only when they are really necessary and thus prevents waste. On the other hand, when really required, smart things act and hence prevent accidents or simply enable a great user experience. We have started infinimesh 100% Open Source, without patents or closed software. Any software components we have developed - and to this we commit going forward - will be open, forever. Founded by engineers who built the backbone of the European Energy Revolution, infinimesh aims to make industrial and individual IoT secure, available and affordable for all. Infinimesh runs in all cloud offerings, be it public, hybrid or private. All you need is Linux; our platform works in any container environment as well as natively.<div>
<br /></div>
<h3>
Infinimesh on Google Cloud</h3>
We have selected Google Cloud as the strategic partner for our SaaS offering - and from today on the platform is running on GCP! Our SaaS offering, running in Google Cloud, is free for everybody up to 25 devices - ideal for makers, startups and industrial proofs of concept. That leaves enough room to bring ideas to life, test even larger installations, and use the feature-rich ecosystem of GCP to make your idea a successful product.<div>
<span style="color: #666666; font-family: "Helvetica Neue", Helvetica, Arial, sans-serif; font-size: 16px;"><br /></span></div>
<h3>
What can I do with infinimesh IoT on GCP right now?</h3>
<ul>
<li>Connect devices securely via MQTT 3.1.1 (see the sketch after this list)</li>
<li>Transfer desired and reported device states</li>
<li>Manage accounts (Create/Delete)</li>
<li>Manage Namespaces to organize devices and restrict access to devices</li>
<li>Create hierarchically organized objects, e.g. buildings, rooms to organize and model device hierarchies</li>
</ul>
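<div><br /></div><div>For a first smoke test of the MQTT connectivity mentioned above, any MQTT 3.1.1 client will do. A minimal sketch with mosquitto_pub - host, port, topic and certificate paths are placeholders, not infinimesh's actual endpoints:</div><div><br /></div><div><span style="font-family: "courier new" , "courier" , monospace;"># publish a reported device state over TLS with client certificates<br />mosquitto_pub -h mqtt.example.com -p 8883 \<br />--cafile ca.crt --cert device.crt --key device.key \<br />-t 'devices/my-device/state/reported' \<br />-m '{"temperature": 21.5}'</span></div>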
<div>
<br /></div>
<h3>
How does it work?</h3>
Our Kubernetes Operator does the work a real operator would do: it not only installs the whole platform, but also takes care of required cloud/datacenter resources, updates, monitoring and handles incidents like errors. It attempts to resolve as many issues as possible on its own, and notifies human operators when human intervention is required. The operator is the glue between infinimesh and the target installation environment. Our alpha drop focuses on Google Cloud Platform and enables exactly this environment. More supported environments will follow.</div>
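<div><br /></div><div>Conceptually, interacting with such an operator is declarative: you describe the desired state, and the operator reconciles reality towards it. A purely illustrative sketch - the CRD group, kind and fields below are invented for illustration and are not infinimesh's actual API:</div><div><span style="font-family: "courier new" , "courier" , monospace;"># declare the desired platform state; the operator reconciles it<br />kubectl apply -f - <<'EOF'<br />apiVersion: example.infinimesh.io/v1alpha1<br />kind: Platform<br />metadata:<br />  name: demo<br />spec:<br />  cloud: gcp<br />  nodes: 3<br />EOF<br /># watch the operator converge the resource<br />kubectl get platform demo -w</span></div>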
<div>
<br /><h3>
We build features for industrial IoT</h3>
<div>
<br /></div>
<h4>
Device Management</h4>
Powerful but simple framework to visualize clusters of devices within your organization and set permissions up to device level.</div>
<div>
<br /><h4>
Device Shadow</h4>
<div>
Real-time and two-way correspondence for every device in your fleet. Our highly-scalable backend can power millions of devices.</div>
</div>
<div>
<br /><h4>
Timeseries Visualization</h4>
Great telemetry is based on timeseries. infinimesh has timeseries data capabilities built-in and enables meaningful monitoring.</div>
<div>
<br /><h4>
Virtual Twins</h4>
A virtual twin is the digital copy of your physical asset. infinimesh provides virtual twins which give you the possibility to modify your physical device without even touching it.<br /><br /><h4>
Intelligence</h4>
<a href="https://www.infinimesh.io/" target="_blank">infinimesh </a>has Machine Learning and Artificial Intelligence models built-in to rapidly detect anomalies and respond accordingly.<div>
<br /></div>
<h3>
Roadmap and features ahead</h3>
<div>
<br /></div>
<h4>
OPC-UA with full open62541 support (binary protocol with encryption) and BACnet will be available within the next quarter.</h4>
OPC-UA is the leading semantic protocol for Industry 4.0 and opens the full potential of industry-proven stacks like Siemens MindSphere and IBM Watson for Industry. BACnet will also make its way into the platform quite soon; we expect a first drop in the next couple of weeks. BACnet is the most used communications protocol for Building Automation and Control (BAC) networks leveraging the ASHRAE, ANSI and ISO 16484-5 standard, and is used as the protocol stack in various intelligent buildings.</div>
<div>
<br /><h4>
What’s next?</h4>
More exciting news and announcements will follow in the coming months, so use the platform and follow this blog or our channels to never miss any news. We are happy to have you as a user and customer, and we will support you in any idea you have. Drop us a mail, open a Feature Request (<a href="https://github.com/infinimesh/infinimesh/blob/master/.github/ISSUE_TEMPLATE/feature_request.md">https://github.com/infinimesh/infinimesh/blob/master/.github/ISSUE_TEMPLATE/feature_request.md</a>) or contact (<a href="https://infinimesh.io/contact.html#contact">https://infinimesh.io/contact.html#contact</a>) us over our different channels - we are here.</div>
Alexander Altenhttp://www.blogger.com/profile/17445753775154988351noreply@blogger.com0tag:blogger.com,1999:blog-4654282947003145260.post-58810801657198175812017-11-22T13:48:00.000+01:002017-11-22T13:50:09.459+01:00Next Internet comes with IoT The Internet we know is a great space for collaboration, social media and gaming. But when it comes to business or transactions, the power belongs to a few big players. Remember the S3 outage when half of the North American services were offline? Or the Dyn hack which kicked half of the internet out for hours? The next internet could be a blockchain-based independent network, using as many protocols as available, with no single party in control of it, running on top of the Internet.<br />
<br />
<div>
<i>In a nutshell, Blockchain is a decentralized system in which every transaction gets mathematically approved by the members of the system, so every member of that transaction knows about it. The information of the transaction is stored on the distributed servers of the blockchain. That makes manipulation practically impossible, and the transaction is also highly available at any time.</i><br />
<i><br /></i>
IoT devices are getting more and more intelligent and can now create meshed networks by themselves, switching from sensor to actor and transferring information only to their neighbors - for example, to tell the doorknob that the homeowner will be home in 5 minutes with his EV, and that the wallbox and the door need to be unlocked. Right now that is possible via IFTTT, which is an extra protocol and needs manual configuration; in the future this will happen automatically over direct meshed information cells, including status updates.<br />
<br />
When we look at the power of billions of IoT devices, be they sensors, cameras, windmills, cars or whatever, they all carry a CPU and memory as their basis. Connecting all of those together creates a large, highly available, interconnected system: always-on, always accessible, self-connected things which share information about their environment with other things on their own and trigger automated actions, learned from the behavior of the things' environment. Thought of as an ultra-widely available blockchain, those devices will be the next internet. Transactions, information and data will be stored securely on a device, and every device connected to another device will automatically become a member of the global blockchain pool in the future. That brings the power of blockchain to an always-connected network, speeding up the digital disruption every business faces, and allows enterprises to build models based on the decentralized network. Right now, without an economic virtual entity to establish each other's identity, over 2 billion humans are excluded from being part of any financial transaction globally, while others collect data about us, steal identities and commit fraud without giving us a chance to fight back. Those who hold the power and control large parts of the Net can't be disempowered, because they operate large parts of the Net, too.<br />
<br />
That mistake can and will be solved by the next Internet, bringing radical new solutions to the Internet we know. Most of them are based on blockchain technology, like the Smart Contracts Ethereum provides. <br />
<br />
Another technology move could be blockchain-powered AI: immutability, shared decentralized control and trusted audit trails lead to qualitatively better data - and, through more available data, to better algorithms. Real-world modeling works on large volumes of data, such as training on large datasets or high-throughput stream processing systems; for applications of blockchain to AI, blockchain technology with big-data scalability and querying, like the groundbreaking BigchainDB with the public IPDB, is needed. And a globally scaled blockchain unlocks new large-scale opportunities, from better model training through model sharing and a shared global AI model registry to automated wealth for our planet. </div>
Alexander Altenhttp://www.blogger.com/profile/17445753775154988351noreply@blogger.com0tag:blogger.com,1999:blog-4654282947003145260.post-51588409076620386722017-06-16T09:49:00.005+02:002017-06-16T09:50:18.047+02:00The Machine and BigDataHP’s „The Machine“ (1) project is, in my eyes, the most advanced in the IT world, with the simple goal of rethinking the entire computer design. And the plan is ambitious – the first edge devices shall be ready in 2018, an industrialized series in 2020. <br />
<br />
Will “The Machine” really revolutionize an entire industry mostly shaped by IBM? Let's say it could, and probably will, with a high chance of success. <br />
Based on the idea of the memristor (2), the project uses memory-based technology to store data. Nothing new here; new is the non-volatile usage. Data stored in a memristor persists unless the storing bit gets cleared and realigned. NVRRAM (non-volatile resistive RAM) is faster than volatile DDR4 modules (which they use at the moment, until Western Digital can deliver NVRRAM modules) and a factor of 100x faster than current state-of-the-art SSD-based technologies. The newest prototype has 40 nodes with approx. 160 TB of DDR4 RAM and 1,280 cores, connected with X1 PMs (Photonic Modules). Means: pretty fast. Anyhow, just follow appendix (1) for more interesting engineering facts. <br />
The most important consideration is the pure permanent all-integrated storage itself. The layer of attached storage (like HDFS, GFS, Ceph) would simply disappear and merge directly with the computation layer. The principle of “local data first” will surely be part of any fine-tuning approach, but with such a high density of storage that will not really matter. All pieces of computation will be in the same place (cache, volatile and permanent storage combined with fast caching) and work as one homogeneous entity which can hold every state of every piece of data during the whole computation lifecycle. <br />
I just want to consider the changing fundamentals of that idea and what they would mean for data processing. The first big difference: a ternary memristor can store 10 bits where today we store 8, which alone means a noticeably higher data storage density. Additionally, the highly volatile cache a CPU uses during the calculation process will be stored permanently, which allows subsequent processes to reuse the pre-calculated subsets - that would speed up any calculation dramatically. Pattern detection algorithms like MCMC (3), for example, could benefit hugely by simply picking up an already calculated subset and using it in a new chain, which would revolutionize data intelligence in terms of speed and tree generation. I think that's a huge step into the AI world - ultrafast learning algorithms helping mankind to operate highly sensitive environments like deep-space flights, connected cars, CEP networks or decentralized power grids. <br />
<br />
(1) https://www.labs.hpe.com/the-machine<br />
(2) <a href="http://en.wikipedia.org/wiki/Memristor">http://en.wikipedia.org/wiki/Memristor</a><br />
(3) https://en.wikipedia.org/wiki/Markov_chain_Monte_CarloAlexander Altenhttp://www.blogger.com/profile/17445753775154988351noreply@blogger.com0tag:blogger.com,1999:blog-4654282947003145260.post-56458871227904336752017-05-09T20:19:00.002+02:002017-05-09T20:19:57.828+02:00The next stage of BigDataRight now, the terms BigData and Hadoop are used as one and the same - often like the buzzword of buzzwords. And they sound mostly as a last time call, often made by agencies to convince people to start the Hadoop journey before the train leaves the station. Don’t fall into that trap.<br /><br />Hadoop was made by people who worked in the early internet industry, namely Yahoo. They crawled millions of millions web pages every day, but had no system to really get benefit from this information. Dug Cutting created Hadoop, a Map/Reduce framework written in Java and blueprinted by Google in 2004 (1). The main purpose was to work effectively with an ultra-large set of data and group them by topics (just to simplify). <br />Hadoop is now 10 years old. And in these 10 years the gravity of data management, wrangling and analyzing runs faster and faster. New approaches, tools and techniques emerging every day in the brain centered areas called Something-Valley. All of those targeting the way we work and think with data. <br /><br />That describes the main problem of Hadoop itself – it’s designed as an inner working system, providing storage and computation layer at once. And that’s why Hadoop Distributions typically are suggesting to use BareMetal installations in a Datacenter and push companies to create the next silo'd world, promising the good end after leave another one (separate DWH’s without connection between each other). That comes with dramatic costs, operations and a workforce of highly trained engineers, among high costs of connecting systems on premise to the new silo'd DataLake approach, often mixed up with lift-and-shift operations. And here arises the next big problem described as “data gravity”. Data simply sinks down the lake until nobody can even remember what kind of data that was and how the analytical part can be done. And here the Hadoop journey mostly ends. A third issue comes up, driven by agencies to convince companies to invest into Hadoop and Hardware. The talent war. In the end it simply creates the next closed world, but now named a bit fancier.<br /><br /><div>
The world spins on, right now in the direction of public cloud, but heading towards device edge computing (IoT) and DCC (data center on a chip). Additionally, the kind of data changes dramatically, from large chunks of data (petabytes of stored files from archives, crawlers, logfiles) to streamed data delivered by millions upon millions of edge computing devices. Just dumping data into a lake, with no vision beyond getting cheap storage, doesn't help to solve the problems companies face on their digital journey. <br /><br />Along with the kind of data, the need for data analysis changes with the way data is created and ingested. The first analysis will be done on the edge, the second during the ingestion stream, and the next one(s) when the data comes to rest. The data lake is the central core and will be the final endpoint to store data, but the data needs to be categorized and catalogued during stream analytics and stored with a schema and data description. The key point in a so-called Zeta architecture is the independence of each tool, the “slice it down” approach. The fundamental basis is the data-centered business around a data lake, but the choice of tools for getting data to the lake, analyzing it and visualizing it isn't written in stone and is independent of the central core. <br /><br />That opens the possibility to really take advantage of any kind of data, to open new revenue and sales streams, and to finally see all data-driven activity not as a cost-saving project (as most agencies and vendors promise) but as a revenue-creation project. Using modern cloud technologies moves organizations into the data-centric world, focusing on business and not operations. <br /><br />(1) <a href="https://research.google.com/archive/mapreduce.html" target="_blank"><b>https://research.google.com/archive/mapreduce.html</b></a><br /><br /></div>
Alexander Altenhttp://www.blogger.com/profile/17445753775154988351noreply@blogger.com1tag:blogger.com,1999:blog-4654282947003145260.post-39173645732548938662016-12-08T09:30:00.004+01:002016-12-11T17:46:09.711+01:00Hue 3.11 with HDP 2.5Works fine with CentOS / RHEL, I used 6.8 in that case. Epel has to be available, if not, install the repo.<br />
And I ask myself why Hortonworks didn't integrate Hue v3 into their HDP release - I mean, Hue v2 is ancient and dramatically lacking in functionality.<br />
Anyhow, let's get to work.<br />
<br />
<span style="font-family: "courier new" , "courier" , monospace;">sudo wget <a href="http://repos.fedorapeople.org/repos/dchen/apache-maven/epel-apache-maven.repo">http://repos.fedorapeople.org/repos/dchen/apache-maven/epel-apache-maven.repo</a> -O /etc/yum.repos.d/epel-apache-maven.repo</span><br />
<br />
<span style="font-family: "courier new" , "courier" , monospace;">sudo yum install ant gcc krb5-devel mysql mysql-devel openssl-devel cyrus-sasl-devel cyrus-sasl-gssapi sqlite-devel libtidy libxml2-devel libxslt-devel openldap-devel python-devel python-simplejson python-setuptools rsync gcc-c++ saslwrapper-devel libffi-devel gmp-devel apache-maven</span><br />
<br />
<span style="font-family: "courier new" , "courier" , monospace;">sudo mkdir /software; sudo chown hue: /software && cd /software<br />wget <a href="https://github.com/cloudera/hue/archive/master.zip">https://github.com/cloudera/hue/archive/master.zip</a> -O hue.zip && unzip hue.zip; cd hue-master; sudo mkdir -p /usr/local/hue && chown -R hue: /usr/local/hue && make install</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"><br /></span>
<br />
<h4>
HDP config changes:</h4>
<div>
<br /></div>
<div>
<b>Oozie => Custom oozie-site</b><br />
<div>
<span style="font-family: "courier new" , "courier" , monospace;">oozie.service.ProxyUserService.proxyuser.hue.groups *<br />oozie.service.ProxyUserService.proxyuser.hue.hosts *</span><br />
<br />
<b>Hive => Custom webhcat-site</b><br />
<span style="font-family: "courier new" , "courier" , monospace;">webhcat.proxyuser.hue.host *<br />webhcat.proxyuser.hue.groups *</span><br />
<br />
<b>HDFS => Custom core-site</b><br />
<span style="font-family: "courier new" , "courier" , monospace;">hadoop.proxyuser.hue.hosts *<br />hadoop.proxyuser.hue.groups *</span><br />
<br />
Finally, hue.ini needs to be configured to fit the installation; here's an example - I use 8899 as the Hue UI port:<br />
<div>
<br />
<div>
<span style="font-family: "courier new" , "courier" , monospace;">http_port=8899<br />app_blacklist=impala,security<br />hive_server_host=hue3.azure.intern<br />hbase_clusters=(Cluster|hue3.azure.intern:9090)<br />solr_url=<a href="http://hue3.azure.intern:8983/solr/">http://hue3.azure.intern:8983/solr/</a></span><br />
<br />
Lastly, a new security rule for port 8899 has to be created, and the HBase Thrift service has to be started:<br />
<span style="font-family: "courier new" , "courier" , monospace;">nohup hbase thrift start &</span></div>
<div>
<br /></div>
<div>
Configure Hue:</div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;">/usr/local/hue/build/env/bin/hue syncdb</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;">/usr/local/hue/build/env/bin/hue migrate</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;"><br /></span></div>
<div>
Start Hue:</div>
<span style="font-family: "courier new" , "courier" , monospace;">/usr/local/hue/build/env/bin/supervisor -d</span><br />
<div>
<br />
Login per http://your_hue3_host:8899<br />
<br /></div>
<div>
I strongly recommend using MySQL as the backend DB, but for a first test the integrated SQLite instance is fine, too.</div>
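<div><br /></div><div>A minimal sketch of the relevant hue.ini section for a MySQL backend - host, credentials and database name are placeholders, and the database and user must exist beforehand:</div><div><span style="font-family: "courier new" , "courier" , monospace;">[desktop]<br />[[database]]<br />engine=mysql<br />host=127.0.0.1<br />port=3306<br />user=hue<br />password=SECRET<br />name=hue</span></div>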
<div>
<br /></div>
<div>
Issues:<br />
<a href="https://issues.cloudera.org/browse/HUE-4701" target="_blank"><b>HUE-4701</b></a> - recreate the saved queries from sample notebook</div>
</div>
</div>
</div>
Alexander Altenhttp://www.blogger.com/profile/17445753775154988351noreply@blogger.com1tag:blogger.com,1999:blog-4654282947003145260.post-10096892831271050912016-11-29T10:40:00.001+01:002016-12-02T10:07:33.354+01:00Erase HDP 2.x and AmbariSince I now often hack on Hortonworks HDP, I also often need to completely clean out my lab environments to get fresh boxes. I figured writing an ugly shell script is more comfortable than bothering my infra guys to reset the VMs in Azure - which would also reset all my modifications. Bad!<br />
Anyhow, here's the script in case anyone else has a use for it, too.<br />
<br />
<a href="https://github.com/alo-alt/shell/blob/master/rmhdp.bash" target="_blank"><b>https://github.com/alo-alt/shell/blob/master/rmhdp.bash</b></a><br />
<br />
As usual, first stop all Ambari-managed services. I remove Postgres too, since setting up a new DB via the Ambari installer is much faster than dealing with inconsistencies later.<br />
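<div>The core of such a cleanup boils down to something like this - a rough sketch, not the actual script; package and path lists vary per HDP version:</div><div><span style="font-family: "courier new" , "courier" , monospace;"># stop the management components first<br />ambari-server stop; ambari-agent stop<br /># remove the packages, including the Ambari-managed Postgres<br />yum remove -y ambari-server ambari-agent postgresql-server<br /># wipe leftovers so the next install starts clean<br />rm -rf /usr/hdp /var/lib/ambari-server /var/lib/ambari-agent /etc/ambari-server /etc/ambari-agent</span></div>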
Side note: the script is made for RHEL-based distributions ;)Alexander Altenhttp://www.blogger.com/profile/17445753775154988351noreply@blogger.com0tag:blogger.com,1999:blog-4654282947003145260.post-90391224221359123062016-10-24T10:11:00.000+02:002016-10-24T10:11:05.091+02:00FreeIPA and Hadoop Distributions (HDP / CDH)<a href="https://www.freeipa.org/page/Main_Page" target="_blank"><b>FreeIPA</b></a> is the tool of choice when it comes to implementing a security architecture from scratch today. I don't need to praise the advantages of FreeIPA; it speaks for itself. It's the Swiss Army knife of user authentication, authorization and compliance.<br />
<br />
To integrate FreeIPA into Hadoop distributions like Hortonworks' HDP and Cloudera's CDH, some tweaks are necessary, but the outcome is worth it. I assume that the FreeIPA server setup is done and the client tools are distributed; if not, the guide from Hortonworks includes those steps, too.<br />
<br />
For <a href="http://hortonworks.com/" target="_blank"><b>Hortonworks</b></a>, nothing more than the link to the documentation is necessary:<br />
<a href="https://community.hortonworks.com/articles/59645/ambari-24-kerberos-with-freeipa.html" target="_blank"><b>https://community.hortonworks.com/articles/59645/ambari-24-kerberos-with-freeipa.html</b></a><br />
<br />
Ambari 2.4.x has FreeIPA support (<a href="https://issues.apache.org/jira/browse/AMBARI-6432" target="_blank"><b>AMBARI-6432</b></a>) included - experimental, but it works as promised. The setup and rollout are pretty simple and run smoothly via the wizard.<br />
<br />
For <a href="http://www.cloudera.com/" target="_blank"><b>Cloudera</b></a> it takes a bit more handwork, but in the end it also works perfectly and well integrated, just not at the same UI level as Ambari. These steps are necessary to get Cloudera Manager working with FreeIPA:<br />
<br />
1. create the CM principal in FreeIPA (example: <span style="font-family: Courier New, Courier, monospace;">cdh@ALO.ALT</span>)<br />2. retrieve the keytab:<div>
<span style="font-family: Courier New, Courier, monospace;">ipa-getkeytab -r -s freeipa.alo.alt -p cdh -k cdh.keytab</span><br />3. install ipa-admintools on the Cloudera Manager server </div>
<div>
<span style="font-family: Courier New, Courier, monospace;">yum install ipa-admintools -y</span><br />4. place the <a href="https://github.com/alo-alt/FreeIPA/blob/master/getkeytabs.sh" target="_blank"><b>retrieval-script</b></a> (from my GitHub) in <span style="font-family: Courier New, Courier, monospace;">/opt/cloudera/security/getkeytabs.sh</span> (or another path accessible by cloudera manager), make it executable and owned by <span style="font-family: Courier New, Courier, monospace;">cloudera-scm</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace;"> chmod 775 /opt/cloudera/security/getkeytabs.sh && chown cloudera-scm: /opt/cloudera/security/getkeytabs.sh</span><br />5. Start the Kerberos wizard, but stop after verifying the cdh user<br />6. Set the configuration [1] for "<span style="font-family: Courier New, Courier, monospace;">Custom Kerberos Keytab Retrieval Script</span>" to "<span style="font-family: Courier New, Courier, monospace;">/opt/cloudera/security/getkeytabs.sh</span>"<br />7. Resume the Kerberos wizard and follow the steps until it's finished, then restart the cluster.</div>
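<div><br /></div><div>Before resuming the wizard in step 7, it's worth verifying manually that the retrieved keytab actually yields a ticket (realm and file name as used above):</div><div><span style="font-family: Courier New, Courier, monospace;"># obtain a ticket from the keytab and list it<br />kinit -kt cdh.keytab cdh@ALO.ALT<br />klist</span></div>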
<div>
<br /></div>
<div>
<b>Important</b>:</div>
<div>
The FreeIPA client on RHEL 7 / CentOS 7 now uses kernel-keyring-based credential caches, but Java doesn't support them (yet). To switch back to the file-based ticket cache, the config file (<span style="font-family: Courier New, Courier, monospace;">/etc/krb5.conf</span>) needs to be altered by commenting out <span style="font-family: Courier New, Courier, monospace;">default_ccache_name</span>, which lets the client fall back to the default file-based ticket cache:</div>
<div>
<br /></div>
<div>
<span style="font-family: Courier New, Courier, monospace;">cat /etc/krb5.conf</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace;">..</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace;"># default_ccache_name = KEYRING:persistent:%{uid}<br />..</span></div>
<div>
<br /></div>
<div>
<br /></div>
<div>
[1] <a href="http://www.cloudera.com/documentation/enterprise/latest/topics/sg_keytab_retrieval_script.html" target="_blank"><b>http://www.cloudera.com/documentation/enterprise/latest/topics/sg_keytab_retrieval_script.html</b></a></div>
Alexander Altenhttp://www.blogger.com/profile/17445753775154988351noreply@blogger.com1tag:blogger.com,1999:blog-4654282947003145260.post-62009233853887826872016-10-12T08:18:00.001+02:002016-10-12T16:36:53.317+02:00Shifting paradigms in the world of BigDataIn building the next generation of applications, companies and stakeholders need to adopt new paradigms. The need for this shift is predicated on the fundamental belief that building a new application at scale requires tailored solutions to that application’s unique challenges, business model and ROI. Some things change, and I’d like to point to some of those changes.<br />
<br />
<b>Event Driven vs. CRUD</b><br />
Software development traditionally is driven by entity-relation modeling and CRUD operations on that data. The modern world isn’t about data at rest, it’s about being responsive to events in flight. This doesn’t mean that you don’t have data at rest, but that this data shouldn’t be organized in silos.<br />
The traditional CRUD model is neither expressive nor responsive, given the countless available data sources. Since all data is structured somehow, an RDBMS isn't able to store and work with data when the schema isn't known (schema on write). That makes the use of additional, freely available data more like an adventure than a valid business model, given that the schema isn't known and can change rapidly. Event-driven approaches are much more dynamic and open, and they make the data valuable for other processes and applications. The view of the data is defined by the use of the data (schema on read); these views can be created manually (by a data scientist), automatically (Hive and Avro, for example) or exploratively (R, AI, NNW).<br />
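<br />As a concrete illustration of schema-on-read: the raw data stays untouched in the lake, and a view is applied only at read time, so it can evolve with the schema file. A minimal sketch with Hive and Avro - the connect string, paths and names are placeholders:<br /><br /><span style="font-family: Courier New, Courier, monospace;"># an external table: Hive applies the Avro schema at query time,<br /># the files in the lake stay as they are<br />beeline -u jdbc:hive2://localhost:10000 -e "<br />CREATE EXTERNAL TABLE events<br />STORED AS AVRO<br />LOCATION '/data/lake/raw/events'<br />TBLPROPERTIES ('avro.schema.url'='hdfs:///schemas/event.avsc');"</span><br />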
<br />
<b>Centralized vs Siloed Data Stores</b><br />
BigData projects often fail by not using a centralized data store, often referred to as a Data Lake or Data Hub. It's essential to understand the idea of a Data Lake and the need for it. Siloed solutions (aka data warehouse solutions) hold only data which matches the schema and nothing else. Every schema is different, and often it's impossible to use them in new analytic applications. In a Data Lake the data is stored as it is - original, untouched, uncleaned, disaggregated. That makes the entry (or low-hanging fruit) mostly easy - just start to catch all the data you can get. Offload RDBMSs and DWs to your Hadoop cluster and start the journey by playing with that data, even by using 3rd-party tools instead of developing your own tailored apps. Even when this data comes from different DWHs, mining and correlating it often brings treasures to light.<br />
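<br />The classic first offload step is a one-liner with a tool like Sqoop - connect string, credentials and table name are placeholders:<br /><br /><span style="font-family: Courier New, Courier, monospace;"># pull a DWH table into the lake as-is; -P prompts for the password<br />sqoop import --connect jdbc:mysql://dwh.example.com/sales \<br />--username etl -P --table orders \<br />--target-dir /data/lake/raw/orders</span><br />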
<br />
<b>Scaled vs. Monolith Development</b><br />
Custom processing at scale involves tailored algorithms, be they custom Hadoop jobs, in-memory approaches for matching and augmentation, or 3rd-party applications. Hadoop is nothing more (or less) than a framework which allows the user to work within a distributed system, splitting workloads into smaller tasks and letting those tasks run on different nodes. The interfaces to that system are reusable APIs and libraries. That makes the use of Hadoop so convenient: the user doesn't need to take care of the distribution of tasks, nor know exactly how the framework works. Additionally, every piece of written code can be reused by others without building up large code debt.<br />
On the other hand, Hadoop gives the user an interface to configure the framework to match the application's needs dynamically at runtime, instead of having static configurations like traditional processing systems. <br />
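<br />How thin that interface can be is easy to see with Hadoop Streaming, where plain shell commands become the distributed tasks - a toy word count, assuming a streaming jar is installed (the jar and HDFS paths are placeholders):<br /><br /><span style="font-family: Courier New, Courier, monospace;"># mapper emits one word per line; the framework sorts between map and reduce;<br /># the reducer then counts adjacent duplicates<br />hadoop jar /usr/lib/hadoop-mapreduce/hadoop-streaming.jar \<br />-input /data/lake/raw/logs -output /tmp/wordcount \<br />-mapper 'tr -s " " "\n"' -reducer 'uniq -c'</span><br />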
<br />
Having these principles in mind when planning and architecting new applications based on Hadoop or similar technologies doesn't guarantee success, but it lowers the risk of getting lost. It's worth noting that every success had many failures before it. Not trying to create something new is the biggest mistake we can make, and will sooner or later result in a total loss.Alexander Altenhttp://www.blogger.com/profile/17445753775154988351noreply@blogger.com1