2pk03 over AI, ML, BigData and data processing

Posts

When to Choose ETL vs. ELT for Maximum Efficiency

By Alexander Alten - March 28, 2024

ETL has been the traditional approach, where data is extracted, transformed, and then loaded into the target database. ELT flips this process: data is extracted and loaded directly into the target system before it is transformed. While ETL has been the go-to for many years, ELT is emerging as the preferred choice for modern data pipelines. This is largely due to ELT's speed, scalability, and suitability for large, diverse datasets generated by many different tools and systems, such as CRM and ERP datasets, log files, edge computing, or IoT. The list goes on, of course. Data Engineering Landscape: Data engineering is the new kind of DevOps. With the exponential growth in data volume and sources, the need for efficient and scalable data pipelines, and therefore for data engineers, has become the new standard. In the past, limitations in compute power, storage capacity, and network bandwidth made the famous three-word "let's move data around" phrase Extract, Transform, Load (ETL) the ...
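The difference in ordering between the two approaches can be sketched in a few lines of Python. This is only an illustration and is not taken from the post: the sample rows, the table names (`orders_etl`, `orders_raw`, `orders_elt`), the helper names, and the use of an in-memory SQLite database as the "target system" are all assumptions made for the example.

```python
import sqlite3

# Hypothetical raw records as they might arrive from a source system.
RAW_ROWS = [
    ("alice", " 42.50 ", "2024-03-01"),
    ("bob", "17", "2024-03-02"),
    ("carol", "n/a", "2024-03-02"),  # malformed amount
]


def _parse_amount(text):
    """Return the amount as a float, or None if it cannot be parsed."""
    try:
        return float(text.strip())
    except ValueError:
        return None


def run_etl(conn, rows):
    """ETL: clean the data in application code, then load only the result."""
    conn.execute("CREATE TABLE orders_etl (customer TEXT, amount REAL, day TEXT)")
    cleaned = []
    for customer, amount, day in rows:
        value = _parse_amount(amount)
        if value is not None:  # malformed rows never reach the target
            cleaned.append((customer, value, day))
    conn.executemany("INSERT INTO orders_etl VALUES (?, ?, ?)", cleaned)
    return conn.execute(
        "SELECT customer, amount FROM orders_etl ORDER BY customer"
    ).fetchall()


def run_elt(conn, rows):
    """ELT: load the raw data as-is, then transform inside the target with SQL."""
    conn.execute("CREATE TABLE orders_raw (customer TEXT, amount TEXT, day TEXT)")
    conn.executemany("INSERT INTO orders_raw VALUES (?, ?, ?)", rows)
    # The transformation runs where the data lives; the raw table is kept.
    conn.execute(
        """
        CREATE TABLE orders_elt AS
        SELECT customer, CAST(TRIM(amount) AS REAL) AS amount, day
        FROM orders_raw
        WHERE TRIM(amount) GLOB '[0-9]*'
        """
    )
    return conn.execute(
        "SELECT customer, amount FROM orders_elt ORDER BY customer"
    ).fetchall()


if __name__ == "__main__":
    connection = sqlite3.connect(":memory:")
    print("ETL:", run_etl(connection, RAW_ROWS))
    print("ELT:", run_elt(connection, RAW_ROWS))
```

Both functions end up with the same cleaned rows. The practical difference in this sketch is that the ELT variant keeps the untouched `orders_raw` table in the target, so a different transformation can be applied later without extracting the data again.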

Life hacks for your startup with OpenAI and Bard prompts

By Alexander Alten - July 15, 2023

OpenAI and Bard are the most used GenAI tools today; the first one has a massive Microsoft investment, and the other one is an experiment from Google. But did you know that you can also use them to optimize and hack your startup? Even creating pitch scripts, sales emails, and elevator pitches with one (or both) of them helps you not only save time but also validate your marketing and wording. Curious? Here are a few prompt hacks for startups to create, improve, or validate buyer personas, your startup's mission and vision statements, and USP definitions. Introduce yourself and your startup: Introduce yourself, your startup, your website, your idea, your position, and, in a few words, what you are doing to the chatbot. Prompt: I'm NAME and our startup NAME, with website URL, is doing WHATEVER. With PRODUCT NAME, we aim to change or disrupt INDUSTRY. Bard is able to pull information from your website. I'm not sure if ChatGPT can do that, though. Nevertheless, now you have laid a grea...

Indexing PostgreSQL with Apache Solr

By Alexander Alten - July 12, 2023

Searching and filtering large IP address datasets within PostgreSQL can be challenging. Why? Databases excel at data storage and structured queries, but often struggle with full-text search and complex analysis. Apache Solr, a high-performance search engine built on top of Lucene, is designed to handle these tasks with remarkable speed and flexibility. What do we need? A running PostgreSQL database with a table containing IP address information (named "ip_loc" in our example), and a basic installation of Apache Solr. Setting up Apache Solr. Create a Solr Core: solr create -c ip_data -d /path/to/solr/configsets/ Define the Schema (schema.xml): <field name="start_ip" type="ip" indexed="true" stored="true"/> <field name="end_ip" type="ip" indexed="true" stored="true"/> <field name="iso2" type="string" indexed="true" ...

Some fun with Apache Wayang and Spark / Tensorflow

By Alexander Alten - January 18, 2023

Apache Wayang is an open-source Federated Learning (FL) framework developed by the Apache Software Foundation. It provides a platform for distributed machine learning, with a focus on ease of use and flexibility. It supports multiple FL scenarios and provides a variety of tools and components for building FL systems. It also includes support for various communication protocols and data formats, as well as integration with other Apache projects such as Apache Kafka and Apache Pulsar for data streaming. The project aims to make it easier to develop and deploy machine learning models in decentralized environments. It's important to note that these are just examples, and they may not be the right way for your project to interact with Apache Wayang; you may need to check the documentation of the Apache Wayang project ( https://wayang.apache.org ) to see how to interact with it. I just want to point out how easy it is to use different languages to interact between Wayang and Spark. Also, you need to mak...

Get Apache Wayang ready to test within 5 minutes

By Alexander Alten - September 22, 2022

Hey followers, I often get asked how to get Apache Wayang ( https://wayang.apache.org ) up and running without having a full big data processing system behind it. We heard you, and we built a full-fledged Docker container called BDE (Blossom Development Environment), which is basically Wayang. Here's the repo: https://github.com/databloom-ai/BDE I made a short screencast on how to get it running with Docker on OS X, and we have also made two hands-on videos to explain the first steps. Let's start with the basics: Docker. Get the whole platform with: docker pull ghcr.io/databloom-ai/bde:main At the end, the Jupyter notebook address is shown; control-click on it (OS X), and the browser should open and log you in automatically. Voila, done. You now have a fully working Wayang environment, and we prepared three notebooks to make it easier to dive in. Watch our development tutorial video (part 1) to get a better understanding of what Wayang can do, and what it cannot. Click the video below:

Combined Federated Data Services with Blossom and Flower

By Alexander Alten - July 27, 2022

When it comes to Federated Learning frameworks, we typically find two leading open source projects: Apache Wayang [2] (maintained by databloom) and Flower [3] (maintained by Adap). At first glance, both frameworks seem to do the same thing. But, as usual, a second look tells another story. How does Flower differ from Wayang? Flower is a federated learning system, written in Python, that supports a large number of training and AI frameworks. The beauty of Flower is the strategy concept [4]; the data scientist can define which dedicated framework is used, and how. Flower delivers the model to the desired framework, watches the execution, gets the calculations back, and starts the next cycle. That makes Federated Learning in Python easy, but at the same time limits its use to platforms supported by Python. Flower has, as far as I could see, no data query optimizer; an optimizer understands the code and splits the model into smaller pieces to use multiple frameworks at the same ti...
