Skip to main content

Shifting paradigms in the world of BigData

Listen:

In building the next generation of applications, companies and stakeholders need to adopt new paradigms. The need for this shift is predicated on the fundamental belief that building a new application at scale requires tailored solutions to that application’s unique challenges, business model and ROI. Some things change, and I’d like to point to some of that changes.

Event Driven vs. CRUD

Software development traditionally is driven by entity-relation modeling and CRUD operations on that data. The modern world isn’t about data at rest, it’s about being responsive to events in flight. This doesn’t mean that you don’t have data at rest, but that this data shouldn’t be organized in silos.
The traditional CRUD model is neither expressive nor responsive, given by the amount of uncountable available data sources. Since all data is structured somehow, an RDBMS isn't able to store and work with data when the schema isn't known (schema on write). That makes the use of additional free available data more like an adventure than a valid business model, given that the schema isn't known and can change rapidly. Event driven approaches are much more dynamical, open and make the data valuable for other processes and applications. The view to the data is defined by the use of the data (schema on read). This views can be created manually (Data Scientist), automatically (Hive and Avro for example) or explorative (R, AI, NNW).

Centralized vs Siloed Data Stores

BigData projects often fail by not using a centralized data store, often refereed as Data Lake or Data Hub. It’s essential to understand the idea of a Data Lake and the need for it. Siloed solutions (aka data warehouse solutions) have only data which match the schema and nothing else. Every schema is different, and often it’s impossible to use them in new analytic applications. In a Data Lake the data is stored as it is - originally, untouched, uncleaned, disaggregated. That makes the entry (or low hanging fruit) mostly easy - just start to catch all data you can get. Offload RDBMS and DWs to your Hadoop cluster and start the journey by playing with that data, even by using 3rd party tools instead to develop own tailored apps. Even when this data comes from different DWH's, mining and correlating them often brings treasures to light.

Scaled vs. Monolith Development

Custom processing at scale involves tailored algorithms, be they custom Hadoop jobs, in-memory approaches for matching and augmentation or 3rd party applications. Hadoop is nothing more (or less) than a framework which allows the user to work within a distributed system, splitting workloads into smaller tasks and let those tasks run on different nodes. The interface to that system are reusable API's and Libraries. That makes the use of Hadoop so convenient - the user doesn't need to take care about the distribution of tasks nor to know exactly how the framework works. Additionally, every piece of written code can be reused by others without having large code depts.

On the other hand Hadoop gives the user an interface to configure the framework to match the application needs dynamically on runtime, instead of having static configurations like traditional processing systems.

Having this principles in mind by planning and architecting new applications, based on Hadoop or similar technologies doesn’t guarantee success, but it lowers the risk to get lost. Worth to note that every success has had many failures before. Not trying to create something new is the biggest mistake we can made, and will result sooner or later in a total loss.

Comments

Popular posts from this blog

Deal with corrupted messages in Apache Kafka

Under some strange circumstances it can happen that a message in a Kafka topic is corrupted. This happens often by using 3rd party frameworks together with Kafka. Additionally, Kafka < 0.9 has no lock at Log.read() at the consumer read level, but has a lock on Log.write(). This can cause a rare race condition, as described in KAKFA-2477 [1]. Probably a log entry looks like: ERROR Error processing message, stopping consumer: (kafka.tools.ConsoleConsumer$) kafka.message.InvalidMessageException: Message is corrupt (stored crc = xxxxxxxxxx, computed crc = yyyyyyyyyy Kafka-Tools Kafka stores the offset of every consumer in Zookeeper. To read out the offsets, Kafka provides handy tools [2]. But also zkCli.sh can be used, at least to display the consumer and the stored offsets. First we need to find the consumer for a topic (> Kafka 0.9): bin/kafka-consumer-groups.sh --zookeeper management01:2181 --describe --group test Prior to Kafka 0.9 the only possibility to get this inform

Hive query shows ERROR "too many counters"

A hive job face the odd " Too many counters:"  like Ended Job = job_xxxxxx with exception 'org.apache.hadoop.mapreduce.counters.LimitExceededException(Too many counters: 201 max=200)' FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.MapRedTask Intercepting System.exit(1) These happens when operators are used in queries ( Hive Operators ). Hive creates 4 counters per operator, max upto 1000, plus a few additional counters like file read/write, partitions and tables. Hence the number of counter required is going to be dependent upon the query.  To avoid such exception, configure " mapreduce.job.counters.max " in mapreduce-site.xml to a value above 1000. Hive will fail when he is hitting the 1k counts, but other MR jobs not. A number around 1120 should be a good choice. Using " EXPLAIN EXTENDED " and " grep -ri operators | wc -l " print out the used numbers of operators. Use this value to tweak the MR s

Life hacks for your startup with OpenAI and Bard prompts

OpenAI and Bard   are the most used GenAI tools today; the first one has a massive Microsoft investment, and the other one is an experiment from Google. But did you know that you can also use them to optimize and hack your startup?  For startups, reating pitch scripts, sales emails, and elevator pitches with one (or both) of them helps you not only save time but also validate your marketing and wording. Curios? Here a few prompt hacks for startups to create / improve / validate buyer personas, your startups mission / vision statements, and USP definitions. First Step: Introduce yourself and your startup Introduce yourself, your startup, your website, your idea, your position, and in a few words what you are doing to the chatbot: Prompt : I'm NAME and our startup NAME, with website URL, is doing WHATEVER. With PRODUCT NAME, we aim to change or disrupt INDUSTRY. Bard is able to pull information from your website. I'm not sure if ChatGPT can do that, though. But nevertheless, now