Data Engineering Weekly #33

Published in

Data Engineering Weekly

Mar 22, 2021

This story is cross-posted from Data Engineering Weekly. Please subscribe to the Data Engineering Newsletter for the latest updates.

www.dataengineeringweekly.com

Welcome to the 33rd edition of the data engineering newsletter. This week’s release is a new set of articles that focus on Michael Stonebraker’s Top 10 Big Data Blunders, Stanford University’s AI Index Report 2021, Maxime’s The Future of Business Intelligence is Open Source, Mehdi’s data engineering skills report, the Apache Airflow survey 2020, DataMinded’s things to consider for Argo Workflows, Spotify’s new experimentation strategy, LightUp’s hidden data outages, Confluent’s real-time analytics with Kafka & Pinot, Pinterest’s Flink deployment framework, AWS’s new features on Hudi, and Trino’s new window function enhancements.

Michael Stonebraker: Top 10 Big Data Blunders

Some of the recent articles and conversations around data modeling remind me of Michael Stonebraker’s talk about the top 10 big data blunders. It is an excellent talk to watch or re-watch.

Stanford University: The AI Index Report - Measuring trends in Artificial Intelligence

Stanford University published the AI Index Report for 2021, focusing on AI development in the USA. It’s an exciting read, and the top 9 takeaways are:

“Drugs, Cancer, Molecular, Drug Discovery” received the greatest amount of private AI investment in 2020, with more than USD 13.8 billion, 4.5 times higher than 2019.
In 2019, 65% of graduating North American PhDs in AI went into industry, up from 44.4% in 2010.
AI systems can now compose text, audio, and images to a sufficiently high standard.
The diversity challenge — In 2019, 45% of new U.S. resident AI Ph.D. graduates were white — by comparison, 2.4% were African American, and 3.2% were Hispanic.
China overtakes the US in AI journal citations.
The majority of the US AI Ph.D. grads are from abroad — and they’re staying in the US.
Surveillance technologies are fast, cheap, and increasingly ubiquitous.
AI ethics lacks benchmarks, and consensus remains a challenge.
AI gained attention in Congress: The 116th Congress is the most AI-focused congressional session in history. The number of mentions of AI in the Congressional Record is more than triple that of the 115th Congress.

https://aiindex.stanford.edu/report/

Maxime Beauchemin: The Future of Business Intelligence is Open Source

The open-source database and data processing ecosystem revolutionized software development. The author raises an interesting question: when it comes to BI platforms, why are they mostly closed source?

The Future of Business Intelligence is Open Source

While “software is [still actively] eating the world”, it’s also clear that open source is taking over software.

maximebeauchemin.medium.com

Mehdi Ouazza: What are the most requested technical skills in the data job market? Insights from 35k+ data jobs ads

It is an insightful hack to understand the skills in demand in data engineering. SQL and Python are the top skills to develop if you’re into data science or data engineering. The author’s take on Python over Scala for data engineering resonates well with the Spark ecosystem’s current development.

What are the most requested technical skills in the data job market? Insights from 35k+ data jobs ads

Insights from the data skills radar, scanning daily data jobs ads

medium.datadriveninvestor.com

Apache Airflow: Airflow survey 2020

Apache Airflow published the 2020 Airflow survey results. Some of the exciting trends to highlight:

13.79% adoption among the general developer community outside of data engineers.
85% of Airflow users are likely or very likely to recommend Airflow.
The Airflow Local executor is more popular than the Kubernetes executor.
Slack and GitHub are the go-to places for technical questions, 2x higher than Stack Overflow!

Airflow Survey 2020

World of data processing tools is growing steadily. Apache Airflow seems to be already considered as crucial component…

airflow.apache.org

DataMinded: What to consider before choosing Argo Workflow?

Argo Workflows is an open-source, container-native workflow engine for orchestrating parallel jobs on Kubernetes. The blog walks through a basic workflow in Argo and the pros and cons of Argo Workflows from a data engineering perspective.

What to consider before choosing Argo Workflow?

To go full Kubernetes-native or not?

medium.com

Spotify: Spotify’s New Experimentation Coordination Strategy

Spotify writes about its new experimentation coordination strategy and how it migrated the experimentation platform to use Bucket Reuse for all experiments. The narration on handling exclusive and nonexclusive experiments and the concept of paths is exciting to read.

Spotify's New Experimentation Coordination Strategy

At Spotify we run hundreds of experiments at any given time. Coordinating these experiments, i.e., making sure the…

engineering.atspotify.com
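To make the idea of bucket reuse concrete, here is a minimal sketch in Python. It is an illustration under stated assumptions, not Spotify's implementation: the bucket count, the salt, and the names `bucket_for` and `BucketPool` are hypothetical. Users are hashed once into a fixed set of buckets, exclusive experiments draw disjoint sets of buckets from a shared pool, and the buckets return to the pool when an experiment stops.

```python
import hashlib
import random

NUM_BUCKETS = 1000  # assumption: a fixed number of buckets chosen up front


def bucket_for(user_id: str, salt: str = "demo-salt") -> int:
    """Deterministically map a user to one bucket (hypothetical hashing scheme)."""
    digest = hashlib.sha256(f"{salt}:{user_id}".encode()).hexdigest()
    return int(digest, 16) % NUM_BUCKETS


class BucketPool:
    """Hands out disjoint bucket sets to exclusive experiments and reuses them later."""

    def __init__(self, num_buckets: int, seed: int = 0):
        self.free = set(range(num_buckets))
        self.rng = random.Random(seed)
        self.assignments = {}  # experiment name -> set of bucket ids

    def start_exclusive(self, name: str, size: int) -> set:
        """Sample `size` free buckets so no other exclusive experiment shares them."""
        if size > len(self.free):
            raise ValueError("not enough free buckets")
        chosen = set(self.rng.sample(sorted(self.free), size))
        self.free -= chosen
        self.assignments[name] = chosen
        return chosen

    def stop(self, name: str) -> None:
        """Return an experiment's buckets to the pool so they can be reused."""
        self.free |= self.assignments.pop(name)

    def experiments_for(self, user_id: str) -> list:
        """List the running experiments whose buckets contain this user."""
        bucket = bucket_for(user_id)
        return [n for n, buckets in self.assignments.items() if bucket in buckets]
```

A nonexclusive experiment could sample from all buckets rather than only the free ones; that variation is left out to keep the sketch short.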

LightUp: Your Data Keeps Breaking Silently: Isolated Incidents or a New Category of Problems

LightUp writes an exciting two-part blog on a new category of data outage termed “The Hidden Data Outages.” The blog narrates some case studies where hidden data outages caused significant business loss, and it makes the call for a dedicated data monitoring platform.

If you consider the effect of hidden data outages on every company's financial earnings call, our entire economy essentially depends on untested SQL!

Your Data Keeps Breaking Silently: Isolated Incidents or a New Category of Problems? (Part 2)

Modern applications have a new dataplane where data can break independent of infrastructure health— creating a new…

blog.lightup.ai

Your Data Keeps Breaking Silently: Isolated Incidents or a New Category of Problems? (Part 1)

Business-critical data issues have been appearing everywhere over the last few years. What are they, and why are they…

blog.lightup.ai

Confluent: Under the Hood of Real-Time Analytics with Apache Kafka and Pinot

Apache Pinot is a distributed analytics data store that is rapidly becoming the go-to solution for building real-time analytical applications at scale. The blog narrates how real-time ingestion from Kafka to Apache Pinot works and covers the internal implementation of mutable vs. immutable segments, query processing, and memory management.

Real-Time Analytics with Apache Kafka and Pinot

Real-time analytics has become the need of the hour for modern internet companies. The ability to derive internal…

www.confluent.io

Pinterest: Pinterest Flink Deployment Framework

Pinterest writes about its Flink deployment framework and its integration with the CI/CD pipeline. The blog narrates some of the best practices, such as job deduplication, state preservation before deploying a new version, and a focus on reversibility.

Pinterest Flink Deployment Framework

Rainie Li | Software Engineer, Stream Processing Platform Team

medium.com

AWS: New features from Apache Hudi available in Amazon EMR

AWS highlighted the new Apache Hudi features available in Amazon EMR. The ability to convert existing Parquet files to the Hudi format and the seamless integration with AWS Database Migration Service are some of the standout features. Redshift Spectrum’s ability to query Apache Hudi datasets is an exciting trend to watch.

New features from Apache Hudi available in Amazon EMR | Amazon Web Services

Apache Hudi is an open-source data management framework used to simplify incremental data processing and data pipeline…

aws.amazon.com

Trino: Introducing new window features

The SQL window function is a vital feature for analytics queries. Trino writes about its new window function improvements: full support for the RANGE frame type, support for the GROUPS frame type, and the addition of the WINDOW clause for defining named windows.

Introducing new window features

In Trino, we are thrilled to get feedback and feature requests from our fantastic community, and we're tirelessly…

trino.io
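The difference between the frame types is easier to see on a small example. The sketch below is a pure-Python illustration of standard SQL frame semantics, not Trino's implementation. It computes a running sum over an ascending sort key with a frame of "1 PRECEDING to CURRENT ROW", interpreted as ROWS, RANGE, and GROUPS. The function names are hypothetical, and the input is assumed to be already sorted.

```python
from itertools import groupby


def rows_frame_sum(keys, preceding):
    """ROWS: the frame is the current row plus N physical rows before it."""
    return [sum(keys[max(0, i - preceding): i + 1]) for i in range(len(keys))]


def range_frame_sum(keys, preceding):
    """RANGE: the frame is every row whose key lies in [current - N, current]."""
    return [sum(k for k in keys if cur - preceding <= k <= cur) for cur in keys]


def groups_frame_sum(keys, preceding):
    """GROUPS: the frame is the current peer group plus N peer groups before it."""
    groups = [list(g) for _, g in groupby(keys)]  # peers share the same key
    out = []
    for gi, group in enumerate(groups):
        frame = groups[max(0, gi - preceding): gi + 1]
        total = sum(sum(g) for g in frame)
        out.extend([total] * len(group))  # all peers see the same frame
    return out


keys = [1, 1, 2, 4, 4, 5]  # sort keys, assumed ascending
rows_result = rows_frame_sum(keys, 1)      # [1, 2, 3, 6, 8, 9]
range_result = range_frame_sum(keys, 1)    # [2, 2, 4, 8, 8, 13]
groups_result = groups_frame_sum(keys, 1)  # [2, 2, 4, 10, 10, 13]
```

The rows with key 4 show the distinction: RANGE looks back one unit in value (keys 3 to 4, so only the 4s), while GROUPS looks back one distinct key (the 2 and the 4s).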

Links are provided for informational purposes and do not imply endorsement. All views expressed in this newsletter are my own and do not represent current, former, or future employers’ opinions.