This story is cross-posted from Data Engineering Weekly. Please subscribe to the Data Engineering Newsletter for the latest updates.

www.dataengineeringweekly.com

Welcome to the 34th edition of the data engineering newsletter. This week’s release is a new set of articles that focus on Google’s massively parallel graph computation, Uber’s data journey, Hyperight’s “Is data mesh right for your organization?”, Lyft’s ML feature infrastructure, Flyte joining LF AI & Data, PayPal’s secure data movement, the data pipeline at Samsara, the Gousto data team’s best of 2020, Cloudflare’s anomaly detection, Instacart’s take on large-scale labeling, the Dagster 0.11 release notes, and why Kafka is fast.

Google: Massively Parallel Graph Computation — From Theory to Practice

Graph computation is widely used for various data science purposes, from ranking web pages by popularity and mapping out social…


Welcome to the 33rd edition of the data engineering newsletter. This week’s release is a new set of articles that focus on Michael Stonebraker’s Top 10 Big Data Blunders, Stanford University’s AI index report 2021, Maxime’s The future of the Business Intelligence is open source, Mehdi’s data engineering skills report, Apache Airflow survey 2020, DataMinded’s things to consider for Argo Workflow, Spotify’s new experimentation strategy, LightUp’s hidden data outages, Confluent’s real-time analytics with Kafka & Pinot, Pinterest’s Flink deployment framework, AWS’s new feature on Hudi, and Trino’s new window function enrichments.

Michael Stonebraker: Top 10 Big Data Blunders

Some…


Welcome to the 32nd edition of the data engineering newsletter. This week’s release is a new set of articles that focus on Picnic’s Data Vault modeling, Mihaileric’s why we need more data engineers, Microsoft’s onboarding checklist for data scientists, Netflix’s data movement with Google Services, Redpoint Ventures’ data feedback loop with SaaS applications, DoorDash’s declarative real-time feature engineering, Uber’s application of ML to internal auditing, Pinterest’s ML techniques to fight misinformation, Monte Carlo’s new data quality rules, and Anna Anisienia’s take on Airflow task group design.

Let’s start this week with some fun but also the sad reality of the data…


Welcome to the 31st edition of the data engineering newsletter. This week’s release is a new set of articles that focus on Redpoint Ventures’ Reverse ETL, JP Morgan’s data mesh implementation, DBT’s modern data stack, ValidIO’s ML & Data trends 2021, Airbnb’s visualizing data timeline, Pinterest’s lessons learned from running Kafka at scale, Confluent’s 42 things to do once Zookeeper is gone, LinkedIn’s solving the data integration problem with Apache Gobblin, Facebook’s mitigating the effect of silent data corruption, Reddit’s scaling of its reporting system, and LinkedIn’s GraphQL implementation of DataHub.

Redpoint Ventures: Reverse ETL — A Primer

Over the last decade…


Welcome to the 30th edition of the data engineering newsletter. This week’s release is a new set of articles that focus on Uber’s schema-agnostic log analytics platform, Google’s open-source model search system, Intuit’s Data Mesh strategy, Salesforce’s secure data intelligence platform, Netflix’s composable data pipeline, BrightWind’s wind analytical data hub, Apache Pinot’s star-tree indexing, Squarespace’s A/B testing platform, a Snowflake vs. Redshift comparison, and an overview of the modern analytical stack.

Uber: Fast and Reliable Schema-Agnostic Log Analytics Platform

Elasticsearch provides a dynamic schema inference to improve the performance of log indexing. The dynamic type inference often leads to type…


Welcome to the 29th edition of the data engineering newsletter. This week’s release is a new set of articles that focus on Google’s research paper on Data Cascades in High-Stakes AI, Fiddler Labs’ debugging of ML model performance, Monte Carlo’s Data Observability Using SQL, Airbnb’s Superset adoption, Apache Kylin’s Evolution of Precomputation, Spotify’s Sorted Merge Bucket implementation, DoorDash’s effective data science communication, Funding Societies’ Data Governance journey, QueryClick’s Self-Serve analytical journey, and Databricks’ Delta Lake 0.8.

Google: "Everyone wants…


Welcome to the 28th edition of the data engineering newsletter. This week’s release is a new set of articles that focus on Google’s ML for computer architecture, Microsoft’s PyTorch vs. TensorFlow, Capital One’s Time travel offline ML evaluation frameworks, Alibaba Cloud’s Data Lake introduction, PayPal’s Next-Gen data movement framework, Apache Pinot’s integration story with Presto, Gradient Flow’s growing importance of Metadata, Metadata Day 2020 overview, Monte Carlo Data’s data pipeline SLA, and TDD with Apache Airflow.

Google: Machine…


Welcome to the fourth edition of the data engineering newsletter. This week’s release is a new set of articles that focus on data orchestration, ML applications, tuning data workload, and Kafka on Kubernetes.

Airflow is a huge step forward over loosely coupled cron jobs for running data pipelines. Dagster, a data-aware, typed, self-describing, logical orchestration graph, takes data orchestration to the next level by focusing on local development, testable code before production, and linking data assets…


I’m excited to read about the GPU-accelerated streaming platform this week. NVIDIA writes about cuStreamz, the first GPU-accelerated streaming data processing library. Written in Python, it is built on top of RAPIDS, the GPU accelerator for data science libraries.

Continuing with GPU-accelerated stream processing, Apache Flink 1.11 introduces a new External Resource Framework, which allows you to request external resources from the underlying resource management systems (e.g., Kubernetes) and accelerate your workload with those resources. …


Apache Pinot is gaining momentum as a real-time OLAP system for data engineering needs. In this blog post, Sapient narrates its experience benchmarking Apache Pinot. An ingestion rate exceeding 120k entries/second on a single node is impressive.

Netflix open-sourced Metaflow (metaflow.org) in December 2019. Metaflow follows a layered architecture for running data workloads, in contrast to Airflow’s tightly coupled scheduler architecture. In this post, Netflix explains how the scheduler layer integrates with AWS Step Functions.

Ananth Packkildurai

I break things for a living.