Data Engineering Weekly #1

Published in Data Engineering Weekly, Feb 3, 2021

This story is cross-posted from Data Engineering Weekly. Please subscribe to the Data Engineering Newsletter for the latest updates.

www.dataengineeringweekly.com

Privacy is often an afterthought in the data world. There are numerous ways one might betray someone’s privacy, and they show up in most everyday situations. The New York Times wrote about how it thinks about data privacy. The post is a good overview of privacy, with useful links and a description of the steps NYT is taking in marketing and advertising to protect its users’ privacy.

How The New York Times Thinks About Your Privacy

Online privacy is complex, but it doesn’t have to be.

open.nytimes.com

The popularity of microservices makes it harder to enforce data privacy policies over time. Data often flows through an organization and is duplicated multiple times without any accountability. Tracing the data flow and implementing security policies is a challenge. Facebook writes about how a scalable data classification system helps enforce data policies.

Scalable data classification for security, privacy - Facebook Engineering

We've built a data classification system that uses multiple data signals, a scalable system architecture, and machine…

engineering.fb.com

Poor data quality leads to unusable data. "How much can you trust your data?" is a question on the mind of every data consumer. Thoughtworks wrote an interesting article on the topic, with an introduction to Deequ, an open-source library from AWS Labs.

How much can you trust your data?

Data is the fuel for intelligent decision making for both humans and machines. Just like high quality fuel ensures that…

www.thoughtworks.com
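
To make the idea of measurable data quality concrete, here is a minimal sketch of two common checks, completeness and uniqueness, in plain Python. It does not use Deequ or its API; the function names, the `orders` sample, and the column names are all hypothetical and exist only for illustration.

```python
from typing import Any


def completeness(rows: list[dict[str, Any]], column: str) -> float:
    """Fraction of rows where `column` is present and not None."""
    if not rows:
        return 0.0
    non_null = sum(1 for row in rows if row.get(column) is not None)
    return non_null / len(rows)


def is_unique(rows: list[dict[str, Any]], column: str) -> bool:
    """True if no non-null value of `column` appears more than once."""
    values = [row[column] for row in rows if row.get(column) is not None]
    return len(values) == len(set(values))


# Hypothetical sample data: one missing amount, one duplicated order_id.
orders = [
    {"order_id": 1, "amount": 20.0},
    {"order_id": 2, "amount": None},
    {"order_id": 2, "amount": 35.5},
    {"order_id": 4, "amount": 12.0},
]

report = {
    "amount_completeness": completeness(orders, "amount"),
    "order_id_unique": is_unique(orders, "order_id"),
}
print(report)
```

A real pipeline would likely compute such metrics per batch and compare them against thresholds before publishing the data downstream.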

The Spark + AI Summit 2020 ended in the last week of June 2020. In case you missed it, all the slides and talks are available on the summit page.

Spark AI NA Summit 2020 Schedule - Databricks

Databricks Spark AI Summit is the world's largest data & machine learning conference in the world. Review the 2020…

databricks.com

The Klarna data team wrote an excellent summary of the summit.

Highlights from Spark+AI Summit 2020 for Data engineers

In these takeaways focusing on the data engineering topics, I’ll provide as resources, the most interesting talks I've…

engineering.klarna.com

Data discoverability is an essential aspect of data infrastructure. The value proposition of a data warehouse drops sharply when its data discovery system is weak. The Shopify data team writes about their data discovery system, and the post is an excellent, comprehensive overview of a data discovery design.

How We're Solving Data Discovery Challenges at Shopify

Humans generate a lot of data. Every two days we create as much data as we did from the beginning of time until 2003…

engineering.shopify.com

Catalog services are an essential metadata engine for data discovery and schema management. Hive Metastore and the AWS Glue Data Catalog are some of the catalog services used in data infrastructure. Apache Flink 1.9 added catalog integration, and this blog post describes how to integrate Apache Flink with Hive-based and Postgres-based catalog services.

Sharing is caring - Catalogs in Flink SQL

23 Jul 2020 Dawid Wysakowicz (@dwysakowicz) With an ever-growing number of people working with data, it's a common…

flink.apache.org

The University of Florida and NVIDIA Tuesday unveiled a plan to build the world’s fastest AI supercomputer in academia, delivering 700 petaflops of AI performance.

University of Florida, NVIDIA to Build Fastest AI Supercomputer in Academia | NVIDIA Blog

The University of Florida and NVIDIA Tuesday unveiled a plan to build the world's fastest AI supercomputer in academia…

blogs.nvidia.com

Time zones are a complicated yet crucial part of data infrastructure. Databricks writes an excellent overview of time zones, dates, and timestamps in Spark 3.0.

How to Effectively Use Dates and Timestamps in Spark 3.0

Apache Spark is a very popular tool for processing structured and unstructured data. When it comes to processing…

databricks.com
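
One reason time zones are tricky is that a timestamp names a single instant, while the wall-clock value shown for it depends on the zone used to render it. The sketch below illustrates this in plain Python rather than Spark. The fixed UTC offsets and their labels are assumptions chosen for the example; real zones change offset with daylight saving rules.

```python
from datetime import datetime, timedelta, timezone

# One instant on the timeline, expressed in UTC.
instant = datetime(2020, 7, 1, 12, 0, tzinfo=timezone.utc)

# Fixed offsets used only for illustration (assumed, no DST handling).
pacific_daylight = timezone(timedelta(hours=-7), "PDT")
india = timezone(timedelta(hours=5, minutes=30), "IST")

local_pdt = instant.astimezone(pacific_daylight)
local_ist = instant.astimezone(india)

print(local_pdt.isoformat())  # 2020-07-01T05:00:00-07:00
print(local_ist.isoformat())  # 2020-07-01T17:30:00+05:30

# Aware datetimes compare by instant, so all three are equal...
same_instant = local_pdt == local_ist == instant

# ...but the wall-clock readings differ once the zone is dropped.
wall_clocks_differ = local_pdt.replace(tzinfo=None) != local_ist.replace(tzinfo=None)
```

Dropping the zone, as in the last line, is where many pipeline bugs start: two identical instants become two different naive values.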

Pinterest writes about its shopping intent ML model, which drives shopping upsells in Pinterest search. The evolution of the model from the upsell click rate model to the “long click” model is an exciting read.

Driving Shopping Upsells from Pinterest Search

Felix Zhou | Shopping, Weiran Li | Shopping, Somnath Banerjee | Shopping

medium.com

Walmart wrote about stream processing with Spring Cloud. Spring Cloud provides stream processing on top of the familiar Spring Framework. The post gives an introduction to Spring Cloud, a sample application, and guidance on how to unit test it.

Streaming with Spring Cloud

Hands-on with Spring Cloud Stream