Satvik Jadhav

Apache Spark In-Memory Computations: A Mini Deep Dive

How Apache Spark uses in-memory computation to accelerate data transformations, the benefits of in-memory processing over disk operations, and the impact of hardware choices on performance.

Introduction to Data Modeling

Data Modeling: diagramming the data objects/points in an information system. Any type of data modeling will always be linked to our business needs and requirements, and to how easily we want the information to be shown to end users. Data modeling is not about how the data is going to be processed, but about how the data is best represented. It means thinking about how we are going to store the data, segregate it, and build the logic for the data to be fetched ...

What is Docker

What is Docker? (IBM definition) Docker is an open source containerization platform. It enables developers to package applications into containers: standardized executable components combining application source code with the operating system (OS) libraries and dependencies required to run that code in any environment. Containerization (IBM definition): Containerization is the packaging of software code with just the operating system (OS) libraries and dependencies required to run the code to create a single lightweight executable, called a container, that runs consistently on any infrastructure ...

Use PySpark Locally With Docker

Installing Spark on Linux: Here we’ll learn how to install Spark 3.0.3 for Linux. We tested it on Ubuntu 20.04 (also WSL), but it should work for other Linux distros as well. Installing Java: Download OpenJDK 11 or Oracle JDK 11 (the version matters: Spark requires Java 8 or 11). We’ll use OpenJDK. Download it (e.g. to ~/spark):
    wget https://download.java.net/java/GA/jdk11/9/GPL/openjdk-11.0.2_linux-x64_bin.tar.gz
Unpack it: ...

Order of Execution in a SQL Query

SQL Query Order of Execution: Each SQL query begins with finding the data that we need in a database; this data is then filtered down into something that can be processed and understood as quickly as possible. Because each part of the query is executed sequentially, it’s important to understand the order of execution so that we know what results are accessible where. Let’s consider the query below:
    SELECT DISTINCT column, AGG_FUNC(column_or_expression), …
    FROM mytable
    JOIN another_table
      ON mytable.column = another_table.column
    WHERE constraint_expression
    GROUP BY column
    HAVING constraint_expression
    ORDER BY column ASC/DESC
    LIMIT count OFFSET count;
Query order of execution: 1. FROM and JOINs ...
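As a small illustration of the clause ordering in the query template above, here is a minimal sketch using Python's built-in sqlite3 module. The `orders` table, its columns, the sample rows, and the `top_customers` helper are all hypothetical and exist only for this example. It shows that WHERE removes rows before grouping, so HAVING and the aggregate only see the rows that survived the filter.

```python
import sqlite3

# Hypothetical in-memory table used only for this illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer TEXT, amount INTEGER)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [
        ("ann", 5), ("ann", 20), ("ann", 30),
        ("bob", 50), ("bob", 8),
        ("cat", 15), ("cat", 25), ("cat", 40),
    ],
)

QUERY = """
SELECT customer, SUM(amount) AS total   -- aggregate computed per group
FROM orders                             -- working set of rows
WHERE amount > ?                        -- row filter, applied before grouping
GROUP BY customer                       -- remaining rows are grouped
HAVING COUNT(*) >= 2                    -- group filter, sees filtered rows only
ORDER BY total DESC                     -- sort the surviving groups
LIMIT ?                                 -- trim the sorted result
"""


def top_customers(connection, min_amount, limit):
    """Run the query with a row-level threshold and a result limit."""
    return connection.execute(QUERY, (min_amount, limit)).fetchall()


# With the row filter, bob keeps only one order, so HAVING drops him.
with_filter = top_customers(conn, 10, 3)
# With an effectively disabled filter, bob has two orders and passes HAVING.
no_filter = top_customers(conn, 0, 3)

print(with_filter)
print(no_filter)
```

Comparing the two results makes the ordering visible: the same HAVING condition keeps or drops a customer depending on what WHERE let through first.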