About Me

Mohd Nauman
Data Engineer & Python Developer

I'm a passionate Data Engineer and Python Developer with a strong foundation in building and optimizing data pipelines. Currently pursuing a Bachelor's in Programming and Data Science from IIT Madras, I bring hands-on experience in SQL, Python, Spark, AWS, and Azure to solve complex data challenges.

My expertise lies in ELT processes, data warehousing, and developing efficient Python backends. I've successfully reduced compute costs by 33% and storage costs by up to 60% in previous projects, showcasing my ability to optimize both performance and resources.

Key Achievements:

  • Built PySpark pipelines handling 22+ tebibytes of data
  • Reduced processing time from 1-2 days to 20-30 minutes for large-scale data operations
  • Implemented cost-saving measures resulting in ₹3 lakhs monthly savings
  • Organized a successful tech event at IIT Madras with 250+ registrations

Certifications:

  • Harvard CS50
  • Azure Data Fundamentals
  • AWS Certified Cloud Practitioner
  • Cutshort Certified Python - Advanced

Technical Expertise

Apache Spark: 95%
AWS (S3, EC2, Redshift, EMR, Glue): 90%
Azure (SQL, Blob, Functions): 88%
Databricks: 85%
Kubernetes: 80%

Professional Journey

Data Ops
Ola Krutrim - Bengaluru, KA
Feb 2024 - Present

Leading data engineering initiatives and optimizing large-scale data processing pipelines.

Key Achievements:

  • Built a PySpark pipeline to handle 22 tebibytes of data, reducing processing time from 1-2 days to 20-30 minutes.
  • Developed an automation tool capable of fetching up to 5 terabytes of data from platforms like Hugging Face and arxiv.org within 60 minutes.
  • Implemented a multi-step data pipeline inspired by OBELICS, processing 230M+ web sources and extracting 50-100M Indic images with text.
  • Reduced compute costs by 33%, saving approximately ₹3 lakhs monthly.

Python Developer
Namasys Private Limited - Remote
Jun 2022 - Feb 2024

Focused on data pipeline development, migrations, and storage optimization.

Key Achievements:

  • Engineered data migrations from data lakes to data warehouses using PySpark on EMR clusters.
  • Built and maintained over 50 data pipelines, increasing efficiency by 60%.
  • Enhanced data storage efficiency on Azure SQL and AWS Redshift, cutting storage needs by 35%.
  • Integrated data from multiple APIs into Azure SQL and AWS Redshift, deploying cloud functions for process automation.

Associate Software Engineer
Knackbout Studio Pvt Ltd - Bengaluru, KA
Aug 2021 - May 2022

Specialized in backend development and data analysis for various client projects.

Key Achievements:

  • Optimized an airline forecasting process, reducing processing time by 80-90% through efficient use of pandas.
  • Reduced storage costs by 60% by transitioning from SQL to NoSQL databases.
  • Designed and implemented backend solutions using Django, handling bugs and issues effectively.
  • Built a PySpark pipeline with caching to accelerate web analytics, achieving millisecond-level response times.

Project Portfolio

Large-Scale PySpark Pipeline
Engineered a robust PySpark pipeline capable of processing 22 tebibytes of data with optimized performance.
PySpark
AWS EMR
S3

Key Achievements:

  • Reduced processing time from 1-2 days to 20-30 minutes
  • Implemented efficient data partitioning and caching strategies
  • Integrated with AWS services for scalable and cost-effective processing

Real-time Consumer Behavior Analysis
Developed a real-time system for generating consumer behavior personas using streaming data.
Kafka
Spark Streaming
MongoDB

Key Achievements:

  • Processed millions of events per second in real time
  • Implemented machine learning models for dynamic persona generation
  • Designed a scalable architecture to handle increasing data volumes

OBELICS-inspired Cost Optimization
Implemented an OBELICS-inspired pipeline that resulted in significant cost savings.
AWS
Terraform
Python

Key Achievements:

  • Reduced compute costs by 33%, saving approximately ₹3 lakhs monthly
  • Processed metadata from 230M+ web sources efficiently
  • Implemented advanced filtering techniques for high-quality data extraction

Airline Forecasting System Optimization
Optimized an airline forecasting system, dramatically reducing processing time through advanced algorithms.
Python
Pandas
Scikit-learn

Key Achievements:

  • Reduced processing time by 80-90% through efficient use of pandas
  • Improved forecast accuracy, leading to better resource allocation
  • Implemented automated testing to ensure consistent results

Data Insights

Skills Overview
Comprehensive overview of technical skills and proficiency levels

Get In Touch

Contact Information
Feel free to reach out for opportunities or collaborations