About Me
I'm a passionate Data Engineer and Python Developer with a strong foundation in building and optimizing data pipelines. Currently pursuing a Bachelor's in Programming and Data Science at IIT Madras, I bring hands-on experience with SQL, Python, Spark, AWS, and Azure to solve complex data challenges.
My expertise lies in ELT processes, data warehousing, and building efficient Python backends. In previous projects I've reduced compute costs by 33% and storage costs by up to 60%, optimizing both performance and resource usage.
Key Achievements:
- Built PySpark pipelines handling 22+ tebibytes of data
- Reduced processing time from days to minutes for large-scale data operations
- Implemented cost-saving measures resulting in ₹3 lakhs monthly savings
- Organized a successful tech event at IIT Madras with 250+ registrations
Professional Journey
Leading data engineering initiatives and optimizing large-scale data processing pipelines.
Key Achievements:
- Built a PySpark pipeline to handle 22 tebibytes of data, reducing processing time from 1-2 days to 20-30 minutes.
- Developed an automation tool capable of fetching up to 5 terabytes of data from platforms like Hugging Face and arxiv.org within 60 minutes.
- Implemented a multi-step data pipeline inspired by OBELICS, processing 230M+ web sources and extracting 50-100M Indic images with text (a simplified sketch of this kind of filtering follows this list).
- Optimized compute costs by 33%, saving approximately ₹3 lakhs monthly.
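One way to picture the multi-step pipeline above is as a chain of PySpark filter and transform stages over crawled web metadata. The sketch below is a simplified illustration under assumed inputs: the S3 paths, the column names (url, lang, alt_text, width, height), and the thresholds are placeholders, not the production logic.

```python
from pyspark.sql import SparkSession, functions as F

spark = (
    SparkSession.builder
    .appName("indic-image-metadata-filter")  # hypothetical job name
    .getOrCreate()
)

# Hypothetical input: crawled web-source metadata stored as Parquet (placeholder path).
raw = spark.read.parquet("s3://bucket/web-metadata/")

# Step 1: keep records whose page language tag looks Indic (assumed column `lang`).
indic_langs = ["hi", "bn", "ta", "te", "mr", "gu", "kn", "ml", "pa", "or"]
indic = raw.filter(F.col("lang").isin(indic_langs))

# Step 2: basic quality filters on the image and its surrounding text (assumed columns).
filtered = (
    indic
    .filter((F.col("width") >= 256) & (F.col("height") >= 256))
    .filter(F.length(F.col("alt_text")) > 20)
    .dropDuplicates(["url"])
)

# Step 3: write the surviving image-text pairs back out, partitioned by language
# so downstream jobs can read one language at a time.
filtered.write.mode("overwrite").partitionBy("lang").parquet(
    "s3://bucket/indic-image-text/"  # placeholder output path
)
```

Partitioning the output by language is one way to keep later per-language stages from scanning the full dataset.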
Focused on data pipeline development, migrations, and storage optimization.
Key Achievements:
- Engineered data migrations from data lakes to data warehouses using PySpark on EMR clusters (a simplified sketch of the pattern appears after this list).
- Built and maintained over 50 data pipelines, increasing efficiency by 60%.
- Enhanced data storage efficiency on Azure SQL and AWS Redshift, cutting storage needs by 35%.
- Integrated data from multiple APIs into Azure SQL and AWS Redshift, deploying cloud functions for process automation.
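A minimal sketch of the lake-to-warehouse pattern behind the first bullet above, assuming Parquet in S3 as the lake and a JDBC-reachable warehouse. The paths, table name, and credentials are placeholders; in practice a dedicated Redshift connector (and the matching JDBC driver on the EMR classpath) would be configured.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("lake-to-warehouse-migration").getOrCreate()

# Read one table's worth of data from the lake (placeholder S3 path).
orders = spark.read.parquet("s3://data-lake/orders/")

# Light conformance before loading: typed columns, deduplicated business keys.
conformed = (
    orders
    .withColumn("order_date", F.to_date("order_date"))
    .dropDuplicates(["order_id"])
)

# Load into the warehouse over JDBC (placeholder connection details; the
# Redshift JDBC driver must be available on the cluster classpath).
(
    conformed.write
    .format("jdbc")
    .option("url", "jdbc:redshift://example-cluster:5439/analytics")
    .option("dbtable", "public.orders")
    .option("user", "etl_user")
    .option("password", "***")
    .mode("append")
    .save()
)
```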
Specialized in backend development and data analysis for various client projects.
Key Achievements:
- Optimized the airline forecasting process, reducing processing time by 80-90% through efficient use of pandas (see the sketch after this list).
- Reduced storage costs by 60% by transitioning from SQL to NoSQL databases.
- Designed and implemented backend solutions using Django, resolving bugs and issues across client projects.
- Built a PySpark pipeline with caching to accelerate web analytics, achieving millisecond-level response times.
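One common way to get that kind of 80-90% speed-up with pandas is to replace Python-level row loops with vectorized column operations. Whether or not that is exactly what happened here, the sketch below illustrates the general idea; the column names and toy data are made up, not the client's forecasting logic.

```python
import numpy as np
import pandas as pd

# Toy flight-demand data; the real inputs and forecast model are not shown here.
df = pd.DataFrame({
    "route": np.random.choice(["DEL-BOM", "BLR-MAA", "HYD-CCU"], size=1_000_000),
    "bookings": np.random.randint(0, 300, size=1_000_000),
    "capacity": np.random.randint(150, 320, size=1_000_000),
})

# Slow pattern: a Python-level loop via apply, evaluating one row at a time.
# load_factor_slow = df.apply(lambda r: r["bookings"] / r["capacity"], axis=1)

# Vectorized pattern: the same computation expressed on whole columns at once.
df["load_factor"] = df["bookings"] / df["capacity"]

# Grouped aggregations stay vectorized too.
route_summary = df.groupby("route", as_index=False)["load_factor"].mean()
print(route_summary)
```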
Project Portfolio
Key Achievements:
- Reduced processing time from 1-2 days to 20-30 minutes
- Implemented efficient data partitioning and caching strategies (sketched after this list)
- Integrated with AWS services for scalable and cost-effective processing
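A minimal sketch of the partitioning and caching ideas named above, with placeholder paths, column names, and partition counts rather than the project's actual schema or tuning.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("partition-and-cache-demo").getOrCreate()

events = spark.read.parquet("s3://bucket/events/")  # placeholder input

# Repartition on the key that later aggregations group by, so the data is
# shuffled once instead of repeatedly.
by_user = events.repartition(400, "user_id")

# Cache a DataFrame that several downstream aggregations reuse.
by_user.cache()

daily_counts = by_user.groupBy("user_id", F.to_date("ts").alias("day")).count()
top_users = by_user.groupBy("user_id").count().orderBy(F.desc("count")).limit(100)

# Write partitioned by day so consumers can prune to just the dates they need.
daily_counts.write.mode("overwrite").partitionBy("day").parquet(
    "s3://bucket/daily-counts/"  # placeholder output
)
top_users.show()
```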
Key Achievements:
- Processed millions of events per second in real-time (a streaming sketch follows this list)
- Implemented machine learning models for dynamic persona generation
- Designed a scalable architecture to handle increasing data volumes
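The real-time bullet above maps naturally onto a stream-processing engine. The sketch below assumes Kafka plus Spark Structured Streaming, one reasonable stack for this rather than necessarily the one used on the project; the broker, topic, and event schema are placeholders, and the Kafka source requires the spark-sql-kafka package.

```python
from pyspark.sql import SparkSession, functions as F, types as T

spark = SparkSession.builder.appName("event-stream-demo").getOrCreate()

# Hypothetical event schema; a real payload would be richer.
schema = T.StructType([
    T.StructField("user_id", T.StringType()),
    T.StructField("event_type", T.StringType()),
    T.StructField("ts", T.TimestampType()),
])

# Consume JSON events from a Kafka topic (placeholder broker and topic names).
events = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")
    .option("subscribe", "user-events")
    .load()
    .select(F.from_json(F.col("value").cast("string"), schema).alias("e"))
    .select("e.*")
)

# Windowed per-user counts could feed a downstream persona-generation step.
counts = (
    events
    .withWatermark("ts", "1 minute")
    .groupBy(F.window("ts", "1 minute"), "user_id", "event_type")
    .count()
)

query = (
    counts.writeStream
    .outputMode("update")
    .format("console")  # stand-in sink; production would write to a real store
    .start()
)
query.awaitTermination()
```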
Key Achievements:
- Reduced compute costs by 33%, saving approximately ₹3 lakhs monthly
- Processed metadata from 230M+ web sources efficiently
- Implemented advanced filtering techniques for high-quality data extraction
Key Achievements:
- Reduced processing time by 80-90% through efficient use of pandas
- Improved forecast accuracy, leading to better resource allocation
- Implemented automated testing to ensure consistent results (a small test sketch follows)
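A minimal sketch of the kind of automated check behind the last bullet, written with pytest against a hypothetical generate_forecast function; the real suite, inputs, and tolerances are not shown here.

```python
import numpy as np
import pandas as pd
import pytest


def generate_forecast(history: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical stand-in for the real forecasting function."""
    out = history.groupby("route", as_index=False)["demand"].mean()
    return out.rename(columns={"demand": "forecast"})


@pytest.fixture
def history():
    return pd.DataFrame({
        "route": ["DEL-BOM", "DEL-BOM", "BLR-MAA", "BLR-MAA"],
        "demand": [120, 140, 80, 100],
    })


def test_forecast_is_deterministic(history):
    # Running the same inputs twice must produce identical results.
    first = generate_forecast(history)
    second = generate_forecast(history)
    pd.testing.assert_frame_equal(first, second)


def test_forecast_values_within_expected_range(history):
    forecast = generate_forecast(history)
    # Forecasts should stay non-negative and match the historical mean here.
    assert (forecast["forecast"] >= 0).all()
    assert np.isclose(
        forecast.loc[forecast["route"] == "DEL-BOM", "forecast"].iloc[0], 130.0
    )
```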