Hi, I'm Vineeth Guptha
Self-driven, quick learner, passionate about solving real-world problems using Data Science.
About
I am a Data Science & Machine Learning Engineer with 5 years of experience and 3 published research papers, solving business problems by building scalable end-to-end Machine Learning pipelines.
I earned my Master's in Data Science at the University of San Francisco and my Bachelor's from the Indian Institute of Technology, Madras (IITM).
- Languages: Python, C, HTML, Bash
- Databases: PostgreSQL, MongoDB
- Libraries: NumPy, Pandas, OpenCV, spaCy, Hugging Face
- Frameworks: Flask, Keras, TensorFlow, PyTorch, LangChain
- Tools & Technologies: Git, Docker, AWS, GCP, JIRA
Experience
- Architected an agentic multi-hop QA framework (IP bound) using on-device local LLMs with no additional memory footprint, enabling users to ask complex questions and receive accurate answers by leveraging multiple data sources; improved response quality by 25%.
- Built a game-configuration recommendation system for Omen AI using CatBoost and SHAP dependency plots, increasing users' FPS by 20% over previous models.
- Tools: Python, LangChain, Airflow, AWS, On-device AI
- Engineered an interactive LLM-based dashboard in production, integrating sentiment evaluation of articles and media agencies and mapping them to balance activity for a holistic performance overview, using a RAG framework and a function-calling mechanism.
- Researched, developed, and deployed a climate-risk modeling framework in a production environment to track and estimate collateral risk on assets worth $250 billion. Streamlined an ETL pipeline and developed Power BI dashboards to support strategic decisions.
- Utilized Apache Spark to streamline processing of extensive datasets while leading development of a Power BI dashboard for monitoring bank members' credit balances, enhancing the identification of fluctuations and bolstering financial oversight. Executed complex SQL queries across more than 40 tables, ensuring smooth data integration and delivering comprehensive insights.
- Performed hypothesis testing with Chi-square tests to evaluate loan-eligibility bias against minority groups in the VantageScore credit-checking framework versus the existing system, ensuring transparent and fair credit assessments; identified potential to expand the customer base by 33 million.
- Tools: Python, Power BI, AWS, Apache Spark, Airflow, DAGs
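A Chi-square test of the kind described above can be sketched with `scipy.stats.chi2_contingency`. The contingency table below is purely illustrative (hypothetical approval/denial counts per applicant group), not the actual study data.

```python
from scipy.stats import chi2_contingency

# Hypothetical contingency table: rows = applicant group,
# columns = (approved, denied). Counts are made up for the sketch.
observed = [
    [480, 120],   # majority-group applicants
    [310, 190],   # minority-group applicants
]

chi2, p_value, dof, expected = chi2_contingency(observed)

# A small p-value suggests approval rates differ between groups,
# i.e. potential eligibility bias worth investigating.
print(f"chi2={chi2:.2f}, p={p_value:.4f}, dof={dof}")
```

With a 2x2 table the test has one degree of freedom; `chi2_contingency` applies Yates continuity correction by default.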
Location Alias:
- Reduced support demand by 75% by implementing and deploying an automated pipeline that leverages the Fellegi-Sunter probabilistic model (98% accuracy) to map business listings from various digital directories and link businesses to parent entities.
- Improved data quality, consistency, and reliability by establishing a unified business hierarchy through the automated pipeline and efficiently monitoring online presence across platforms.
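The core of Fellegi-Sunter record linkage is summing per-field log-odds weights. A minimal sketch, with assumed m/u probabilities and field names (not the production pipeline's values):

```python
import math

# Illustrative Fellegi-Sunter scorer.
# m = P(field agrees | records are a true match)
# u = P(field agrees | records are a non-match)
FIELD_PARAMS = {
    "name":  {"m": 0.95, "u": 0.05},
    "phone": {"m": 0.90, "u": 0.01},
    "city":  {"m": 0.98, "u": 0.30},
}

def match_weight(field, agrees):
    """Log-odds contribution of one field comparison."""
    m, u = FIELD_PARAMS[field]["m"], FIELD_PARAMS[field]["u"]
    if agrees:
        return math.log2(m / u)            # agreement weight (positive)
    return math.log2((1 - m) / (1 - u))    # disagreement weight (negative)

def score_pair(comparisons):
    """Sum field weights; higher scores mean more likely the same business."""
    return sum(match_weight(f, a) for f, a in comparisons.items())

# Example: two listings agree on name and phone but not city.
score = score_pair({"name": True, "phone": True, "city": False})
```

Pairs scoring above an upper threshold are auto-linked, below a lower threshold auto-rejected, with the band in between sent for clerical review.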
AI for customer review response:
- Developed two variants of a review-response system: a semi-automated and a fully automated approach. The automated system leverages few-shot learning and prompt engineering, optimizing templates for industry-specific contexts, with an output parser that structures and refines responses from the GPT-3.5 Turbo model.
- Designed the semi-automated system around a BERT model for sentiment analysis, categorizing customer feedback, and employed machine learning algorithms to extract relevant feedback and propose responses.
- Designed and ran A/B tests to evaluate both systems, which collectively accelerated customer-support responses by 65% and reduced the cost per review response by 110%.
Navigation on the product platform using Large Language Models (LLMs):
- Developed a conversational website-navigation tool using the OpenAI GPT-3.5 Turbo model, letting users navigate the platform in natural language and improving accessibility, engagement, and overall usability.
- Addressed LLM hallucination by implementing a RAG (Retrieval-Augmented Generation) framework, supplemented with a purpose-built website-navigation dataset, to achieve accurate results.
Google Feature Recommendation system:
- Developed a customized recommendation system suggesting the top features businesses should adopt to improve discoverability and rankings in Google Maps over competitors, using an XGBoost model for feature selection and rank prediction.
- Improved accuracy from 75% to 89% by introducing a data-selection strategy that statistically samples the dataset across varying rank differences.
- Tools: Python, Flask, OpenCV, Keras, TensorFlow, PyTorch, spaCy, Hugging Face, LLMs
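The few-shot prompting and output-parsing pattern described above can be sketched as follows. The examples, field names, and template are hypothetical, and the model call is stubbed rather than hitting GPT-3.5 Turbo.

```python
import json

# Illustrative few-shot examples (made up, not the production templates).
FEW_SHOT_EXAMPLES = [
    {"review": "Food was cold and service slow.",
     "response": '{"sentiment": "negative", "reply": "We are sorry ..."}'},
    {"review": "Loved the quick checkout!",
     "response": '{"sentiment": "positive", "reply": "Thank you ..."}'},
]

def build_prompt(review, industry="restaurant"):
    """Assemble an industry-specific few-shot prompt for the LLM."""
    lines = [f"You write {industry} review replies. Answer in JSON "
             'with keys "sentiment" and "reply".']
    for ex in FEW_SHOT_EXAMPLES:
        lines.append(f'Review: {ex["review"]}\nAnswer: {ex["response"]}')
    lines.append(f"Review: {review}\nAnswer:")
    return "\n\n".join(lines)

def parse_response(raw):
    """Output parser: validate the model reply into a structured dict."""
    parsed = json.loads(raw)
    if parsed.get("sentiment") not in {"positive", "negative", "neutral"}:
        raise ValueError("unexpected sentiment label")
    return parsed

prompt = build_prompt("Great staff, but the app kept crashing.")
# In production the prompt would go to the model; here we parse a stubbed reply.
result = parse_response('{"sentiment": "neutral", "reply": "Thanks for the feedback!"}')
```

The parser both enforces a machine-readable schema and rejects off-template generations before they reach customers.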
HELIOS (Hate speech detection on Social Media):
- Developed a real-time hate-tweet identification system with geographic insights, fine-tuning NLP models such as GPT and BERT to detect hateful tweets.
- Employed active-learning methods to expand the hate-speech dataset from 50k to 5M tweets, integrating human annotators and ML algorithms. Generated $250,000 in savings by optimizing human-annotation resources and reducing reliance on Amazon Mechanical Turk.
FACTDEMIC (Fake Claim Detection and Meta-Fact Checking Through Textual Entailment-based Validation):
- Spearheaded the design and development of a 4-stage web application leveraging machine learning models and BERT textual entailment to automate the identification and validation of fake news across social media platforms.
- Incorporated open-source data-mining methodologies to furnish corroborative evidence and adhered to conceptual rules to ensure accurate detection. The system assists fact-checkers by surfacing candidate tweets and increased annotation speed by 400%.
Antharyami (Code-mixed Language Model):
- Researched the poor performance of code-mixed language models (from statistical to neural models) and proposed and implemented a data-ingestion strategy to improve the language model's overall efficiency.
- Implemented a beam-search algorithm to significantly enhance word prediction in statistical language models, contributing to more efficient and accurate language processing.
Tools: Python, Flask, OpenCV, Keras, TensorFlow, PyTorch
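Beam-search decoding over a statistical language model, as mentioned above, can be sketched with a toy bigram table. The vocabulary and probabilities are invented for illustration.

```python
import math

# Toy bigram model: P(next_word | word). Values are illustrative only.
BIGRAM = {
    "<s>":  {"the": 0.6, "a": 0.4},
    "the":  {"cat": 0.5, "dog": 0.5},
    "a":    {"cat": 0.7, "dog": 0.3},
    "cat":  {"sat": 0.9, "ran": 0.1},
    "dog":  {"sat": 0.2, "ran": 0.8},
}

def beam_search(start="<s>", steps=3, beam_width=2):
    """Keep the beam_width highest log-probability prefixes at each step."""
    beams = [([start], 0.0)]  # (token sequence, cumulative log-prob)
    for _ in range(steps):
        candidates = []
        for seq, logp in beams:
            for word, p in BIGRAM.get(seq[-1], {}).items():
                candidates.append((seq + [word], logp + math.log(p)))
        if not candidates:
            break
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:beam_width]  # prune to the best beam_width prefixes
    return beams

best_seq, best_logp = beam_search()[0]
```

Unlike greedy decoding, the beam keeps several competing prefixes alive, so a locally weaker word can still win if its continuations are strong.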
Achievements and Research Publications
Minority Positive Sampling for Switching Points - 11 citations
Code-Mixing (CM), or language mixing, is a social norm in multilingual societies and is quite prevalent in social media conversations in multilingual regions around the world. In this paper, we explore the problem of Language Modeling (LM) for code-mixed Hinglish text. To better understand the problem, we initially experimented with several statistical language-modeling techniques and then with contemporary neural language models, including self-attention models such as GPT and BERT. Our analysis suggests switching points are the main cause of the performance drop in LM for CM; we therefore introduce the idea of minority positive sampling, which selectively induces more switching-point samples to achieve better performance.
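The switching-point idea above can be sketched in a few lines: detect where the language tag flips within a sentence, then oversample those sentences for training. The sentences, tags, and oversampling factor below are toy values, not the paper's dataset or exact procedure.

```python
# Toy sketch of switching-point detection and positive sampling
# for code-mixed (Hinglish) text.
def switching_points(tags):
    """Indices where the language tag flips (e.g. Hindi -> English)."""
    return [i for i in range(1, len(tags)) if tags[i] != tags[i - 1]]

def oversample_minority(corpus, factor=3):
    """Duplicate sentences containing switching points to rebalance training."""
    augmented = []
    for tokens, tags in corpus:
        augmented.append((tokens, tags))
        if switching_points(tags):
            augmented.extend([(tokens, tags)] * (factor - 1))
    return augmented

corpus = [
    (["yeh", "movie", "was", "awesome"], ["hi", "en", "en", "en"]),  # one switch
    (["kal", "milte", "hain"], ["hi", "hi", "hi"]),                  # no switch
]
augmented = oversample_minority(corpus)
```

Because switching points are rare relative to monolingual runs, boosting their frequency gives the language model more evidence exactly where perplexity spikes.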
Overview of CONSTRAINT 2021 shared tasks: Detecting English COVID-19 fake news and Hindi hostile posts - 73 citations
The shared tasks are 'COVID19 Fake News Detection in English' and 'Hostile Post Detection in Hindi'. The tasks attracted 166 and 44 team submissions respectively. The most successful models were BERT or its variations.
Fighting an Infodemic: COVID-19 Fake News Dataset - 441 citations
Along with the COVID-19 pandemic, we are also fighting an `infodemic'. Fake news and rumors are rampant on social media, and believing them can cause significant harm, which is further exacerbated during a pandemic. To tackle this, we curate and release a manually annotated dataset of 10,700 social media posts and articles of real and fake news on COVID-19. We benchmark the dataset with four machine learning baselines - Decision Tree, Logistic Regression, Gradient Boost, and Support Vector Machine (SVM) - and obtain the best performance of 93.46% F1-score with SVM.
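An SVM baseline of the style benchmarked in the paper can be sketched with scikit-learn: TF-IDF features feeding a linear SVM. The four training posts and labels below are invented for illustration, not drawn from the released dataset.

```python
# Minimal TF-IDF + linear SVM fake-news baseline sketch.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

posts = [
    "New study confirms vaccine efficacy, officials say",      # real
    "Garlic water cures COVID-19 overnight, doctors hide it",  # fake
    "Health ministry reports daily case counts",               # real
    "5G towers spread the coronavirus, insider reveals",       # fake
]
labels = ["real", "fake", "real", "fake"]

# Unigram + bigram TF-IDF features into a linear-kernel SVM.
model = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LinearSVC())
model.fit(posts, labels)

pred = model.predict(["Officials say case counts are falling"])[0]
```

On the real 10,700-post dataset this family of models reached the 93.46% F1 reported above; the toy corpus here only demonstrates the pipeline shape.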
Developed and implemented a large language model (LLM)-based solution enabling businesses to navigate the product website using natural-language queries, improving option discoverability. Recognized as the global challenge winner among 80+ participants for introducing an innovative LLM framework.
Education
University of San Francisco, San Francisco, California, USA
Degree: Master of Science in Data Science
Relevant Coursework:
- Distributed Database Systems
- Large Language Models and MLOps
- Data Structures and Algorithms
- Advanced Machine Learning
- Experiments in Data Science
Indian Institute of Technology, Madras (IITM), Chennai, India
Degree: Bachelor of Technology
Relevant and Online Coursework:
- Introduction to Programming
- Machine Learning
- Deep Learning
- Data Science for Engineers