Hi, I'm Vineeth Guptha

Self-driven, quick Learner, passionate about solving real world problems using Data Science.

About

I am a Data Science & Machine Learning Engineer with 5 years of experience and 3 published research papers solving business problems by building scalable end-to-end Machine Learning pipelines.

I have done my Masters in Data Science at the University of San Francisco and have done my Bachelors from Indian Institute of Technology, Madras (IITM).

  • Languages: Python, C, HTML, Bash
  • Databases: PostgreSQL, MongoDB
  • Libraries: NumPy, Pandas, OpenCV, Spacy, Huggingface
  • Frameworks: Flask, Keras, TensorFlow, PyTorch, LangChain
  • Tools & Technologies: Git, Docker, AWS, GCP, JIRA

Experience

HP

Machine Learning Engineer
  • Architectured an agentic multi-hop QA framework (IP bound) using on-device local LLMs without needing any additional space in the memory, enabling users to ask complex questions and receive accurate answers by leveraging multiple data sources. Improving the response quality by 25%
  • Built game config recommendation system for Omen AI, by using Catboost and Shap dependency plots, which increased the users FPS by 20% compared to the previous models.

      • Tools: Python, Langchain, Airflow, AWS, On-device AI
Oct 2024 - Present
Federal Home Loan Bank of San Francisco logo

Data Scientist, Intern
  • Engineered an interactive LLM-based dashboard in production, integrating sentiment evaluation of articles and media agencies, and mapping to balance activity for holistic performance overview, utilizing RAG framework and functional calling mechanism.
  • Researched, developed, and deployed a climate risk modeling framework in production environment to track and estimate collateral risk of assets worth $250 billion. Streamlined an ETL pipeline and developed Power BI dashboards to help make strategic decisions.
  • Utilized Apache Spark to streamline the processing of extensive datasets while leading the development of a PowerBI dashboard for monitoring bank members' credit balances. This integration enhanced the identification of fluctuations and bolstered financial oversight. Executed complex SQL queries across more than 40 tables, ensuring smooth data integration and delivering comprehensive insights.
  • Performed hypothesis testing using Chi-square tests to evaluate loan eligibility bias among minority groups in the VantageScore credit checking framework versus the existing system, ensuring transparent and fair credit assessments for informed business decisions and identified potential to expand customer base by 33 million.

      • Tools: Python, Power BI, AWS, Apache Spark, Airflow, DAGs
Dec 2023 - June 2024
Reputation logo

Data Scientist
    Location Alias:
  • Support demand was reduced by 75% by the implementation and deployment of an automated pipeline leveraging the Fellegi-Sunter probabilistic model with 98% accuracy to map business listings from various digital directories, linking businesses to parent entities.
  • Improved data quality, consistency, and reliability by establishing a unified hierarchy of businesses through the automated pipeline and monitoring online presence across platforms efficiently.

  • AI for customer review response:
  • Developed two variants of review response system, a semi-automated and an automated approach. The automated system leverages few-shot learning and prompt engineering techniques, optimizing templates for industry-specific contexts. Integrated an output parser to structure and refine language model responses utilizing the GPT 3.5 turbo model.
  • Designed the semi-automated system to utilize a BERT model for sentiment analysis, categorizing customer feedback. Employed machine learning algorithms to extract relevant feedback from the dataset and propose responses.
  • Designed and performed A/B testing to evaluate the effectiveness of the systems. The systems collectively accelerated the rate of customer support responses by 65% and reduced the cost per review response by 110%.

  • Navigation on the product platform using Large Language Models (LLMs):
  • Developed a conversational website navigation tool by implementing Open AI GPT 3.5 Turbo model, enhancing user experience by empowering them to efficiently navigate the platform using natural language, thereby streamlining accessibility and engagement and augmenting overall usability.
  • Addressed LLM hallucination problem by implementing RAG (Retrieval-Augmented Generation) framework with additional creation of website navigation dataset to achieve accurate results.

  • Google Feature Recommendation system:
  • Developed a customized recommendation system for businesses suggesting top features to improve their discoverability and rankings in Google Maps over their competitors by implementing the XGBoost model for top feature selection and rank prediction.
  • The accuracy score has been improved from 75% to 89% by introducing a data selection strategy that statistically samples the dataset across various rank differences.

    • Tools: Python, Flask, OpenCV, Keras, Tensorflow, PyTorch, Spacy, Huggingface, LLMs
Nov 2021 - June 2023
Wipro logo

Senior AI Research Engineer
    HELIOS (Hate speech detection on Social Media):
  • Developed a real-time hate tweet identification system with geographic insights using NLP algorithms such as GPT and BERT fine-tuning to detect hateful tweets.
  • Employed active learning methods to expand the hate speech dataset from 50k to 5M tweets, integrating human annotators and ML algorithms, thereby enhancing system capabilities. Generated $250,000 in savings by optimizing human annotation resources, reducing reliance on Amazon Mechanical Turk.

  • FACTDEMIC (Fake Claim Detection and Meta-Fact Checking Through Textual Entailment-based Validation):
  • Spearheaded the design and development of a sophisticated 4-stage web application leveraging machine learning models and BERT textual entailment. Automated the identification and validation of fake news across social media platforms.
  • Incorporated open-source data mining methodologies to furnish corroborative evidence and adhered to conceptual rules to ensure accurate detection. The system has assisted fake fighters by providing potential tweets and increasing the speed of annotations by 400%.

  • Antharyami (Code-mixed Language Model):
  • Researched the poor performance of code-mixed language models (statistical to neural models) and proposed and implemented a data ingestion strategy to enhance the overall efficiency of the language model.
  • Implemented a beam search algorithm to significantly enhance word prediction capabilities within statistical language models, contributing to the development of more efficient and accurate language processing systems.

  • Tools: Python, Flask, OpenCV, Keras, Tensorflow, PyTorch
June 2019 - Nov 2021

Achievements and Research Publications

    Minority Positive Sampling for Switching Points - 11 citations

    Code-Mixing (CM) or language mixing is a social norm in multilingual societies. CM is quite prevalent in social media conversations in multilingual regions around the world. In this paper, we explore the problem of Language Modeling (LM) for code-mixed Hinglish text. To better understand the problem of LM for CM, we initially experimented with several statistical language modeling techniques and consequently experimented with contemporary neural language models including the self-attention models such as GPT and BERT. Our analysis suggests switching-points are the main challenge for the LMCM performance drop, therefore in this paper we introduce the idea of minority positive sampling to selectively induce more sample to achieve better performance.
France, 2020
    Overview of constraint 2021 shared tasks: Detecting English covid-19 fake news and Hindi hostile posts - 73 citations
    The shared tasks are 'COVID19 Fake News Detection in English' and 'Hostile Post Detection in Hindi'. The tasks attracted 166 and 44 team submissions respectively. The most successful models were BERT or its variations.

    Fighting an Infodemic: COVID-19 Fake News Dataset - 441 citations
    Along with COVID-19 pandemic we are also fighting an `infodemic'. Fake news and rumors are rampant on social media. Believing in rumors can cause significant harm. This is further exacerbated at the time of a pandemic. To tackle this, we curate and release a manually annotated dataset of 10,700 social media posts and articles of real and fake news on COVID-19. We benchmark the annotated dataset with four machine learning baselines - Decision Tree, Logistic Regression, Gradient Boost, and Support Vector Machine (SVM). We obtain the best performance of 93.46% F1-score with SVM.
2021


    Developed and implemented a large language model (LLM)-based solution, enabling businesses to navigate the product website using natural language queries, thereby enhancing option discoverability. Recognized as the global challenge winner, surpassing 80+ participants, for introducing an innovative LLM framework, showcasing expertise in the field and providing a solution proposal.
2022

Projects

Smart News Application
LLM based Smart News Application

Projects
    Tools: LLMs, Langchain, Streamlit, Prompt Engineering, RAG framework

    Bored of reading long news articles? This tool helps you quickly consume news articles by doing sentiment analysis and summarizing them and makes you updated all the time.
Article Recommendation System
Article Recommendation System

Projects
    Tools: Recommendation, NLP, Embeddings, Flask, Python, Content based recommendation

Fake News Detection
Fake News Detection

Projects
    Tools: Fine-tuning Transformers, Python, Machine Learning, Statistics Generation

    Real-time fake news detection in twitter

Skills

Languages and Databases

Python
HTML5
MySQL
PostgreSQL
Shell Scripting

Libraries

NumPy
Pandas
OpenCV
scikit-learn
matplotlib

Frameworks

Flask
Keras
TensorFlow
PyTorch

Other

Git
AWS

Education

University of San Francisco

San Francisco, California, USA

Degree: Master of Science in Data Science

    Relevant Courseworks:

    • Distributed Database Systems
    • Large Language Models MLOps
    • Data Structure and Algorithms
    • Advance Machine Learning
    • Experiments in Data Science

IIT, Madras

Chennai, India

Degree: Bachelor of Technology

    Relevant and Online Courseworks:

    • Introduction to Programming
    • Machine Learning
    • Deep Learning
    • Data Science for Engineers

Contact