Mykhailo Pavlov

GitHub h-Index Analyzer & Predictor

In Progress
Started Sep 2025
Python, Machine Learning, GitHub GraphQL API, Scikit-Learn, Data Engineering, SQLite, Random Forest, Pandas

About the Project

The Goal

Can we measure a developer's impact the same way we measure academic research? For my Bachelor's thesis, I adapted the h-index for GitHub, using repository stars in place of citations: a user's h-index is the largest h such that h of their repositories have at least h stars each. The result quantifies user influence across the platform.
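As a minimal sketch of the adapted metric, the function below computes an h-index from a list of per-repository star counts. The function name and the sample numbers are illustrative and do not come from the thesis code.

```python
def h_index(star_counts):
    """Return the largest h such that at least h repositories have >= h stars."""
    ranked = sorted(star_counts, reverse=True)
    h = 0
    for rank, stars in enumerate(ranked, start=1):
        if stars >= rank:
            h = rank  # the top `rank` repositories all have at least `rank` stars
        else:
            break
    return h


# Made-up example: three repositories have at least 3 stars, but not four with 4.
print(h_index([50, 12, 7, 3, 3, 0]))  # prints 3
```

Sorting once and scanning in rank order keeps the computation at O(n log n) per user, which is cheap next to the cost of collecting the data.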

Data Engineering at Scale

The biggest challenge wasn't the math; it was the data collection. GitHub's API rate limits make collecting data on millions of users impractical with a single token and sequential requests.

  • GraphQL over REST: I chose GraphQL to fetch nested repository data in single requests, maximizing data yield per rate-limit point.
  • Parallel Processing: The initial sequential collector topped out at ~10k users/day. A multi-token rotation system combined with parallel threads increased throughput 5x, collecting data on ~3.5 million users in just five days.
  • Storage: I used SQLite because it needs no server or configuration, and stored user profiles and repository metadata in a normalized schema.
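The three points above can be sketched together as follows. This is not the thesis implementation: the query shape, the table layout, and every name here (`TokenRotator`, `fetch_user`, `collect`, `store`) are assumptions made for illustration, and real rate-limit handling (checking the remaining budget, backing off, retrying) is left out.

```python
import itertools
import json
import sqlite3
import threading
import urllib.request
from concurrent.futures import ThreadPoolExecutor

GITHUB_GRAPHQL_URL = "https://api.github.com/graphql"

# One request returns the profile plus star counts of the user's top repositories.
USER_QUERY = """
query($login: String!) {
  user(login: $login) {
    login
    followers { totalCount }
    repositories(first: 100, ownerAffiliations: [OWNER], isFork: false,
                 orderBy: {field: STARGAZERS, direction: DESC}) {
      nodes { name stargazerCount }
    }
  }
}
"""

# Assumed normalized layout: one row per user, one row per repository.
SCHEMA = """
CREATE TABLE IF NOT EXISTS users (
    login     TEXT PRIMARY KEY,
    followers INTEGER NOT NULL
);
CREATE TABLE IF NOT EXISTS repositories (
    owner TEXT NOT NULL REFERENCES users(login),
    name  TEXT NOT NULL,
    stars INTEGER NOT NULL,
    PRIMARY KEY (owner, name)
);
"""


class TokenRotator:
    """Hands out tokens round-robin; safe to call from several threads."""

    def __init__(self, tokens):
        tokens = list(tokens)
        if not tokens:
            raise ValueError("at least one token is required")
        self._cycle = itertools.cycle(tokens)
        self._lock = threading.Lock()

    def next_token(self):
        with self._lock:
            return next(self._cycle)


def fetch_user(login, token):
    """POST the query for one login; error handling is omitted in this sketch."""
    payload = json.dumps({"query": USER_QUERY, "variables": {"login": login}})
    request = urllib.request.Request(
        GITHUB_GRAPHQL_URL,
        data=payload.encode("utf-8"),
        headers={"Authorization": f"Bearer {token}",
                 "Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.load(response)["data"]["user"]


def collect(logins, tokens, fetch=fetch_user, workers=8):
    """Fetch users in parallel, taking the next token for every request."""
    rotator = TokenRotator(tokens)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda login: fetch(login, rotator.next_token()),
                             logins))


def store(connection, users):
    """Write results from a single thread; the sqlite3 connection is not shared."""
    connection.executescript(SCHEMA)
    for user in users:
        if user is None:  # the API returns null for unknown logins
            continue
        connection.execute("INSERT OR REPLACE INTO users VALUES (?, ?)",
                           (user["login"], user["followers"]["totalCount"]))
        connection.executemany(
            "INSERT OR REPLACE INTO repositories VALUES (?, ?, ?)",
            [(user["login"], repo["name"], repo["stargazerCount"])
             for repo in user["repositories"]["nodes"]],
        )
    connection.commit()
```

A run would look roughly like `store(sqlite3.connect("github.db"), collect(logins, tokens))`, where `github.db` is a placeholder file name. Keeping network calls in the worker threads and all database writes in the calling thread avoids sharing a SQLite connection across threads.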

Machine Learning & Prediction

Beyond computing the metric, I built a model to predict a user's h-index from profile features (followers, repo count, star distribution).

  • Model: A Random Forest Regressor handled the non-linear relationships between the features and the h-index better than linear models.
  • Handling Skew: The h-index distribution is heavily right-skewed. Applying a log2 transformation to the target variable improved the fit and reduced the leverage of extreme outliers.
  • Performance: The model achieved an R² of 0.94 on the test set. However, it consistently under-predicts extreme values (e.g., users with h-index > 100), indicating that star count alone doesn't capture viral impact.
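The modelling setup described above can be sketched with scikit-learn as follows. Everything about the data is synthetic and purely illustrative; the feature columns, hyperparameters, and the use of log2(1 + h) (chosen here so that an h-index of 0 stays finite) are assumptions, since the write-up only states that a log2 transformation was applied to the target.

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n_users = 2000

# Synthetic stand-ins for profile features; NOT the thesis dataset.
total_stars = rng.lognormal(mean=3.0, sigma=2.0, size=n_users)
repo_count = rng.integers(1, 200, size=n_users)
followers = rng.lognormal(mean=2.0, sigma=1.5, size=n_users)
X = np.column_stack([total_stars, repo_count, followers])
feature_names = ["total_stars", "repo_count", "followers_count"]

# Toy right-skewed target that depends mostly on total_stars.
y = np.floor(np.sqrt(total_stars) * rng.uniform(0.5, 1.0, size=n_users))


def log2_plus_one(values):
    """Forward transform applied to the target before fitting."""
    return np.log2(values + 1.0)


def inverse_log2_plus_one(values):
    """Map predictions back to the original h-index scale."""
    return np.exp2(values) - 1.0


# The wrapper fits the forest on the transformed target and
# inverts the transform automatically at prediction time.
model = TransformedTargetRegressor(
    regressor=RandomForestRegressor(n_estimators=100, random_state=0),
    func=log2_plus_one,
    inverse_func=inverse_log2_plus_one,
)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)
model.fit(X_train, y_train)
predictions = model.predict(X_test)

print("R^2 on held-out toy data:", r2_score(y_test, predictions))

# Impurity-based importances of the fitted forest, one value per feature.
importances = model.regressor_.feature_importances_
for name, importance in zip(feature_names, importances):
    print(f"{name}: {importance:.3f}")
```

Wrapping the regressor in `TransformedTargetRegressor` keeps the transform and its inverse in one place, so scores are reported on the original h-index scale rather than the log scale. The numbers this toy run prints say nothing about the thesis results.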

Key Takeaways

  • API Limits are Hard: Efficient data collection requires more than just code; it takes deliberate token management, such as rotating requests across multiple tokens, to stay within rate limits.
  • Features Matter: total_stars was the strongest predictor, while followers_count had surprisingly low importance.
  • Honesty in Metrics: While the model performs well on average, it fails on the "top 1%" of users. This limitation is documented in the thesis rather than hidden.
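One way to make the tail limitation visible is to report the error separately for the highest-ranked users instead of relying on a single aggregate score. The helper below is a hypothetical sketch of such a check, and the numbers in the example are made up for illustration, not taken from the thesis.

```python
def error_by_segment(y_true, y_pred, top_fraction=0.01):
    """Mean signed error (prediction - truth) for the top users vs. the rest."""
    ranked = sorted(zip(y_true, y_pred), key=lambda pair: pair[0], reverse=True)
    cutoff = max(1, round(len(ranked) * top_fraction))

    def mean_signed_error(pairs):
        if not pairs:
            return float("nan")
        return sum(pred - true for true, pred in pairs) / len(pairs)

    return {
        "top": mean_signed_error(ranked[:cutoff]),
        "rest": mean_signed_error(ranked[cutoff:]),
    }


# Made-up values: the single largest h-index is badly under-predicted.
actual = [120, 40, 10, 5, 3, 2, 2, 1, 1, 0]
predicted = [70, 38, 11, 5, 3, 2, 2, 1, 1, 0]
result = error_by_segment(actual, predicted, top_fraction=0.1)
print(result["top"])  # prints -50.0
```

A strongly negative value for the top segment next to a near-zero value for the rest is the pattern that a high overall R² can hide.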

Tech Stack

Python, GitHub GraphQL API, SQLite, Scikit-Learn, Pandas, Matplotlib