GitHub h-Index Analyzer & Predictor
In Progress
Started Sep 2025
Python, Machine Learning, GitHub GraphQL API, Scikit-Learn, Data Engineering, SQLite, Random Forest, Pandas
About the Project
The Goal
Can we measure a developer's impact the same way we measure academic research? For my Bachelor's thesis, I adapted the h-index metric for GitHub, using repository stars instead of citations, to quantify user influence across the platform. Under this adaptation, a user has an h-index of h if h of their repositories have at least h stars each.
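To make the metric concrete, here is a minimal sketch of the calculation, assuming a user's repositories are given as a plain list of star counts. The function name `h_index` is illustrative and not taken from the thesis code.

```python
def h_index(star_counts):
    """Return the largest h such that h repositories have at least h stars each."""
    h = 0
    # Rank repositories from most to least starred; the h-index is the last
    # rank at which the star count is still at least as large as the rank.
    for rank, stars in enumerate(sorted(star_counts, reverse=True), start=1):
        if stars >= rank:
            h = rank
        else:
            break
    return h


# Six repositories: three of them have at least 3 stars, but not four with 4.
print(h_index([25, 8, 5, 3, 3, 0]))  # 3
```

A single viral repository does not move the score much: `h_index([10000])` is 1, which is the same property that makes the academic h-index reward sustained output over one-off hits.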
Data Engineering at Scale
The biggest challenge wasn't the math; it was the data collection. GitHub's API rate limits make scraping millions of users impractical with standard methods.
- GraphQL over REST: I chose GraphQL to fetch nested repository data in single requests, maximizing data yield per rate-limit point.
- Parallel Processing: Initial sequential processing capped at ~10k users/day. By implementing a multi-token rotation system and parallel threads, I increased throughput 5x, collecting data on ~3.5 million users in just five days.
- Storage: Used SQLite for its zero-configuration footprint, storing user profiles and repository metadata in a normalized schema.
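The three pieces above could fit together roughly as follows. This is a sketch under stated assumptions, not the thesis implementation: `USER_QUERY`, `TokenPool`, `collect`, the table layout, and the `fetch_user` callback are all hypothetical names, and the network call is replaced by a stub so the example runs offline. The query's field names follow GitHub's public GraphQL schema, but the actual query used in the project is not shown in the source.

```python
import itertools
import sqlite3
import threading
from concurrent.futures import ThreadPoolExecutor

# Hypothetical query: one request returns a user's follower count plus
# their repositories with star counts (nested data in a single round trip).
USER_QUERY = """
query($login: String!) {
  user(login: $login) {
    followers { totalCount }
    repositories(first: 100, ownerAffiliations: OWNER,
                 orderBy: {field: STARGAZERS, direction: DESC}) {
      nodes { name stargazerCount }
    }
  }
}
"""

# Assumed normalized layout: one row per user, one row per repository.
SCHEMA = """
CREATE TABLE IF NOT EXISTS users (
    login     TEXT PRIMARY KEY,
    followers INTEGER NOT NULL
);
CREATE TABLE IF NOT EXISTS repositories (
    owner TEXT NOT NULL REFERENCES users(login),
    name  TEXT NOT NULL,
    stars INTEGER NOT NULL,
    PRIMARY KEY (owner, name)
);
"""


class TokenPool:
    """Hands out API tokens round-robin; safe to call from several threads."""

    def __init__(self, tokens):
        self._cycle = itertools.cycle(tokens)
        self._lock = threading.Lock()

    def next_token(self):
        with self._lock:
            return next(self._cycle)


def collect(logins, tokens, fetch_user, db_path=":memory:", workers=4):
    """Fetch users in worker threads; write to SQLite from the calling thread."""
    pool = TokenPool(tokens)
    conn = sqlite3.connect(db_path)
    conn.executescript(SCHEMA)

    def task(login):
        return fetch_user(login, pool.next_token())

    with ThreadPoolExecutor(max_workers=workers) as executor:
        # executor.map yields results here, so all writes stay on one thread.
        for user in executor.map(task, logins):
            conn.execute(
                "INSERT OR REPLACE INTO users VALUES (?, ?)",
                (user["login"], user["followers"]),
            )
            conn.executemany(
                "INSERT OR REPLACE INTO repositories VALUES (?, ?, ?)",
                [(user["login"], r["name"], r["stars"]) for r in user["repos"]],
            )
    conn.commit()
    return conn


def fake_fetch(login, token):
    # Stand-in for sending USER_QUERY to the GraphQL endpoint with the token
    # in the Authorization header; returns data in the shape collect() expects.
    return {"login": login, "followers": len(login),
            "repos": [{"name": "demo", "stars": 3}]}


conn = collect(["alice", "bob"], ["token-a", "token-b"], fake_fetch)
print(conn.execute("SELECT COUNT(*) FROM users").fetchone()[0])  # 2
```

Keeping the database writes on the calling thread avoids sharing one SQLite connection across workers, so only the slow, rate-limited network calls run in parallel.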
Machine Learning & Prediction
Beyond calculation, I built a model to predict a user's h-index based on profile features (followers, repo count, star distribution).
- Model: Random Forest Regressor handled the non-linear relationships better than linear models.
- Handling Skew: The h-index distribution is heavily right-skewed. Applying a log2 transformation to the target variable compressed the long tail, which improved the fit and reduced the leverage of extreme outliers.
- Performance: The model achieved an R² of 0.94 on the test set. However, it consistently under-predicts extreme values (e.g., users with h-index > 100), indicating that star count alone doesn't capture viral impact.
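The modeling setup described above can be sketched with scikit-learn as follows. Everything here is illustrative: the data is synthetic, the feature columns and the toy target formula are invented for the example, and the transform is log2(1 + y) rather than a plain log2 (an assumption made so that an h-index of 0 stays defined). It shows the technique, not the thesis results.

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 2000

# Synthetic stand-ins for profile features (not real GitHub data).
followers = rng.lognormal(mean=2.0, sigma=1.5, size=n)
repo_count = rng.integers(1, 200, size=n)
total_stars = rng.lognormal(mean=3.0, sigma=2.0, size=n)
X = np.column_stack([followers, repo_count, total_stars])

# Toy right-skewed target: an h-index can exceed neither the number of
# repositories nor the square root of the total star count.
y = np.floor(np.minimum(repo_count, np.sqrt(total_stars)))


def log2_1p(values):
    return np.log2(1.0 + values)


def exp2_m1(values):
    return np.exp2(values) - 1.0


# The forest is trained on log2(1 + y); predictions are mapped back to the
# original h-index scale automatically by inverse_func.
model = TransformedTargetRegressor(
    regressor=RandomForestRegressor(n_estimators=100, random_state=0),
    func=log2_1p,
    inverse_func=exp2_m1,
)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)
model.fit(X_train, y_train)
pred = model.predict(X_test)

print("R^2 on held-out synthetic data:", r2_score(y_test, pred))
```

Because a random forest predicts by averaging training targets within each leaf, it cannot extrapolate beyond the largest values it has seen, which is one plausible contributor to under-prediction at the extreme end.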
Key Takeaways
- API Limits are Hard: Efficient data collection requires more than just code; it requires strategic authentication management.
- Features Matter: total_stars was the strongest predictor, while followers_count had surprisingly low importance.
- Honesty in Metrics: While the model performs well on average, it fails on the "top 1%" of users. This limitation is documented in the thesis rather than hidden.
Tech Stack
Python, GitHub GraphQL API, SQLite, Scikit-Learn, Pandas, Matplotlib