CMU exposes GitHub's "underground industry chain": 4.5 million stars are fake

#News ·2025-01-06

What? GitHub stars can be faked, and the number of fake ones has reached a staggering 4.5 million!

Most researchers publish their projects on GitHub for greater visibility, and a project's star count has long been treated as a key indicator of its popularity.

But a recent study by a CMU team, built around a detection tool called StarScout, has identified roughly 4.5 million suspected fake stars on GitHub!

Many projects use fake accounts to inflate their star counts and attract attention, and some even plant malicious code in the repository to attack researchers who try to reproduce the project.

According to the study, a repository that receives 50 stars now has about a 15% chance of being involved in star fraud.

[Image]

Paper link: https://arxiv.org/abs/2412.13459

As the famous American psychologist Donald T. Campbell put it, "The more any quantitative social indicator is used for social decision-making, the more subject it will be to corruption pressures and the more apt it will be to distort and corrupt the social processes it is intended to monitor."

The appeal of a highly starred GitHub project, which can bring worldwide visibility, naturally makes star counts subject to the same law.

The chart below, based on GitHub star data, shows how the number of repositories gaining at least 50 stars in a month (blue line) and the number of repositories with suspected fake-star activity in a month (orange bars) changed from August 2019 to August 2024.

  • Blue line (#Repos>=50 stars): the number of GitHub repositories that received at least 50 stars in a given month.
  • Orange bars (#Repos w. Suspected Campaigns): the number of repositories with suspected fake-star activity in a given month.

[Image]

As the chart shows, while the number of repositories with at least 50 stars has remained broadly stable, the number with suspected fake-star activity has risen clearly in recent years, especially in 2024.

A Google search for "buy GitHub star" turns up a number of providers. As the table below shows, they openly list the price per star, the minimum purchase quantity, and the delivery time.

Even more striking, many providers claim they can deliver the purchased stars within a few hours, or even instantly.

[Image]

As a result, repositories can buy stars to support hacking attacks, spam, and padded résumés, or even to spread malware for illegal profit.

For example, this project has 111 stars, but 109 of them are actually fake. The project's README (top left) presents it as a blockchain application, but when run, its code (bottom) uses a hidden spawn call to fetch a remote file and execute a script, disguised as a seemingly legitimate JavaScript package, that steals your cryptocurrency.

Ironically, the project has a single issue, presumably opened by a victim, warning others about the hidden malware.

[Image]

Genuine developers, however, find the practice baffling and object to it.

"I'm confused as to why anyone would want to buy fake GitHub stars. I mean, what's the point of having so many fake accounts following you instead of real people?"

[Image]

How do I know if a star is fake?

The two word clouds below show the names of GitHub repositories involved in fake-star campaigns: first the repositories that have been deleted, then those that still exist.

[Image]

[Image]

Words such as auto, bot, 2024, telegram, and free are common in the names of repositories suspected of star fraud.

Most of the deleted repositories appear to be pirated software, cryptocurrency bots (pixel-wallet-bot-free, Solana-Sniper-Bot), or game cheats (GTA5-cheat).

The table below lists the main characteristics of the GitHub accounts involved in star fraud. The most common traits are no GitHub organization, no company affiliation, and no personal website.

In other words, if an account uses the default profile picture, belongs to no GitHub organization, and lists no affiliation or website in its profile, and its repository name also contains words from the two word clouds above, then that repository is quite likely to be involved in star fraud, or even built for outright scams and hacking.

[Image]
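The profile traits and name keywords described above can be turned into a rough screening checklist. The sketch below is only an illustration: the `Profile` fields and the `red_flags` helper are hypothetical names rather than part of StarScout or the GitHub API, and the keyword list is taken from the words quoted in this article.

```python
from dataclasses import dataclass

# Keywords the article reports as common in suspicious repository names.
SUSPICIOUS_WORDS = ("auto", "bot", "2024", "telegram", "free")

@dataclass
class Profile:
    """Hypothetical summary of a GitHub account's public profile."""
    has_custom_avatar: bool
    organizations: int
    company: str
    website: str

def red_flags(profile, repo_name):
    """Return the list of warning signs matched by this account and repository."""
    flags = []
    if not profile.has_custom_avatar:
        flags.append("default avatar")
    if profile.organizations == 0:
        flags.append("no organization")
    if not profile.company:
        flags.append("no company")
    if not profile.website:
        flags.append("no website")
    # Crude substring match: "bot" would also match "robotics".
    hits = [w for w in SUSPICIOUS_WORDS if w in repo_name.lower()]
    if hits:
        flags.append("name contains: " + ", ".join(hits))
    return flags

print(red_flags(Profile(False, 0, "", ""), "pixel-wallet-bot-free"))
```

A long list of flags is only a reason to look more closely. Plenty of legitimate accounts have sparse profiles, so this checklist cannot establish fraud on its own.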

As for how stars are traded, earlier research shows that the GitHub star black market operates in at least three ways:

  • Merchants publicly sell GitHub stars on their own websites, in instant messaging apps, or on e-commerce platforms such as Taobao.
  • GitHub users form exchange platforms (such as GitStar, or instant messaging groups) and star one another's repositories.
  • A repository directly rewards the audience of its advertising campaign with gifts for starring it (as happened with OceanBase).

All of these practices appear to violate GitHub's Acceptable Use Policies, which prohibit:

  • Inauthentic interactions, such as fake accounts and automated inauthentic activity
  • Rank abuse, such as automated starring or following
  • Activity incentivized by rewards such as cryptocurrency, tokens, points, gifts, or other giveaways

In all three cases above, the researchers consider the purchased, exchanged, or incentivized stars to be fake: they are artificially inflated and do not reflect genuine appreciation, use, or bookmarking of the repository by real GitHub users.

StarScout Design

[Image]

Overview of StarScout

Overall, StarScout runs distributed algorithms over GHArchive to find two features of anomalous starring behavior in GitHub's history, the low-activity feature and the synchronous feature, both of which are likely to indicate fake stars.

Specifically, the low-activity feature identifies stars from accounts that go inactive after starring one or a few repositories. The synchronous feature identifies stars from a cluster of n accounts that repeatedly act together to star a cluster of m repositories within a short time window ∆t.

Drawing a clear line between fake and real stars is not easy. Special cases add to the difficulty, such as a GitHub tutorial repository that asks readers to star it as one of the tutorial steps.

StarScout uses a bipartite graph of users and repositories (the stargazer bipartite graph) to detect these features.

For low-activity detection, StarScout identifies accounts that have only one WatchEvent (that is, they starred exactly one repository) and at most one additional event (such as a ForkEvent) on the same day.

Although such accounts may be throwaway bots controlled by fake-star vendors, they may also be real users flagged by mistake, such as someone with a legitimate account who stopped using GitHub after starring a single repository.

To reduce such false positives, StarScout only considers repositories that have at least 50 suspected fake stars.
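The low-activity rule can be sketched in a few lines of Python. This is a single-machine illustration under stated assumptions, not StarScout's distributed implementation: the `(actor, repo, event_type, date)` tuple layout and the function name are hypothetical, and "at most one additional event on the same day" is read here as "any extra event must share the star's date".

```python
from collections import defaultdict

def low_activity_fake_stars(events, min_suspects=50):
    """Map each repo to its suspected throwaway stargazers.

    events: iterable of (actor, repo, event_type, date) tuples (assumed layout).
    Only repos with at least `min_suspects` suspected accounts are returned.
    """
    by_actor = defaultdict(list)
    for actor, repo, event_type, date in events:
        by_actor[actor].append((repo, event_type, date))

    suspects_by_repo = defaultdict(set)
    for actor, evs in by_actor.items():
        stars = [e for e in evs if e[1] == "WatchEvent"]
        others = [e for e in evs if e[1] != "WatchEvent"]
        # Exactly one star, and at most one other event in the whole history.
        if len(stars) != 1 or len(others) > 1:
            continue
        star_repo, _, star_date = stars[0]
        # The optional extra event must fall on the same day as the star.
        if others and others[0][2] != star_date:
            continue
        suspects_by_repo[star_repo].add(actor)

    # Repo-level threshold that limits the impact of misjudged real users.
    return {repo: accounts for repo, accounts in suspects_by_repo.items()
            if len(accounts) >= min_suspects}
```

The repository-level threshold does the real work here: a single inactive stargazer proves nothing, but dozens of them concentrated on one repository are hard to explain by chance.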

Fake-star vendors cannot easily avoid leaving these traces: whatever obfuscation they use, their accounts are usually either newly registered throwaways or must star many repositories within a short period to meet delivery promises.

Mathematically, all stars on GitHub can be modeled as a bipartite graph: each user and each repository is a node, each star is an edge between them, and the time of the star is an attribute of that edge.

If a fake-star vendor controls a set of n accounts that star m repositories within the promised delivery time, they leave what is called an <n, m, ∆t, ρ> temporally coherent near-bipartite core in the stargazer bipartite graph.

Previous studies have shown that such near-bipartite cores rarely form naturally in online social networks and are strongly associated with fraudulent activity.

However, finding the maximum near-bipartite core is NP-hard.

StarScout therefore re-implements CopyCatch, a state-of-the-art distributed local search algorithm that Facebook has used to detect fake likes, and uses it to find near-bipartite cores in the stargazer bipartite graph.

CopyCatch starts from a set of seed repositories (all repositories with ≥50 stars). For each seed, it iteratively updates a time center and grows n and m to find a locally maximal near-bipartite core around that center. Finally, stars in cores larger than predefined n and m thresholds are treated as fake.
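To make the idea of growing a core around a seed concrete, here is a heavily simplified, single-machine sketch. It is not CopyCatch and not StarScout's code. The time center is fixed instead of being updated, ρ is assumed to be the minimum fraction of the core's repositories that each account must star within the window, and the same fraction is reused on the repository side for simplicity. All function names and default thresholds are hypothetical.

```python
from collections import Counter, defaultdict

def near_bipartite_core(stars, seed_repo, center, dt, rho=0.6, max_rounds=10):
    """Grow a group of accounts and repositories that star in lockstep.

    stars: iterable of (account, repo, timestamp) edges of the bipartite graph.
    Only stars with |timestamp - center| <= dt are considered.
    """
    in_window = defaultdict(set)  # account -> repos it starred inside the window
    for account, repo, ts in stars:
        if abs(ts - center) <= dt:
            in_window[account].add(repo)

    accounts, repos = set(), {seed_repo}
    for _ in range(max_rounds):
        # Keep accounts that starred at least a fraction rho of the current repos.
        new_accounts = {a for a, rs in in_window.items()
                        if len(rs & repos) >= rho * len(repos)}
        # Keep repos starred by at least a fraction rho of those accounts.
        counts = Counter(r for a in new_accounts for r in in_window[a])
        new_repos = {r for r, c in counts.items()
                     if c >= rho * len(new_accounts)} | {seed_repo}
        if (new_accounts, new_repos) == (accounts, repos):
            break  # nothing changed in this round, so the core is stable
        accounts, repos = new_accounts, new_repos
    return accounts, repos

def looks_like_campaign(accounts, repos, n=50, m=10):
    """Flag cores with at least n accounts and m repositories (assumed thresholds)."""
    return len(accounts) >= n and len(repos) >= m
```

Alternating between the account side and the repository side lets the sketch drop an ordinary user who happened to star the seed in the same window, since that user will not have starred enough of the other repositories in the core.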

While these two heuristics, for the low-activity and synchronous features, can surface significant anomalies in GitHub star data, it cannot be assumed that every repository that receives fake stars actively sought them.

For a very popular repository, for example, buying fake stars would seem pointless, yet fake accounts may deliberately star popular repositories to evade platform detection. The post-processing step therefore keeps only repositories that benefited significantly from a surge of fake stars.

To do this, StarScout aggregates star counts by month and keeps repositories that meet both of the following criteria:

(1) In at least one month, the repository received more than 50 fake stars, and fake stars made up more than 50% of the stars in that month;

(2) Over the repository's entire history, fake stars make up more than 10% of all stars.
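The two criteria translate directly into a small filter. In this minimal sketch the input is assumed to be one `(fake_stars, total_stars)` pair per month for a single repository, and the 50% share in criterion (1) is read as applying to the same month as the 50-star count. The function name is hypothetical.

```python
def benefits_from_fake_stars(monthly, min_fake=50, min_month_share=0.5,
                             min_total_share=0.1):
    """Apply the two post-processing criteria to one repository.

    monthly: list of (fake_stars, total_stars) pairs, one per month (assumed layout).
    """
    total_fake = sum(fake for fake, _ in monthly)
    total_all = sum(total for _, total in monthly)
    if total_all == 0:
        return False
    # Criterion (1): some month with more than 50 fake stars that are
    # also more than half of that month's stars.
    has_surge = any(fake > min_fake and total > 0 and fake / total > min_month_share
                    for fake, total in monthly)
    # Criterion (2): fake stars exceed 10% of all stars ever received.
    return has_surge and total_fake / total_all > min_total_share

# A small repository whose stars mostly arrived in one fake surge is kept.
print(benefits_from_fake_stars([(120, 150), (5, 100)]))
# A popular repository that merely attracted some fake accounts is filtered out.
print(benefits_from_fake_stars([(60, 100)] + [(0, 1000)] * 11))
```

Criterion (2) is what protects very popular repositories: a one-month surge of fake stars is not enough if it is a tiny share of the repository's lifetime stars.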

StarScout treats these repositories as having run fake-star campaigns, and marks the accounts that starred them during the surge months as participants in those campaigns.

In the end, StarScout detected 4.53 million fake stars, given by 1.32 million accounts, across 22,915 repositories.

[Image]

Percentage of repositories/accounts detected by StarScout that had been removed from GitHub as of October 2024

Compared with the baseline deletion rates (5.84% for repositories and 4.43% for users), the deletion rates for the detected repositories and accounts are unusually high: about 91% of the repositories and 62% of the suspected fake accounts involved in fake-star campaigns have been removed.

[Image]

By comparing the distributions of GitHub event types, the researchers found that repositories and accounts with fake-star activity are dominated by star events, with far fewer events of other types than normal repositories and accounts.

Even when the number of star events is similar, accounts and repositories with fake-star activity typically show very little Fork, Push, and Create activity and almost no Issue, PR, or Comment activity, mainly because the latter three are much harder to fake than the former three.

Can fake stars really boost popularity?

The researchers also investigated whether fake stars produce the same "Matthew effect" as real stars.

The goal was to find out whether fake stars, by inflating a repository's apparent popularity, can also attract more users to give real stars.

They formulated two hypotheses about the effect of GitHub stars:

  • H1: Accumulating real GitHub stars helps a GitHub repository gain more real GitHub stars in the future.
  • H2: Accumulating fake GitHub stars helps a GitHub repository gain more real GitHub stars in the future, but the effect is weaker than that of real stars.

To test these hypotheses, the researchers used regression models with fixed-effect or random-effect terms, which give robust estimates of the longitudinal effects of the independent variables while accounting for unobserved heterogeneity (that is, factors that may influence the outcome variable but are not measured in the model).

[Image]

As the table above shows, H1 is clearly supported: according to the fixed-effects model, a 1% increase in real stars in month t−1 is associated with an expected 0.36% increase in real stars in month t, holding all other variables constant.

Equivalently, a 1% increase in real stars in month t predicts a 0.36% increase in month t+1. The effect falls to 0.15% in month t+2 and 0.11% in all subsequent months, but it is always positive.

In other words, repositories with more real stars tend to gain more real stars in the future, echoing the "rich get richer" phenomenon common in social networks.

H2, on the other hand, is only partially supported: holding all other variables constant, a 1% increase in fake stars in month t is associated with an expected 0.08% increase in real stars in month t+1 and a 0.04% increase in month t+2.

In other words, fake stars do have a statistically significant positive effect on attracting real stars over the next two months, one that diminishes over time, but the effect is three to four times smaller than that of real stars.

However, a 1% increase in fake stars in month t is associated with an expected 0.05% decrease in real stars, on average, in month t+3 and all subsequent months.
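To get a feel for these coefficients, it helps to scale them up from a 1% change. The sketch below assumes a log-log specification, which is what the "1% is associated with x%" reading implies; the article does not state the exact model form, so treat the numbers it prints as illustrative. The coefficients are the ones quoted above, and the function name is hypothetical.

```python
# Coefficients quoted in the article (fixed-effects model).
REAL = {"next month": 0.36, "month after": 0.15, "later months": 0.11}
FAKE = {"next month": 0.08, "month after": 0.04, "later months": -0.05}

def pct_change_in_real_stars(beta, pct_increase):
    """Percent change in expected real stars, assuming a log-log model."""
    return ((1 + pct_increase / 100) ** beta - 1) * 100

for label, coeffs in (("real", REAL), ("fake", FAKE)):
    for lag, beta in coeffs.items():
        change = pct_change_in_real_stars(beta, 100)  # doubling the stars
        print(f"doubling {label} stars: {change:+.1f}% real stars, {lag}")
```

Under this assumption, doubling a repository's real stars is associated with roughly 28% more real stars the following month, while doubling its fake stars is associated with only about 6% more, followed by a small decline in later months.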

Overall, buying fake stars may help a repository gain real attention in the short term (less than two months), but the effect is three to four times weaker than that of real stars, and in the long run the effect turns negative.

Finally, the researchers stress that a GitHub repository's star count is not a reliable indicator of quality, and at the very least should not be the sole basis for high-stakes decisions.

They also advise developers not to buy fake stars to promote their projects, because it will not help.

Instead, they suggest that repository maintainers and startup founders working in open source focus strategically on making real progress on their projects rather than superficially inflating star counts. If a project is not actually high quality and well maintained, a high star count may raise its visibility for a while, but users will soon abandon it.
