Nowadays, programming has become an integral part of many career fields. App developers rely heavily on languages like Java for full-stack development of new features on mobile devices, while data analysts typically use languages like Python or R to extract insights that drive business decisions. In most cases, you would not build your work from the basic built-in data types and functions alone; well-developed libraries, packages, and modules speed up the coding process. For example, as a computational social scientist, I often use Python libraries like NumPy, SciPy, and Pandas to manage my data, and toolkits like scikit-learn, TensorFlow, and PyTorch to build and train AI models on my data.
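To illustrate how these libraries fit together, here is a toy workflow (with entirely made-up data, not from our project): pandas handles the tabular data, and scikit-learn fits a quick predictive model on it.

```python
# Toy illustration: pandas for data handling, scikit-learn for modeling.
# The data and column names are invented for this example.
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

df = pd.DataFrame({
    "hours_online": [1.0, 3.5, 2.2, 5.1, 0.7, 4.3],
    "posts_per_week": [2, 14, 6, 20, 1, 17],
    "is_active": [0, 1, 0, 1, 0, 1],  # the label we want to predict
})

X_train, X_test, y_train, y_test = train_test_split(
    df[["hours_online", "posts_per_week"]], df["is_active"],
    test_size=0.33, random_state=0,
)
model = LogisticRegression().fit(X_train, y_train)
print(model.score(X_test, y_test))  # accuracy on the held-out split
```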
As you gain experience with these third-party libraries and packages, questions naturally arise: Where can I find new packages for a particular function? How do I know whether a library is reliable? What should I do if I find bugs while using a package? And if I want to develop something of my own or collaborate with others, is there an efficient way to do so? All of these questions point to open-source communities: semi-organized collections of contributors with shared interests who collaboratively build software (libraries, modules, and packages) that can be used both inside and outside the community.
GitHub, the largest open-source community in the world, hosts millions of public open-source projects across different programming languages. Contributors join forces to develop their projects in GitHub repositories, and users can fork repositories of interest into their own accounts to reuse those toolkits in their projects. Each repository is also encouraged to include a README file that briefly introduces the project; its content is automatically rendered on the repository page. In addition, users can report problems they encounter to the developers through the Issues channel, which helps improve the toolkit's functionality over time. Users can also check social statistics such as stars and forks to gauge how well-received a repository is.
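These social statistics are easy to retrieve programmatically. As a quick illustration, the public GitHub REST API exposes star, fork, and issue counts for any public repository (the repository queried below is just an example):

```python
# Fetch social statistics for a public repository via the GitHub REST API.
import requests

resp = requests.get("https://api.github.com/repos/numpy/numpy", timeout=10)
resp.raise_for_status()
repo = resp.json()

print("stars:", repo["stargazers_count"])
print("forks:", repo["forks_count"])
print("open issues:", repo["open_issues_count"])
```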
Within this large community, some repositories are built mainly by imitating and improving functions in existing repositories. For example, Zetacoin, an experimental digital currency that operates with no central authority, is built on the foundation of Bitcoin. In contrast, other repositories develop new pipelines or frameworks to realize new functions. For instance, the well-known NLP library gensim implements many popular text mining models with Cython under the hood, making it a favorite of researchers and text analysts. Because repositories are developed in such different ways, they can exhibit very different levels of novelty, and these differences show up in many aspects of a repository, such as its source code and its README content. Although a higher level of novelty is believed to benefit entities in many business cases and technological fields, whether this pattern also holds in the open-source community is unclear. In this project, we therefore take a first step toward investigating this question.
Most previous research on open-source communities has focused on collaboration networks or descriptive statistics, which may not provide enough information about the novelty of repositories. Moreover, valuable data such as the files stored in the repositories have been inaccessible to researchers or simply neglected. Recently, scholars at the Knowledge Lab of the University of Chicago collected large-scale data on all public GitHub repositories active in 2019. The data cover almost every aspect of a repository, from its source code to its contributor list. For this project, we chose repositories written mainly in Python or Java (with at least 50% of their files in that language) as our two target sets. To make the most of the different data sources, we used an embedding-based approach that represents the information as numeric vectors for our novelty analysis. Specifically, the project has three primary steps:
- Representation Learning: generating embeddings that represent GitHub repositories from different types of data (README text, source code, and the co-contributor network); a minimal sketch follows this list
- Embedding Evaluation: evaluating the quality of the embeddings intrinsically (embedding-space visualizations) and extrinsically (a multi-label classification task on topic tags); see the second sketch below
- Novelty Analysis: measuring a novelty score for each repository with the best-performing embeddings and exploring the relationship between novelty and popularity; see the third sketch below
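To make the first step concrete, here is a minimal sketch of embedding README text with gensim's Doc2Vec. This is one plausible model choice, not necessarily the exact one used in our pipeline, and the `readmes` dictionary is a made-up stand-in for the real data:

```python
# Sketch: README-text embeddings with Doc2Vec (gensim >= 4.0 API).
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from gensim.utils import simple_preprocess

readmes = {  # hypothetical {repo_name: readme_text} data
    "repo-a": "A fast numerical library for array computing.",
    "repo-b": "Deep learning framework with automatic differentiation.",
    "repo-c": "Utilities for scraping and cleaning web data.",
}

corpus = [
    TaggedDocument(words=simple_preprocess(text), tags=[name])
    for name, text in readmes.items()
]
model = Doc2Vec(corpus, vector_size=64, min_count=1, epochs=40)

vec = model.dv["repo-a"]  # 64-dimensional embedding for this repository
print(vec.shape)
```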
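For the extrinsic evaluation, one standard setup is to predict a repository's topic tags from its embedding with a one-vs-rest classifier and report a micro-averaged F1 score. The embeddings and tags below are random stand-ins, so the score itself is meaningless; only the mechanics matter:

```python
# Sketch: multi-label topic-tag classification on repository embeddings.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import MultiLabelBinarizer

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 64))                          # stand-in embeddings
tags = [["ml"], ["web"], ["ml", "nlp"], ["data"]] * 50  # stand-in topic tags

Y = MultiLabelBinarizer().fit_transform(tags)           # binary indicator matrix
X_tr, X_te, Y_tr, Y_te = train_test_split(X, Y, test_size=0.25, random_state=0)

clf = OneVsRestClassifier(LogisticRegression(max_iter=1000)).fit(X_tr, Y_tr)
print("micro-F1:", f1_score(Y_te, clf.predict(X_te), average="micro"))
```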
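Finally, one common way to operationalize novelty in an embedding space, shown here purely for illustration and possibly different from our project's exact metric, is to score each repository by its average distance to its nearest neighbors: the farther a repository sits from everything else, the more novel it is. The resulting scores could then be correlated with popularity signals such as stars and forks:

```python
# Sketch: novelty as mean cosine distance to the k nearest neighbors.
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(1)
embeddings = rng.normal(size=(500, 64))  # stand-in repository embeddings

k = 10
nn = NearestNeighbors(n_neighbors=k + 1, metric="cosine").fit(embeddings)
dists, _ = nn.kneighbors(embeddings)     # column 0 is each point's self-distance

novelty = dists[:, 1:].mean(axis=1)      # mean distance to the k true neighbors
print("most novel repo index:", novelty.argmax())
```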
If you are interested, continue reading to learn the details and findings of these three steps!
Read our full blog here.