Most of my Master’s research involves a dataset of published academic papers in the computer science domain. I am applying various ranking algorithms to this dataset to identify important papers and authors. In the course of this research I looked at some trends in how researchers have published articles over time. I am writing this blog post because I have stumbled across an anomaly in this dataset that I have so far failed to explain, and I want to share the findings. If you believe you have an explanation for this anomaly, which I will describe shortly, please leave a comment or contact me directly.
The dataset I am currently using contains information about published papers, their authors, and a reference list for each paper. It is relatively large: 4.8 million papers, which are either journal articles or articles published in conference proceedings, over 21 million citations, and nearly 4 million authors. For the following experiments I use a subset of this dataset, namely the papers that fall into the computer science domain. This subset contains 1,836,500 papers with 20 million references. Unfortunately, the dataset and additional information about it cannot be disclosed because of restrictive conditions.
Some Publication Trends
So that this blog post is not only about problems, I will begin by showing some general trends in this dataset that might be of interest. These trends are relatively obvious and can easily be explained. If you are only interested in the anomaly, you can skip this first part.
First of all, single-authored papers are slowly but surely dying out. As you can see from the first graph, the percentage of single-authored papers dropped below 10% after 2008.
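For readers who want to reproduce this kind of curve on their own data, here is a minimal sketch of how the yearly share of single-authored papers could be computed. Since the dataset itself cannot be disclosed, the record layout (paper id, year, author list), the sample values, and the function name `single_author_share` are all assumptions made for illustration.

```python
from collections import defaultdict

# Hypothetical records: (paper_id, publication_year, list_of_author_ids).
# The real dataset's schema is not public; this layout is an assumption.
papers = [
    ("p1", 1990, ["a1"]),
    ("p2", 1990, ["a1", "a2"]),
    ("p3", 2009, ["a2", "a3", "a4"]),
    ("p4", 2009, ["a3"]),
    ("p5", 2009, ["a1", "a4"]),
    ("p6", 2009, ["a2", "a5"]),
]


def single_author_share(papers):
    """Return {year: percentage of that year's papers with exactly one author}."""
    total = defaultdict(int)
    single = defaultdict(int)
    for _paper_id, year, authors in papers:
        total[year] += 1
        if len(authors) == 1:
            single[year] += 1
    return {year: 100.0 * single[year] / total[year] for year in sorted(total)}


share = single_author_share(papers)
print(share)
```

Plotting the returned percentages against the year gives a curve of the kind described above.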
The reasons for this trend are clear. Research areas are becoming increasingly complex and often require expert knowledge that is not easily acquired. Moreover, by collaborating a researcher can publish more articles. The most important aspect, however, is that multi-authored papers, on average, receive higher citation counts. In other words, it pays off to publish multi-authored papers (Aksnes, 2003).
This is also true for the dataset under consideration. The above graph clearly indicates that, on average, papers with more authors receive more citations.
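The comparison behind such a graph is a simple group-by: papers are grouped by their number of authors and the citation counts in each group are averaged. The sketch below shows one way to do this. The input layout, the sample numbers, and the function name `mean_citations_by_team_size` are hypothetical and not taken from the actual dataset.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical records: (number_of_authors, citation_count), one per paper.
# These values are made up for illustration only.
papers = [(1, 2), (1, 4), (2, 5), (2, 7), (3, 9), (3, 12), (3, 6)]


def mean_citations_by_team_size(papers):
    """Return {number_of_authors: mean citation count of papers with that many authors}."""
    groups = defaultdict(list)
    for n_authors, citations in papers:
        groups[n_authors].append(citations)
    return {n: mean(cites) for n, cites in sorted(groups.items())}


result = mean_citations_by_team_size(papers)
print(result)
```

On real data, the groups with very many authors tend to be small, so it may be worth reporting the group sizes next to the means.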
Another interesting graph is the following, which shows that the average number of authors per paper is also increasing over the years. This trend mirrors the slow decline of single-authored papers. It is interesting to note, however, that researchers in the computer science domain are becoming more like team players: the average number of authors per article is now over three.
Another interesting trend in the computer science domain is depicted by the following graph. In this graph the average ratio of journal to conference articles published by authors is plotted against the years since an author’s first publication.
As one can see, the average number of conference articles that researchers publish is always larger (at least up to x=42 active years) than the number of journal articles they publish. It would be interesting to see whether this holds true for other academic areas. Most likely this is simply a computer science trend, since conferences are very important in the computer science domain. From the above graph one can also see that the longer researchers are active, the more journal articles they publish relative to conference articles.
Alright, let’s start with the weird stuff. While investigating various trends, I stumbled across the following problem: after 20 years of active research, the average number of papers published by authors increases suddenly and strongly. Here, the years of active research are the number of years since an author’s first publication. Unfortunately, additional information about authors, such as their ages and positions, is not known. Therefore, it is difficult to get a real grasp on this abrupt change in publication patterns.
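The journal-to-conference ratio per active year can be computed in more than one way. The sketch below shows one possible reading: for every author and every active year, the number of journal articles is divided by the number of conference articles, and these per-author ratios are then averaged. Two things are assumptions here, not statements about how the graph was actually produced: counts are taken per year rather than cumulatively, and author-years without any conference article are skipped to avoid dividing by zero. The record layout and all names are hypothetical as well.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical records: (author_id, publication_year, venue_type), one per article.
records = [
    ("a1", 2000, "conference"), ("a1", 2000, "conference"), ("a1", 2000, "journal"),
    ("a1", 2002, "conference"), ("a1", 2002, "journal"),
    ("a2", 2005, "conference"), ("a2", 2005, "conference"),
    ("a2", 2005, "conference"), ("a2", 2005, "conference"), ("a2", 2005, "journal"),
    ("a2", 2007, "journal"),
]


def mean_journal_to_conference_ratio(records):
    """Return {active_year: mean over authors of (journal / conference) counts."""
    # Year of each author's first publication.
    first_year = {}
    for author, year, _venue in records:
        first_year[author] = min(year, first_year.get(author, year))

    # Per (author, active year): number of journal and conference articles.
    counts = defaultdict(lambda: {"journal": 0, "conference": 0})
    for author, year, venue in records:
        counts[(author, year - first_year[author])][venue] += 1

    # Assumption: author-years without conference articles are skipped.
    ratios = defaultdict(list)
    for (_author, active_year), c in counts.items():
        if c["conference"] > 0:
            ratios[active_year].append(c["journal"] / c["conference"])
    return {x: mean(r) for x, r in sorted(ratios.items())}


ratio = mean_journal_to_conference_ratio(records)
print(ratio)
```

A ratio below 1 for an active year means that, on average, authors published more conference articles than journal articles in that year.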
The graph looks as follows for conference articles:
So after 20 years of active research, researchers on average publish 2.68 articles a year. This value suddenly increases to 3.1 articles per year the following year.
The picture is similar for journal publications, where the jump is from 1.8 to about 1.98 articles a year, as one can see in the following graph:
This anomaly does not stem from a sudden decrease in the number of authors or a rapid increase in the number of articles published, as the following graph shows. It plots the author and article counts x years after an author’s first publication.
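To make the quantities in these graphs concrete, here is a minimal sketch that derives, for every active year x, the number of authors who published in that year, the number of articles, and the resulting average. It assumes that the average is taken only over authors who published at least one article in year x, and that a co-authored article counts once for each of its authors. Neither assumption is confirmed above, and the record layout and names are hypothetical.

```python
from collections import defaultdict

# Hypothetical records: (author_id, publication_year), one per authored article.
records = [
    ("a1", 1990), ("a1", 1990), ("a1", 1991),
    ("a1", 2010), ("a1", 2010), ("a1", 2010),
    ("a2", 1995), ("a2", 1996), ("a2", 1996),
]


def output_by_active_year(records):
    """Return {x: (author_count, article_count, articles_per_author)}.

    x is the number of years since an author's first publication.
    """
    first_year = {}
    for author, year in records:
        first_year[author] = min(year, first_year.get(author, year))

    articles = defaultdict(int)
    authors = defaultdict(set)
    for author, year in records:
        x = year - first_year[author]
        articles[x] += 1
        authors[x].add(author)

    # Assumption: only authors with at least one article in year x are counted.
    return {
        x: (len(authors[x]), articles[x], articles[x] / len(authors[x]))
        for x in sorted(articles)
    }


stats = output_by_active_year(records)
for x, (n_authors, n_articles, avg) in stats.items():
    print(x, n_authors, n_articles, avg)
```

Looking at the author and article counts next to the average helps to check whether a jump in the average comes from the numerator, the denominator, or both.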
Another graph that might bring some insight is the following:
In this graph, authors are put into two distinct buckets. For each column, the left bar indicates the average number of articles published by authors who stopped publishing 15, 16, …, 25 years after their first publication. The right-hand bar is split in two. The bottom, blue bar shows the average number of articles published in the same timespan by researchers who continued publishing after that cut-off year. The top, red bar indicates the average number of articles that those same researchers published after the cut-off year. From this graph one can see that the researchers who continue publishing have a much higher continuous research output than the researchers who stop publishing earlier. It also shows that the research output before the cut-off year is the portion that contributes most heavily to the overall values.
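The bucketing for a single column of this graph could be sketched as follows. "Stopped publishing after c years" is interpreted here as "the author's last article appeared exactly c years after the first one", and authors whose last article appeared earlier than c are ignored for that column. This interpretation, the input layout, and the name `bucket_averages` are assumptions for illustration.

```python
from statistics import mean

# Hypothetical input: author_id -> list of active-year offsets, one per article
# (0 = year of the author's first publication).
careers = {
    "a1": [0, 3, 15],                         # last article at offset 15
    "a2": [0, 1, 2, 10, 15, 18, 22],          # continues after offset 15
    "a3": [0, 5, 15, 15],                     # last article at offset 15
    "a4": [0, 2, 4, 6, 8, 14, 16, 20, 21],    # continues after offset 15
}


def bucket_averages(careers, cutoff):
    """Average article counts for authors who stop at `cutoff` vs. those who continue."""
    stopped, before, after = [], [], []
    for offsets in careers.values():
        last = max(offsets)
        if last == cutoff:
            # Left bar: total output of authors who stop at the cut-off year.
            stopped.append(len(offsets))
        elif last > cutoff:
            # Right bar: output up to (blue) and after (red) the cut-off year.
            before.append(sum(1 for o in offsets if o <= cutoff))
            after.append(sum(1 for o in offsets if o > cutoff))
    return {
        "stopped": mean(stopped) if stopped else None,
        "continued_before": mean(before) if before else None,
        "continued_after": mean(after) if after else None,
    }


buckets = bucket_averages(careers, cutoff=15)
print(buckets)
```

Calling the function once for each cut-off from 15 to 25 would give one column of the graph per call.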
If anyone would like to see other types of graphs, please let me know and I will provide them.
Aksnes, DW. Characteristics of highly cited papers. Research Evaluation (2003) 12(3): 159-170.