A Small Army of Plagiarists Is Clogging Up Physics Research

According to a new analysis, one in 16 authors on arXiv is copying others' work.

The arXiv e-print server is an awe-inspiring place. It's here that most of the world's math, physics, and computer science papers are initially posted, at a rate of hundreds per day. The archive, which was founded in 1991 by Cornell physicist Paul Ginsparg, is approaching its millionth submission. Indeed, it's easy enough to become utterly lost within arXiv, both intellectually and literally. It's like a Library of Babel for the coldest, hardest sciences, and a visit often becomes a disquieting experience in the realization that almost all of it will go unseen.

Much of what is later published in leading peer-reviewed physics and math journals, like Nature, IEEE's Computer, or Applied Physics Letters, is posted on arXiv first, sometimes months or even years earlier. This is the pulse of the field, for better or for worse.

These early versions have not yet been peer reviewed, and many if not most papers in the archive won't make it to a proper journal at all. ArXiv isn't itself peer-reviewed, but papers go through an automated quality-control check, the final stage of which is an attempt to match text within a given study with text already existing in the archive.
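
As a rough illustration of what such a text-matching step might look like, the Python sketch below compares a new submission against archived text by counting shared seven-word sequences. It is not arXiv's actual code: the function names, the seven-word window, and the 20 percent threshold are all assumptions made for this example.

```python
import re


def word_ngrams(text, n=7):
    """Return the set of n-word sequences in text, lowercased."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def overlap_fraction(new_text, archived_text, n=7):
    """Fraction of the new text's n-grams that also occur in the archived text."""
    new_grams = word_ngrams(new_text, n)
    if not new_grams:
        return 0.0  # too short to compare
    shared = new_grams & word_ngrams(archived_text, n)
    return len(shared) / len(new_grams)


def flag_text_overlap(new_text, archive, threshold=0.2, n=7):
    """Return ids of archived papers whose overlap meets the (assumed) threshold.

    `archive` is a hypothetical dict mapping paper id -> full text.
    """
    return [
        paper_id
        for paper_id, text in archive.items()
        if overlap_fraction(new_text, text, n) >= threshold
    ]
```

Calling `flag_text_overlap(submission, archive)` returns the ids of archived papers that share a large fraction of word sequences with the submission. A real system would also need to discount properly quoted and cited passages, and to scale to an archive of roughly a million papers.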

A study published last week in the Proceedings of the National Academy of Sciences, authored by Ginsparg and fellow Cornell physicist Daniel Citron, gives another perspective on what turns out to be a very deep problem.

Based on this automated plagiarism check, the pair discovered that around 1 in 16 authors published in the archive used big chunks of copied material, while 1 in 1,000 were found to have copied more than a full paragraph, according to Science Insider, which received a complete copy of the dataset underlying the PNAS findings (from which the map below was created). There also appear to be serial offenders, with the 3.2 percent of flagged papers coming from around 6 percent of all authors.

When a paper is flagged on arXiv for copied material ("text overlap") it's not rejected or tossed out. It just gets a metadata tag. Some authors choose to leave the paper as is with the note, and others make corrections and resubmit. Another category responds with "indignant objection," according to the PNAS study. "Some authors have insisted that there could not possibly be text overlap," Citron and Ginsparg write, "though the heuristics in place to avoid flagging false positives have proven reliable."

Fortunately, Citron and Ginsparg also found a significant negative correlation between how much a given paper copies from elsewhere and how often that paper is itself cited in other papers. More copying means fewer citations and, thus, less impact.

"Since the articles are less frequently cited and presumably little read, it is tempting to speculate that the reused content in these articles goes largely unnoticed and undetected," they write. "Another reason that text reuse might go undetected is that the articles from which the text is copied are also less well-read."

Citron and Ginsparg's report can also be found in open-access pre-print form on arXiv, of course.