A giant free index of global research articles published online

Technologist Carl Malamud. Credit: Smita Sharma

In a project that could unlock the world’s research papers for easier computer analysis, an American technologist has published online a gigantic index of the words and short phrases contained in more than 100 million journal articles, including many paywalled papers.

The catalog, which was released on October 7 and is free to use, contains tables of more than 355 billion words and sentence fragments listed next to the articles in which they appear. It’s an effort to help scientists use software to glean information from published work, even if they have no legal access to the underlying papers, says its creator, Carl Malamud. He released the files under the auspices of Public Resource, a nonprofit corporation in Sebastopol, California, which he founded.

Malamud says that because his index does not contain the full text of articles, but only sentence snippets of up to five words, releasing it does not violate publishers’ copyright restrictions on the reuse of paywalled articles. However, a legal expert says publishers could question the legality of how Malamud created the index in the first place.

Some researchers who had early access to the index say it is a major development that will help them search the literature with software – a procedure known as text mining. Gitanjali Yadav, a computational biologist at the University of Cambridge, UK, who studies volatile organic compounds emitted by plants, says she aims to comb through Malamud’s index to produce analyses of the plant chemicals described in the world’s research papers. “There is no way for me – or anyone else – to analyze or experimentally measure the chemical footprint of every plant species on Earth. Much of the information we are looking for already exists in the published literature,” she says. But researchers are constrained by a lack of access to many papers, Yadav adds.

Malamud’s General Index, as he calls it, aims to solve the problems faced by researchers such as Yadav. Computer scientists already text mine articles to create databases of the genes, drugs and chemicals found in the literature, and to explore article content faster than a human could read. But they often note that publishers ultimately control the speed and scope of their work, and that scientists are limited to mining only open access articles, or articles to which they (or their institutions) subscribe. Some publishers have said that researchers looking to mine the text of paywalled articles need their permission.

And while free search engines like Google Scholar have indexed the text of paywalled literature with publishers’ consent, they only allow users to search with certain types of text queries, and they restrict automated searching. This does not allow for large-scale computerized analysis using more specialized searches, Malamud says.

Terabytes of data

Malamud’s project is his latest venture in a career dedicated to posting locked-down information online for free access, often in the face of legal challenges. He initially focused on publishing legal and financial information produced by governments. But more recently, he has turned his attention to opening up the scientific literature.

He started with a project to allow scientists to text mine – but not read – a giant store of research papers that he maintains on a server in India, an idea he says he’s still working on. The General Index now allows anyone to mine scientific work, but it does not have its own web search portal, so scientists who want to search it will have to download its files and develop their own programs. Malamud hopes that users will make any search engines they create available to others.

In its compressed format, the catalog totals nearly 5 terabytes, and it expands to 38 terabytes when uncompressed. In addition to sentence fragments, the files also include tables of nearly 20 billion keywords from the literature, and tables of article titles, authors, and DOIs (article identifiers), so that users can find a complete article if they have access to it.
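To make the structure described above concrete, here is a minimal Python sketch of how an index of fragments of up to five words, keyed to article identifiers, could be built and queried. It is an illustration only: the function names, the in-memory dictionary and the example DOIs and sentences are hypothetical, and nothing here reflects the actual file format or schema of the General Index.

```python
from collections import defaultdict

MAX_WORDS = 5  # the article says fragments run up to five words


def fragments(text, max_words=MAX_WORDS):
    """Yield every run of 1 to max_words consecutive words in text."""
    words = text.lower().split()
    for n in range(1, max_words + 1):
        for start in range(len(words) - n + 1):
            yield " ".join(words[start:start + n])


def build_index(articles):
    """Map each fragment to the set of article identifiers containing it."""
    index = defaultdict(set)
    for doi, text in articles.items():
        for fragment in fragments(text):
            index[fragment].add(doi)
    return index


# Hypothetical identifiers and sentences, invented for this example.
articles = {
    "10.0000/example.1": "Plants emit volatile organic compounds under heat stress",
    "10.0000/example.2": "Volatile organic compounds were measured in urban air",
}

index = build_index(articles)

# Look up a phrase and list the articles in which it appears.
hits = sorted(index["volatile organic compounds"])
print(hits)
```

Because no fragment is longer than five words, a lookup of this kind can point a user to the relevant articles without the index holding any article’s full text.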

Michael Carroll, a legal researcher at the American University Washington College of Law in Washington DC, says the distribution of the index should be legal worldwide because the files do not copy enough of an underlying article to violate the publisher’s copyright – although laws vary by country. “Copyright does not protect facts and ideas, and those results would be treated as a communication of facts derived from the analysis of copyrighted articles,” he says.

The only legal question, Carroll adds, is whether Malamud obtained and copied the underlying documents without violating the publishers’ terms. Malamud says he had to obtain copies of the 107 million articles referenced in the index to create it; he declined to say how, but points out that researchers will not have access to the full texts of the articles, which are stored in a secure and undisclosed location in the United States.

“I am very confident that what I am doing is legal. We are not doing this to provoke a lawsuit, we are doing it to advance science,” he says.

Nature contacted six publishers about the General Index for this article: all but one declined to comment. In a statement, Springer Nature said the company supports open research initiatives that use technology and algorithms to meet the needs of researchers. “We have seen some initiatives run into problems, however, when the necessary rights have not been secured to enable their sustainability,” the statement added. (Springer Nature publishes this journal; Nature’s news team is editorially independent of its publisher.)

Another legal researcher, Arul George Scaria of National Law University in Delhi, says any publisher who tries to use copyright laws to prevent researchers from using the General Index “will end up being disappointed.” The publication of the index, says Scaria, is a “major development for the wealth of information it has unlocked from those 107 million journal articles.”

