PDFDataExtractor
PDFDataExtractor is a toolkit for automatically extracting semantic information from PDF files of scientific articles. It uses a template-based architecture, with more templates under development, and currently supports articles from the following publishers and journals:
- Elsevier
- Royal Society of Chemistry
- Advanced Materials family (Wiley)
- Angewandte
- Chemistry A European Journal
- American Chemical Society
- Springer (Temporarily unavailable)
This guide provides a quick tour through PDFDataExtractor concepts and functionalities.
Features
- Extract metadata from scientific PDFs, including title, author, abstract, journal name, journal year, journal volume, journal page number, DOI, keywords, figure captions, section titles, heading, page number and references
- Chemistry-aware PDF information extraction
- Output PDF articles as plain text or JSON
- Extract articles from mainstream chemistry and physics publishers with high precision
- Automated publisher detection
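To illustrate how automated publisher detection and JSON output could fit together in a template-based design, here is a minimal, self-contained sketch. It is not the PDFDataExtractor API: the names `ArticleMetadata`, `PUBLISHER_MARKERS`, `detect_publisher` and `to_json`, as well as the marker strings, are hypothetical and chosen only for illustration. It assumes the text of the first page has already been extracted from the PDF.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List, Optional


@dataclass
class ArticleMetadata:
    """Hypothetical container for a few of the metadata fields listed above."""
    title: str = ""
    authors: List[str] = field(default_factory=list)
    journal: str = ""
    year: str = ""
    doi: str = ""


# Hypothetical marker strings; a real template would use far richer layout cues.
PUBLISHER_MARKERS = {
    "Elsevier": ("elsevier",),
    "Royal Society of Chemistry": ("royal society of chemistry", "rsc.org"),
    "American Chemical Society": ("american chemical society", "pubs.acs.org"),
}


def detect_publisher(first_page_text: str) -> Optional[str]:
    """Return the first publisher whose marker occurs in the text, else None."""
    lowered = first_page_text.lower()
    for publisher, markers in PUBLISHER_MARKERS.items():
        if any(marker in lowered for marker in markers):
            return publisher
    return None


def to_json(metadata: ArticleMetadata) -> str:
    """Serialise the extracted metadata as a JSON string."""
    return json.dumps(asdict(metadata), indent=2)


if __name__ == "__main__":
    page = "Contents lists available at ScienceDirect. (c) 2020 Elsevier B.V."
    print(detect_publisher(page))
    print(to_json(ArticleMetadata(title="Example title", authors=["A. Author"])))
```

In a design like this, the detected publisher name would select which extraction template to apply, so adding support for a new publisher means adding a new entry and template rather than changing the detection logic.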
Developing Features
- Web services for a more user-friendly experience
- Support for more publishers
Citing
PDFDataExtractor:
Zhu, M. and Cole, J., 2022. PDFDataExtractor: A Tool for Reading Scientific Text and Interpreting Metadata from the Typeset Literature in the Portable Document Format. Journal of Chemical Information and Modeling, 62(7), pp.1633-1643.
This project was financially supported by the Science and Technology Facilities Council (STFC), the Royal Academy of Engineering (RCSRF1819710), and BASF.