Default extraction

PDFDataExtractor automatically choose the template to use to perform the extraction.

Text Mode

PDFDataExtractor outputs pure text by default, following below to use this feature:

# Import PDFDataExtractor
from pdfdataextractor import Reader

# Spefify the path to the PDF file
path = r'the path to the PDF file'

# Create an instance
file = Reader()

# Read the file
pdf = file.read_file(path)

# Get pure text
pdf.plaintext()

Semantic Mode

PDFDataExtractor can also output results as semantic information, following below to use this feature:

# Import PDFDataExtractor
from pdfdataextractor import Reader

# Spefify the path to the PDF file
path = r'the path to the PDF file'

# Create an instance
file = Reader()

# Read the file
pdf = file.read_file(path)

# Get Caption
pdf.caption()

# Get Keywords
pdf.keywords()

# Get Title
pdf.title()

# Get DOI
pdf.doi()

# Get Abstract
pdf.abstract()

# Get Journal
pdf.journal()

# Get Journal Name
pdf.journal('name')

# Get Journal Year
pdf.journal('year')

# Get Journal Volume
pdf.journal('volume')

# Get Journal Page
pdf.journal('page')

# Get Section titles and corresponding text
pdf.section()

# Get References
pdf.reference()

Chemistry Mode

You can use the flag “chem=Ture” to instruct the function to carry out chemistry related information extraction at the same time when extracting metadata, using ChemDataExtractor

# Path to the PDF file
file = r'path to the file'

# Create an instance
reader = Reader()

# Read the file
pdf = reader.read_file(file)

# Show extracted chemical information
r = pdf.abstract(chem=True)
r.records.serialize()

Image Mode (Temporarily unavailable)

PDFDataExtractor can also export images in the PDF, following below to use this feature:

# Import PDFDataExtractor
from pdfdataextractor import Reader

# Spefify the path to the PDF file
path = r'the path to the PDF file'

# Create an instance
file = Reader()

# Read the file
pdf = file.read_file(path)

# To access a specific image
pdf.iamge()[0]

Known Issues

In ACS

In ACS, a few journals have two section title styles existing at the same time, namely: numbered one and ■ one. This could confuse the title filtration function because two styles have largely different font sizes. But this won’t affect reference extraction
Reference extracted might not be in order
Parts of extracted reference could be missing

In Elesvier

Potentially weak journal extraction leads to missing journal information
Unnumbered references can be messy

In RSC

Title can be missing
Journal year, volume and page numbers can be missing in certain articles
Some section titles can be missed but reference section remains solid

In Advanced Family

Reference entries can be mixed
Keywords can be found inside reference entries, roughly 1 in 20
Some authors place their bio at the very end, such words are not excluded from reference at the moment

In CAEJ

Keywords can be incomplete
In AngewandteKeywords might not be in order