PDFigCapX and FigSplit - a Pipeline for Extracting Figures, SubFigures and Captions from Scientific Publications, Pengyuan Li


Figures and captions convey essential information in scientific publications. As such, there is growing interest in storing, browsing and mining published figures as a source of knowledge. Notably, the first fundamental step, namely extracting figures and captions from scientific publications, is neither well studied nor well addressed. Moreover, the vast majority of published figures are compound images consisting of multiple panels, each of which potentially conveys a different type of information, so segmenting such images into their constituent panels is another necessary step toward displaying and utilizing published images. We introduce an effective pipeline comprising two systems: PDFigCapX, for identifying and extracting figures and captions from biomedical documents, and FigSplit, for splitting the extracted compound figures into their constituent subfigures. We have tested both systems on existing and newly assembled datasets. The extensive experimental results demonstrate significant improvement over other state-of-the-art methods. Our proposed pipeline thus addresses the essential need for extracting figures, subfigures and captions from scientific publications. PDFigCapX and FigSplit are publicly available at https://www.eecis.udel.edu/~compbio/PDFigCapX and https://www.eecis.udel.edu/~compbio/FigSplit.

