Sitemap

A list of all the posts and pages found on the site. For you robots out there, an XML version is available for digesting as well.

Pages

A Heuristic Baseline Method for Metadata Extraction from Scanned Electronic Theses and Dissertations

Published in ACM/IEEE Joint Conference on Digital Libraries, 2020

This is the paper for a poster that was accepted to the ACM/IEEE Joint Conference on Digital Libraries 2020 and received a Best Poster Award Honorable Mention.

Recommended citation: Choudhury, Muntabir Hasan and Wu, Jian and Ingram, William A. and Fox, Edward A. "A Heuristic Baseline Method for Metadata Extraction from Scanned Electronic Theses and Dissertations." Proceedings of the ACM/IEEE Joint Conference on Digital Libraries in 2020. https://lamps-lab.github.io/files/poster-ps316-002.pdf

Posts

Regular Expression — A Powerful Tool to Parse Text with Visually Identifiable Patterns

Published:

In the previous blog post, I discussed how tesseract-OCR performed on scanned Electronic Theses and Dissertations (ETDs). As described there, the process started by converting the cover page of each scanned ETD into an image. Tesseract-OCR was then applied, and the extracted text was saved into text files. We also saw that OpenCV OCR failed on scanned ETDs. We could have tried a widely used open-source tool such as GROBID, which is designed for scholarly papers. However, this article shows that GROBID is intended for extracting bibliographic metadata from born-digital academic papers. We therefore decided to use tesseract-OCR to extract the text from the cover pages of scanned ETDs. Afterward, a series of regular expressions (RegEx) was applied to extract seven metadata fields, including titles, authors, academic programs, institutions, advisors, and years. In this blog post, I introduce how RegEx can be a powerful tool for quickly parsing text with visually identifiable patterns.
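To make the idea concrete, here is a minimal sketch of applying a series of regular expressions to the OCR text of a cover page. It is not the code from the post or the paper: the sample cover text, the pattern set, and the function name extract_metadata are all hypothetical, and real OCR output would need more tolerant patterns.

```python
import re

# Hypothetical OCR output of a cover page (invented for illustration).
SAMPLE_COVER_TEXT = """A STUDY OF EXAMPLE SYSTEMS
by
Jane Q. Doe
A Dissertation Submitted to the Faculty of
Example State University
in Partial Fulfillment of the Requirements for the Degree of
Doctor of Philosophy
Department of Computer Science
Advisor: Dr. John Roe
May 1998
"""

FLAGS = re.IGNORECASE | re.MULTILINE

# One assumed pattern per field; each captures its result in the group "value".
PATTERNS = {
    # Everything before a line that contains only "by" is treated as the title.
    "title": re.compile(r"\A(?P<value>.+?)^\s*by\s*$", FLAGS | re.DOTALL),
    # The line right after "by" is treated as the author.
    "author": re.compile(r"^by\s*\n\s*(?P<value>[^\n]+)$", FLAGS),
    # First line mentioning a university, institute, or college.
    "institution": re.compile(
        r"^(?P<value>.*\b(?:University|Institute|College)\b.*)$", FLAGS
    ),
    "program": re.compile(r"^Department of\s+(?P<value>.+)$", FLAGS),
    "advisor": re.compile(r"^Advis[oe]r\s*:\s*(?P<value>.+)$", FLAGS),
    # First four-digit year starting with 19 or 20.
    "year": re.compile(r"\b(?P<value>(?:19|20)\d{2})\b"),
}


def extract_metadata(text):
    """Return a dict mapping each field to its first match, or None."""
    fields = {}
    for name, pattern in PATTERNS.items():
        match = pattern.search(text)
        # Collapse internal whitespace so multi-line titles become one line.
        fields[name] = " ".join(match.group("value").split()) if match else None
    return fields


if __name__ == "__main__":
    for field, value in extract_metadata(SAMPLE_COVER_TEXT).items():
        print(f"{field}: {value}")
```

Keeping the patterns in a dictionary makes it easy to add or tune one field at a time, and a field whose pattern does not match is reported as None rather than raising an error.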

OCR Tools Experiment on Scanned Electronic Theses and Dissertations (ETDs)

Published:

A thesis or dissertation is a type of scholarly work showing that a student pursuing higher education has successfully met part of the requirements for a degree. An electronic thesis or dissertation can be found in either a university's electronic theses and dissertations (ETDs) digital library or ProQuest (a third-party ETD repository). ETDs contain rich metadata that can be used to search for them in a repository. However, not all ETD metadata are available, so it is necessary to extract metadata from the ETDs themselves. This extraction can be challenging, especially when the ETDs are scanned. Although many open-source tools perform satisfactorily on certain types of documents, experiments indicate that they tend to produce unacceptable errors or fail on scanned ETDs. In this blog post, I introduce a widely used optical character recognition (OCR) tool called tesseract-OCR and show how it performs on scanned ETDs.

acknowledgement

overview

Overview

Old Dominion University (ODU) is responsible for conducting research on born-digital and non-born-digital Electronic Theses and Dissertations (ETDs). Dr. Jian Wu is serving as a Co-PI for this research project. At ODU, Dr. Wu directs the Lab for Applied Machine Learning and NLP System (LAMP-SYS), which conducts research on mining scholarly big data, including various challenges in Natural Language Processing, particularly in the academic domain. As a Co-PI on this project, Dr. Wu is responsible for data acquisition and analysis. He is currently supervising two graduate research assistants, who collect data, populate ETD databases, perform analysis, information extraction, and classification, and prepare data for downstream training and searching.

people

project

Project

Metadata Extraction from scanned ETDs

by Muntabir Hasan Choudhury

publications

resources