Sitemap

A list of all the posts and pages found on the site. For you robots out there, an XML version is also available for digesting.

A Heuristic Baseline Method for Metadata Extraction from Scanned Electronic Theses and Dissertations

Published in ACM/IEEE Joint Conference on Digital Libraries, 2020

This is the paper for a poster that was accepted to the ACM/IEEE Joint Conference on Digital Libraries 2020 and received a Best Poster Award Honorable Mention.

Recommended citation: Choudhury, Muntabir Hasan and Wu, Jian and Ingram, William A. and Fox, Edward A. "A Heuristic Baseline Method for Metadata Extraction from Scanned Electronic Theses and Dissertations." Proceedings of the ACM/IEEE Joint Conference on Digital Libraries, 2020. https://lamps-lab.github.io/files/poster-ps316-002.pdf

Page Not Found

Page not found. Your pixels are in another canvas.

About

Acknowledgement

Archive Layout with Content

Posts by Category

Posts by Collection

CV

Markdown

Page not in menu

This is a page not in the main menu.

Overview

Page Archive

People

Metadata Extraction from scanned ETDs

Publications

Resources

Sitemap

Posts by Tags

Talk map

Talks and presentations

Teaching

Terms and Privacy Policy

Blog posts

Jupyter notebook markdown generator

Posts

Regular Expression — A Powerful Tool to Parse Text with Visually Identifiable Patterns

Published: June 07, 2020

In the previous blog post, I discussed how tesseract-OCR performed on scanned Electronic Theses and Dissertations (ETDs). As described there, the process started by converting the cover page of each scanned ETD into an image. Tesseract-OCR was then applied, and the extracted text was saved to text files. We also saw that OpenCV OCR failed on scanned ETDs. We could have tried a widely used open-source tool such as GROBID, which is designed for scholarly papers. However, this article shows that GROBID is intended for extracting bibliographic metadata from born-digital academic papers. We therefore decided to use tesseract-OCR to extract the text from the cover pages of scanned ETDs. Afterward, a series of regular expressions (RegEx) was applied to extract seven metadata fields, including titles, authors, academic programs, institutions, advisors, and years. In this post, I introduce how RegEx can be a powerful tool for quickly parsing text with visually identifiable patterns. Read More..

OCR Tools Experiment on Scanned Electronic Theses and Dissertations (ETDs)

Published: May 19, 2020

A thesis or dissertation is a type of scholarly work showing that a student pursuing higher education has successfully met part of the requirements for a degree. An electronic thesis or dissertation can be found in either a university’s electronic theses and dissertations (ETDs) digital library or ProQuest (a third-party ETD repository). ETDs carry rich metadata that can be used to search for them in a repository. However, not all ETD metadata are available, so it is often necessary to extract metadata from the ETDs themselves. Extraction can be challenging, particularly when the ETDs are scanned documents. Although many open-source tools perform satisfactorily on certain types of documents, experiments indicate that they tend to produce unacceptable errors or fail on scanned ETDs. In this blog post, I introduce a widely used optical character recognition (OCR) tool called tesseract-OCR and show how it performs on scanned ETDs. Read More..

acknowledgement

overview

Overview

Old Dominion University (ODU) is responsible for conducting research on born-digital and non-born-digital Electronic Theses and Dissertations (ETDs). Dr. Jian Wu is serving as a Co-PI for this research project. At ODU, Dr. Wu directs the Lab for Applied Machine Learning and NLP System (LAMP-SYS), which conducts research on mining scholarly big data, including various challenges in Natural Language Processing, particularly in the academic domain. As a Co-PI on this project, Dr. Wu is responsible for data acquisition and analysis. He is currently supervising two graduate research assistants, who collect data; populate ETD databases; perform analysis, information extraction, and classification; and prepare data for downstream training and searching.

people

project

Project

Metadata Extraction from scanned ETDs

by Muntabir Hasan Choudhury

publications

A Heuristic Baseline Method for Metadata Extraction from Scanned Electronic Theses and Dissertations

Published in ACM/IEEE Joint Conference on Digital Libraries, 2020

This is the paper for a poster that was accepted to the ACM/IEEE Joint Conference on Digital Libraries 2020 and received a Best Poster Award Honorable Mention.

ETDMiner
