Extracting information from PDF documents for use in automatic indexing of e-books

Full metadata record
Metadata | Description | Language
Author(s): dc.contributor | Univ Murcia | -
Author(s): dc.contributor | Universidade Estadual Paulista (UNESP) | -
Author(s): dc.contributor | Univ Fed Para | -
Author(s): dc.creator | Gil-Leiva, Isidoro | -
Author(s): dc.creator | Fujita, Mariangela Spotti Lopes | -
Author(s): dc.creator | Redigolo, Franciele Marques | -
Author(s): dc.creator | Saran, Jordan Ferreira | -
Date accepted: dc.date.accessioned | 2025-08-21T19:52:15Z | -
Date available: dc.date.available | 2025-08-21T19:52:15Z | -
Date submitted: dc.date.issued | 2022-11-29 | -
Date submitted: dc.date.issued | 2021-12-31 | -
Full source of the material: dc.identifier | http://dx.doi.org/10.1590/2318-0889202234e210069 | -
Full source of the material: dc.identifier | http://hdl.handle.net/11449/237794 | -
Source: dc.identifier.uri | http://educapes.capes.gov.br/handle/11449/237794 | -
Description: dc.description | The number of electronic books entering libraries in PDF format grows every day, complicating and making almost unfeasible some processes traditionally carried out manually by librarians, such as the assignment of subjects. In this context, it is necessary to design and develop applications that assist librarians. With this in mind, we present in this work an evaluation of tools for extracting information from books in PDF format that could later be used as raw material for an automatic indexing system. To do this, we carried out a first evaluation of five software tools (PDFMiner.six, PDFAct, PDF-extract, PDFExtract, and Grobid). Then, as PDFAct achieved the best performance, we carried out a second evaluation to determine its ability to identify and extract information from the books, such as titles, indexes, sections, titles of tables and graphs, and bibliographic references, all of which are relevant information for any indexing system. We conclude that none of the evaluated tools adequately extracts the different parts of PDF books, although PDFAct performed better than the rest. | -
Description: dc.description | Univ Murcia, Fac Comunicac & Documentac, Campus Univ Espinardo s n, Murcia 30100, Spain | -
Description: dc.description | Univ Estadual Paulista, Fac Filosofia & Ciencias, Programa Posgrad Ciencia Informacao, Marilia, SP, Brazil | -
Description: dc.description | Univ Fed Para, Fac Bibliotecon, Programa Posgrad Ciencia Informacao, Belem, PA, Brazil | -
Description: dc.description | Univ Estadual Paulista, Fac Filosofia & Ciencias, Programa Posgrad Ciencia Informacao, Marilia, SP, Brazil | -
Format: dc.format | 11 | -
Language: dc.language | en | -
Publisher: dc.publisher | Pontificia Universidade Catolica Campinas | -
Relation: dc.relation | Transinformacao | -
Source: dc.source | Web of Science | -
Keywords: dc.subject | Software evaluation | -
Keywords: dc.subject | PDFMiner.six | -
Keywords: dc.subject | PDFAct | -
Keywords: dc.subject | PDF-extract | -
Keywords: dc.subject | PDFExtract | -
Keywords: dc.subject | Grobid | -
Keywords: dc.subject | Automatic indexing | -
Title: dc.title | Extracting information from PDF documents for use in automatic indexing of e-books | -
File type: dc.type | e-book | -
Appears in collections: Repositório Institucional - Unesp

There are no files associated with this item.