How to convert a PDF to text (extract text from a PDF) using JavaScript

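Article outline

Requirements
1. Include the required files
2. Load the PDF
3. Extract text from a single page
4. Extract text from multiple pages
Live example
Example
Text not retrieved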

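When working with Portable Document Format (PDF) files, you may want to extract all the text of a PDF programmatically, so that users do not have to select the whole text with the mouse before doing something with it.

In this article, you will learn how to extract text from a PDF with JavaScript using pdf.js. This library is a general-purpose, web standards-based platform for parsing and rendering PDFs. The project is organized in layers, and we will use 2 of them: the core layer and the display layer. PDF.js relies heavily on Promises. If Promises are new to you, it is recommended that you become familiar with them before continuing. PDF.js is community-driven and supported by Mozilla Labs.

With that said, let's get started!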

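Requirements

Web Workers must be available in the browser (see the compatibility table).
The Promise API, which is used by nearly every method of pdf.js (a polyfill can provide support for outdated browsers).
A copy of pdf.js, obviously (it can be downloaded from the project website).

For more information about pdf.js, visit the official GitHub repository.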

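1. Include the required files

To extract text from a PDF you need at least 3 files (2 of which are loaded asynchronously). As mentioned, we will use pdf.js. The prebuilt distribution of the library consists of 2 files, pdf.js and pdf.worker.js. The pdf.js file should be included with a script tag: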

<script src="/path/to/pdf.js"></script>

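pdf.worker.js is loaded through the workerSrc property, which takes a URL and loads the worker automatically. You also need to store the URL of the PDF that you want to convert in a variable that will be used later: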

<script>
    // Path to PDF file
    var PDF_URL = '/path/to/example.pdf';
    // Specify the path to the worker
    PDFJS.workerSrc = '/path/to/pdf.worker.js';
</script>

With the required scripts in place, you can move on to extracting the text of the PDF with the following steps.
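The PDFJS global used in this article belongs to older builds of the library. As a minimal sketch, assuming a newer build that exposes a GlobalWorkerOptions object instead, the worker path could be set in a way that tolerates both. The helper name setWorkerSrc is hypothetical and not part of pdf.js:

```javascript
/**
 * Hypothetical helper (not part of pdf.js): sets the worker URL on the
 * library object that is passed in.
 * Assumption: newer builds expose GlobalWorkerOptions.workerSrc, while the
 * older builds used in this article expose workerSrc directly.
 */
function setWorkerSrc(lib, workerUrl) {
    if (lib.GlobalWorkerOptions) {
        // Newer builds keep the worker settings in a dedicated object
        lib.GlobalWorkerOptions.workerSrc = workerUrl;
    } else {
        // Older builds, as used in the rest of this article
        lib.workerSrc = workerUrl;
    }
    return lib;
}

// Possible usage in the browser, after the pdf.js script tag has loaded:
// setWorkerSrc(PDFJS, '/path/to/pdf.worker.js');
```

Passing the library object as an argument keeps the helper independent of the global name that a given build happens to expose.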

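2. Load the PDF

Import the PDF that you want to convert to text with the getDocument method of PDFJS (exposed globally once the pdf.js script has been loaded in the document). The object structure of PDF.js loosely follows the structure of an actual PDF. At the top level there is a document object, from which more information and the individual pages can be fetched.

Note

To avoid CORS issues, the PDF needs to be served from the same domain as the web document (e.g. www.yourdomain.com/pdf-to-test.html and www.yourdomain.com/pdffile.pdf). You can also load a PDF directly from base64 data, without making any request (read the docs).

Use the following code to get the PDF document: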

var PDF_URL  = '/path/to/example.pdf';

PDFJS.getDocument(PDF_URL).then(function (PDFDocumentInstance) {
    
    // Use the PDFDocumentInstance To extract the text later

}, function (reason) {
    // PDF loading error
    console.error(reason);
});

PDFDocumentInstance is an object with useful methods that we will use to extract the text from the PDF.
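The note in this step mentions loading a PDF from base64 data. A minimal sketch of that idea follows, assuming the global atob function is available (browsers and recent Node.js versions) and that getDocument accepts an object whose data property holds the bytes of the file, as described in the pdf.js documentation. The helper name base64ToBytes and the variable pdfAsBase64 are hypothetical:

```javascript
/**
 * Hypothetical helper: decodes a base64 string into a Uint8Array.
 */
function base64ToBytes(base64) {
    // atob returns a "binary string" with one character per byte
    var binary = atob(base64);
    var bytes = new Uint8Array(binary.length);

    for (var i = 0; i < binary.length; i++) {
        bytes[i] = binary.charCodeAt(i);
    }

    return bytes;
}

// "JVBERg==" is the base64 encoding of the 4 characters "%PDF",
// the marker at the start of a PDF file.
var headerBytes = base64ToBytes('JVBERg==');
console.log(Array.prototype.slice.call(headerBytes)); // logs the bytes 37, 80, 68, 70

// In the browser, with the base64 string of a real PDF (placeholder variable):
// PDFJS.getDocument({ data: base64ToBytes(pdfAsBase64) }).then(...);
```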

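3. Extract text from a single page

The PDFDocumentInstance object obtained from the getDocument method (previous step) lets you navigate through the PDF with the getPage method. This method expects the number of the page to process as its first argument (page numbers start at 1) and returns a promise that, once fulfilled, provides the page as the pdfPage variable. To reach our goal of extracting the text, we rely on the getTextContent method of pdfPage. It is promise-based as well and resolves with an object that has 2 properties:

items: Array[X]
styles: Object

We are interested in the objects stored in the items array. The array contains one or more objects (depending on the content of the PDF) with the following structure: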

{
    "dir":"ltr", "fontName": "g_d0_f2", "height": 8.9664, "width": "227.1458", "str": "When a trace call returns blabla bla ..."
}

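Do you see something interesting? That's right! The object has a str property with the text that should be drawn on the PDF. To get all the text of a page, you just need to concatenate the str properties of all the objects.

Important anti-hate note

Before you start commenting that += should be avoided for concatenating strings and that the strings should be stored in an array and joined instead: according to JSPerf benchmarks, += is the fastest approach, although not necessarily in every browser. If you don't like it, modify the code as you see fit.

The following method does exactly that. It is a simple promise-based method that resolves with the concatenated text of the page: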

/**
 * Retrieves the text of a specific page within a PDF document obtained through pdf.js
 *
 * @param {Integer} pageNum Specifies the number of the page
 * @param {PDFDocument} PDFDocumentInstance The PDF document obtained
 **/
function getPageText(pageNum, PDFDocumentInstance) {
    // Return a Promise that is resolved once the text of the page is retrieved
    return new Promise(function (resolve, reject) {
        PDFDocumentInstance.getPage(pageNum).then(function (pdfPage) {
            // The main trick to obtain the text of the PDF page, use the getTextContent method
            pdfPage.getTextContent().then(function (textContent) {
                var textItems = textContent.items;
                var finalString = "";

                // Concatenate the string of the item to the final string
                for (var i = 0; i < textItems.length; i++) {
                    var item = textItems[i];

                    finalString += item.str + " ";
                }

                // Resolve the promise with the text retrieved from the page
                resolve(finalString);
            }, reject); // Forward text extraction errors to the caller
        }, reject); // Forward page loading errors to the caller
    });
}
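For readers who prefer the array-and-join style mentioned in the note above, the concatenation step could be sketched as a standalone function. The name joinTextItems is hypothetical, and the sample items are made up for illustration; only their str property matters here:

```javascript
/**
 * Hypothetical helper: joins the str property of every text item
 * with a single space between items.
 */
function joinTextItems(textItems) {
    return textItems
        .map(function (item) { return item.str; })
        .join(' ');
}

// Made-up items that mimic the shape of textContent.items
var sampleItems = [{ str: 'Hello' }, { str: 'PDF' }, { str: 'world' }];
console.log(joinTextItems(sampleItems)); // "Hello PDF world"
```

Unlike the loop in getPageText, this variant does not leave a trailing space at the end of the string.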

Pretty simple, isn't it? Now you only need to put the previously described pieces together:

var PDF_URL  = '/path/to/example.pdf';

PDFJS.getDocument(PDF_URL).then(function (PDFDocumentInstance) {
    
    var totalPages = PDFDocumentInstance.numPages;
    var pageNumber = 1;

    // Extract the text
    getPageText(pageNumber , PDFDocumentInstance).then(function(textPage){
        // Show the text of the page in the console
        console.log(textPage);
    });

}, function (reason) {
    // PDF loading error
    console.error(reason);
});

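The text of the first page of the PDF (if there is any) should be displayed in the console. Awesome!

4. Extract text from multiple pages

To extract the text of several pages at once, we reuse the getPageText method created in the previous step, which returns a promise that resolves with the content of a page. Because asynchronous code is easy to get wrong, we fire all the promises together with Promise.all. Promise.all resolves once every promise passed to it has resolved, and it delivers the results in an array that follows the order in which the promises were provided, regardless of which one finishes first: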

var PDF_URL = '/path/to/example.pdf';

PDFJS.getDocument(PDF_URL).then(function (pdf) {

    var pdfDocument = pdf;
    // Create an array that will contain our promises 
    var pagesPromises = [];

    for (var i = 0; i < pdf.numPages; i++) {
        // Required to prevent that i is always the total of pages
        (function (pageNumber) {
            // Store the promise of getPageText that returns the text of a page
            pagesPromises.push(getPageText(pageNumber, pdfDocument));
        })(i + 1);
    }

    // Execute all the promises
    Promise.all(pagesPromises).then(function (pagesText) {

        // Display text of all the pages in the console
        // e.g ["Text content page 1", "Text content page 2", "Text content page 3" ... ]
        console.log(pagesText);
    });

}, function (reason) {
    // PDF loading error
    console.error(reason);
});
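The ordering guarantee of Promise.all can be checked without a PDF. The sketch below is only an illustration: fakeGetPageText is a made-up stand-in for getPageText that deliberately makes page 1 finish last, and getAllPagesText is a hypothetical wrapper around the loop shown above:

```javascript
// Made-up stand-in for getPageText: page 1 resolves later than the others
function fakeGetPageText(pageNumber) {
    var delay = pageNumber === 1 ? 30 : 5;

    return new Promise(function (resolve) {
        setTimeout(function () {
            resolve('Text of page ' + pageNumber);
        }, delay);
    });
}

// Hypothetical wrapper: collects one promise per page (pages start at 1)
function getAllPagesText(numPages, pageTextGetter) {
    var pagesPromises = [];

    for (var pageNumber = 1; pageNumber <= numPages; pageNumber++) {
        pagesPromises.push(pageTextGetter(pageNumber));
    }

    return Promise.all(pagesPromises);
}

getAllPagesText(3, fakeGetPageText).then(function (pagesText) {
    // Page 1 finished last, yet it is still the first entry:
    // ["Text of page 1", "Text of page 2", "Text of page 3"]
    console.log(pagesText);
});
```

With a real document, the second argument could be a small function that calls getPageText(pageNumber, pdf).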

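Live example

Play with the following fiddle. It extracts the content of all the pages of this PDF and appends it as text to the DOM (go to the "Result" tab):

Example

The following document contains a very simple example that displays the content of every page of a PDF in the console. You only need to serve it from an HTTP server and add pdf.js, pdf.worker.js and a PDF to test with, and that's it: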

<!DOCTYPE html>
<html lang="en">

<head>
    <title></title>
    <meta charset="UTF-8">
    <meta name="viewport" content="width=device-width, initial-scale=1">
</head>

<body>
    <h1>PDF.js</h1>

    <script src="/path/to/pdf.js"></script>
    <script>
        var urlPDF = '/path/to/example.pdf';
        PDFJS.workerSrc = '/path/to/pdf.worker.js';

        PDFJS.getDocument(urlPDF).then(function (pdf) {
            var pdfDocument = pdf;
            var pagesPromises = [];

            for (var i = 0; i < pdf.numPages; i++) {
                // Required to prevent that i is always the total of pages
                (function (pageNumber) {
                    pagesPromises.push(getPageText(pageNumber, pdfDocument));
                })(i + 1);
            }

            Promise.all(pagesPromises).then(function (pagesText) {

                // Display text of all the pages in the console
                console.log(pagesText);
            });

        }, function (reason) {
            // PDF loading error
            console.error(reason);
        });


        /**
         * Retrieves the text of a specific page within a PDF document obtained through pdf.js
         *
         * @param {Integer} pageNum Specifies the number of the page
         * @param {PDFDocument} PDFDocumentInstance The PDF document obtained
         **/
        function getPageText(pageNum, PDFDocumentInstance) {
            // Return a Promise that is resolved once the text of the page is retrieved
            return new Promise(function (resolve, reject) {
                PDFDocumentInstance.getPage(pageNum).then(function (pdfPage) {
                    // The main trick to obtain the text of the PDF page, use the getTextContent method
                    pdfPage.getTextContent().then(function (textContent) {
                        var textItems = textContent.items;
                        var finalString = "";

                        // Concatenate the string of the item to the final string
                        for (var i = 0; i < textItems.length; i++) {
                            var item = textItems[i];

                            finalString += item.str + " ";
                        }

                        // Resolve the promise with the text retrieved from the page
                        resolve(finalString);
                    }, reject); // Forward text extraction errors to the caller
                }, reject); // Forward page loading errors to the caller
            });
        }
    </script>
</body>

</html>

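Text not retrieved

If you tried the code and did not get any text, your PDF probably does not contain any. What looks like text in the PDF may actually be an image, in which case the process explained in this article will not help. You can follow other approaches such as Optical Character Recognition (OCR), although it is recommended to do that on the server side rather than on the client side (see OCR with Node.js or with PHP in Symfony).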
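A quick way to tell whether a page is a candidate for OCR is to check if any of its text items holds visible characters. The following is a minimal sketch with a hypothetical helper name, hasExtractableText, which expects an object shaped like the one getTextContent resolves with:

```javascript
/**
 * Hypothetical helper: returns true when at least one text item
 * contains something other than whitespace.
 */
function hasExtractableText(textContent) {
    return textContent.items.some(function (item) {
        return typeof item.str === 'string' && item.str.trim() !== '';
    });
}

// Made-up samples: a page with real text and a page without any
console.log(hasExtractableText({ items: [{ str: 'Invoice' }] })); // true
console.log(hasExtractableText({ items: [{ str: '   ' }] }));     // false
console.log(hasExtractableText({ items: [] }));                   // false
```

If the check returns false for every page, the PDF most likely consists of images only.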

Happy coding!
