How to convert a PDF to text (extract text from a PDF) with PHP in Symfony 3

Article overview

1. Install PDF Parser
2. Extract text

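If your system works with Portable Document Format (PDF) files, its users may want to extract all the text from a PDF. They should not have to select the whole text with the mouse and then process it by hand: you can automate that step in the browser with JavaScript. If you would rather not extract the text in the browser with JavaScript, you can do it on the server side instead.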

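In this article you will learn how to extract the text of a PDF on the server side with PHP in a Symfony 3 project, using the PDF Parser library. Other libraries can also extract text, such as pdf-to-text by @spatie, which works like a charm too. PDF Parser is nevertheless the more convenient option here, because it is very easy to install and use and has no external software dependency. (If you use the pdf-to-text library by spatie, you need to install pdftotext on your machine, because that library is a wrapper around the utility.)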

Let's get started!

1. Install PDF Parser

PdfParser is a standalone PHP library that provides several tools for extracting data from a PDF file. Some of the features of PDF Parser are:

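Load and parse objects and headers
Extract metadata (author, description, etc.)
Extract text from ordered pages
Support for compressed PDFs
Support for the Mac OS Roman charset encoding
Handling of hexadecimal and octal encoding in text sections
PSR-0 compliant (autoloader)
PSR-1 compliant (code style)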

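You can even try the library out on its online demo page. The only limitation of this parser is that it cannot handle protected documents.

The preferred way to install this library is through Composer. Open a new terminal, switch to the directory of your project and run the following command: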

composer require smalot/pdfparser

If you don't want to install new libraries directly from the terminal, you can still edit your composer.json file and add the dependency manually:

{
    "require": {
        "smalot/pdfparser": "*"
    }
}

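Save the changes and then run composer install in your terminal. Once the installation finishes, you will be able to extract text from PDFs easily.

If you need more information about the PDF Parser library, visit the official repository on GitHub or the project's website.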

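2. Extract text

Extracting text with PDF Parser is straightforward. Create an instance of the Smalot\PdfParser\Parser class and load the PDF file from its absolute or relative path. Store the parsed file in a variable; the resulting object lets you work with the PDF page by page. You can extract all the text of the whole PDF at once, or extract it separately for each page.

Take a look at the following examples: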

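Note

Because you are working with Symfony, $this->get('kernel')->getRootDir() returns the kernel root directory (the /app folder in a default Symfony 3 project). As long as you are inside a controller, you can therefore reach the /web folder of your project by appending "/../web" to it.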
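
To make the note concrete, the path lookup can be wrapped in a small helper. This is only a minimal sketch: the function name resolveWebFile and its parameters are invented for illustration, and it assumes the default Symfony 3 layout in which the kernel root directory is app/ and web/ sits next to it.

```php
<?php

/**
 * Build the absolute path of a file inside the /web folder.
 *
 * Hypothetical helper (not part of Symfony or PDF Parser).
 *
 * @param string $kernelRootDir Value of $this->get('kernel')->getRootDir()
 * @param string $fileName      File name relative to /web, e.g. "example.pdf"
 *
 * @return string|null Absolute path, or null if the file is missing or unreadable
 */
function resolveWebFile($kernelRootDir, $fileName)
{
    // realpath() collapses the "/../" segment and returns false if the path does not exist
    $path = realpath($kernelRootDir . '/../web/' . $fileName);

    if ($path === false || !is_file($path) || !is_readable($path)) {
        return null;
    }

    return $path;
}
```

Inside a controller it could be called as resolveWebFile($this->get('kernel')->getRootDir(), 'example.pdf'). If the file name ever comes from user input, validate it before passing it in.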

Extract all the text from all pages

You can extract all the text of the PDF with the getText method available on the PDF instance:

<?php

namespace AppBundle\Controller;

use Sensio\Bundle\FrameworkExtraBundle\Configuration\Route;
use Symfony\Bundle\FrameworkBundle\Controller\Controller;
use Symfony\Component\HttpFoundation\Request;
use Symfony\Component\HttpFoundation\Response;

/**
 * Import the PDF Parser class
 */
use Smalot\PdfParser\Parser;


class DefaultController extends Controller
{
    /**
     * @Route("/", name="homepage")
     */
    public function indexAction(Request $request)
    {
        // The relative or absolute path to the PDF file
        $pdfFilePath = $this->get('kernel')->getRootDir() . '/../web/example.pdf';

        // Create an instance of the PDFParser
        $PDFParser = new Parser();

        // Create an instance of the PDF with the parseFile method of the parser
        // this method expects as first argument the path to the PDF file
        $pdf = $PDFParser->parseFile($pdfFilePath);
        
        // Extract ALL text with the getText method
        $text = $pdf->getText();

        // Send the text as response in the controller
        return new Response($text);
    }
}
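
The parser reports problems, such as a protected document, by throwing an exception, so a controller may want to catch it instead of letting the request fail. The following is a hedged sketch rather than the library's recommended pattern: extractTextOrNull is a hypothetical helper, and it only assumes that the object it receives exposes parseFile() returning something with getText(), as the Parser instance does in the example above.

```php
<?php

/**
 * Extract all text from a PDF, returning null instead of letting a parse error escape.
 *
 * Hypothetical helper. $parser is expected to be a Smalot\PdfParser\Parser,
 * but any object with a compatible parseFile() method works.
 *
 * @param object $parser
 * @param string $pdfFilePath
 *
 * @return string|null The extracted text, or null if parsing failed
 */
function extractTextOrNull($parser, $pdfFilePath)
{
    try {
        return $parser->parseFile($pdfFilePath)->getText();
    } catch (\Exception $e) {
        // Parsing failed (for instance, the document is protected)
        return null;
    }
}
```

In the controller you could call extractTextOrNull(new Parser(), $pdfFilePath) and return an error Response when the result is null.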

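Iterate over every page of the PDF and extract the text

If you want to handle each page of the PDF separately, you can iterate over the array of pages returned by the getPages method of the PDF instance: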

<?php

namespace AppBundle\Controller;

use Sensio\Bundle\FrameworkExtraBundle\Configuration\Route;
use Symfony\Bundle\FrameworkBundle\Controller\Controller;
use Symfony\Component\HttpFoundation\Request;
use Symfony\Component\HttpFoundation\Response;

/**
 * Import the PDF Parser class
 */
use Smalot\PdfParser\Parser;


class DefaultController extends Controller
{
    /**
     * @Route("/", name="homepage")
     */
    public function indexAction(Request $request)
    {
        // The relative or absolute path to the PDF file
        $pdfFilePath = $this->get('kernel')->getRootDir() . '/../web/example.pdf';

        // Create an instance of the PDFParser
        $PDFParser = new Parser();

        // Create an instance of the PDF with the parseFile method of the parser
        // this method expects as first argument the path to the PDF file
        $pdf = $PDFParser->parseFile($pdfFilePath);

        // Retrieve all pages from the pdf file.
        $pages  = $pdf->getPages();

        // Retrieve the number of pages by counting the array
        $totalPages = count($pages);

        // Set the current page as the first (a counter)
        $currentPage = 1;

        // Create an empty variable that will store the final text
        $text = "";
         
        // Loop over each page to extract the text
        foreach ($pages as $page) {

            // Add an HTML separator per page, e.g. Page 1/14
            $text .= "<h3>Page $currentPage/$totalPages</h3> <br>";

            // Concatenate the text
            $text .= $page->getText();

            // Increment the page counter
            $currentPage++;
        }
 
        // Send the text as response in the controller
        return new Response($text);
    }
}

You can use the getTextArray method instead of getText to retrieve the text of a page as an array (each item of the array is a new line).
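
Working with an array of lines makes it easy to clean up the text before sending it to the browser. Below is a small sketch of that idea; linesToHtml is a hypothetical helper, and it assumes $lines is the array returned by $page->getTextArray(), with one string per line.

```php
<?php

/**
 * Turn the lines of one page into an HTML fragment.
 *
 * Hypothetical helper: $lines is assumed to be the array returned by
 * $page->getTextArray(), one string per line.
 *
 * @param array $lines
 *
 * @return string
 */
function linesToHtml(array $lines)
{
    $html = array();

    foreach ($lines as $line) {
        $line = trim($line);

        // Skip lines that are empty after trimming
        if ($line === '') {
            continue;
        }

        // Escape the text so characters such as < and & are not read as markup
        $html[] = htmlspecialchars($line, ENT_QUOTES, 'UTF-8');
    }

    // Join the remaining lines with an HTML line break
    return implode("<br>\n", $html);
}
```

Inside the loop from the previous example, $text .= linesToHtml($page->getTextArray()); would replace the call to getText.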

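Extract the text of a specific page of the PDF

There is no method that accesses a page directly by its number, but you can pick the page out of the array returned by the getPages method of the PDF instance. The array is ordered in the same way as the PDF (index 0 is page #1 of the PDF), so you can access a page by reading the array at the matching index.

Note that you need to verify that the index exists in the pages array, otherwise you will get an error: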

<?php

namespace AppBundle\Controller;

use Sensio\Bundle\FrameworkExtraBundle\Configuration\Route;
use Symfony\Bundle\FrameworkBundle\Controller\Controller;
use Symfony\Component\HttpFoundation\Request;
use Symfony\Component\HttpFoundation\Response;

/**
 * Import the PDF Parser class
 */
use Smalot\PdfParser\Parser;


class DefaultController extends Controller
{
    /**
     * @Route("/", name="homepage")
     */
    public function indexAction(Request $request)
    {
        // The relative or absolute path to the PDF file
        $pdfFilePath = $this->get('kernel')->getRootDir() . '/../web/example.pdf';

        // Create an instance of the PDFParser
        $PDFParser = new Parser();

        // Create an instance of the PDF with the parseFile method of the parser
        // this method expects as first argument the path to the PDF file
        $pdf = $PDFParser->parseFile($pdfFilePath);

        // Get all the pages of the PDF
        $pages = $pdf->getPages();
        
        // Let's extract the text of the page #2 of the PDF
        $customPageNumber = 2;

        // If the page exists, then extract the text
        // Array indexes start at 0, so page #2 is stored at index 1
        $pageIndex = $customPageNumber - 1;

        if (isset($pages[$pageIndex])) {

            // Retrieve the requested page from the array
            $pageNumberTwo = $pages[$pageIndex];

            // Extract the text of the page #2
            $text = $pageNumberTwo->getText();

            // Send the text as response in the controller
            return new Response($text);

        }else{
            return new Response("Sorry the page #$customPageNumber doesn't exist");
        }
    }
}
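
The conversion from a page number to an array index is easy to get wrong by one, so it can help to isolate it in a single function. This is a minimal sketch under the assumption stated above, namely that getPages returns a zero-indexed array in document order; getPageByNumber is a hypothetical name, not a method of the library.

```php
<?php

/**
 * Return the page with the given 1-based page number, or null if it does not exist.
 *
 * Hypothetical helper: $pages is assumed to be the zero-indexed array
 * returned by $pdf->getPages().
 *
 * @param array $pages
 * @param int   $pageNumber Page number as shown to the user (first page is 1)
 *
 * @return mixed|null
 */
function getPageByNumber(array $pages, $pageNumber)
{
    // Page #1 lives at index 0, page #2 at index 1, and so on
    $index = (int) $pageNumber - 1;

    return isset($pages[$index]) ? $pages[$index] : null;
}
```

With it, the controller can call getPageByNumber($pdf->getPages(), 2) and respond with the "page doesn't exist" message whenever the result is null.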

Happy coding!
