Extracting Text from PDF using PHP

PDF (Portable Document Format) files let you store text and image content for offline use. To show a PDF online, you typically embed it in the web page with a browser-based viewer. However, the text and images inside an embedded PDF are not part of the web page's own content. Because that content is not rendered on the page itself, SEO can suffer. To work around this, extract the text from the PDF and publish it on the web page.

In PHP, you can extract content from PDF files with the PDF Parser library. This library parses a PDF file and extracts the text content from every page. Text, headers, and metadata can all be extracted from the PDF file using PHP. This tutorial shows you how to extract text from PDF files using PHP.

INSTRUCTION

The sample script below shows how to use the PDF Parser library to extract text from PDF files in PHP. We'll also demonstrate how to upload a PDF file with PHP and extract its text on the fly.

Install PDF Parser Library

To install the PDF Parser library with Composer, run the following command.

composer require smalot/pdfparser

Note that the downloadable source code for this tutorial already includes all the required library files, so if you use it you don't need to install the PDF Parser library separately. Run the Composer command only if you are setting up the library yourself.

Include Composer's autoloader in the PHP script to load the PDF Parser library and its helper functions.

include 'vendor/autoload.php';

Extract Text from PDF

The PHP code snippet that follows pulls all of the text from a PDF file.

Initialize and load the PDF Parser library.
Specify the source PDF file from which the text content will be extracted.
Parse the PDF file using the parseFile() method of the PDF Parser class.
Extract the text from the parsed document using its getText() method.

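<?php
// Load the library through Composer's autoloader
include 'vendor/autoload.php';
// Initialize the parser and set the source PDF file
$parser = new \Smalot\PdfParser\Parser();
$PDFfile = 'test.pdf';
// Parse the file and extract the text from all pages
$PDF = $parser->parseFile($PDFfile);
$PDFContent = $PDF->getText();
// Display the text, converting newlines to <br> tags
echo nl2br($PDFContent);
?>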
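
Text pulled from a PDF can contain characters such as < and & that the browser would otherwise read as markup. The following is a minimal sketch of escaping the text before applying nl2br(). The helper name formatExtractedText() is hypothetical and is not part of the PDF Parser library.

```php
<?php
// Hypothetical helper (not part of PDF Parser): prepares extracted text for HTML output.
function formatExtractedText(string $text): string
{
    // Escape HTML special characters first so PDF content cannot inject markup.
    $escaped = htmlspecialchars($text, ENT_QUOTES | ENT_SUBSTITUTE, 'UTF-8');

    // Then insert <br> tags before each newline (false selects <br> over <br />).
    return nl2br($escaped, false);
}

// Example with a made-up string standing in for text returned by getText().
echo formatExtractedText("Price < 10 & rising\nSecond line");
```

In the script above, the same helper could wrap the extracted text in place of the direct nl2br() call.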

You can explore more features in the PDF Parser library documentation.
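
As one example of those features, the parsed document also exposes getPages() and getDetails(), which return the individual pages and the file's metadata. The sketch below collects both into an array. The function name summarizePdf() is hypothetical, and the usage part assumes that Composer's autoloader and a local test.pdf are present.

```php
<?php
// Hypothetical helper: collects metadata and per-page text from a parsed PDF.
// $pdf is expected to be the object returned by Parser::parseFile().
function summarizePdf(object $pdf): array
{
    $pages  = [];
    $number = 0;
    foreach ($pdf->getPages() as $page) {
        // Store each page's text under a 1-based page number.
        $pages[++$number] = $page->getText();
    }

    return [
        'details' => $pdf->getDetails(), // metadata array, contents depend on the file
        'pages'   => $pages,
    ];
}

// Usage: runs only when the autoloader and the sample file are available.
if (is_file('vendor/autoload.php') && is_file('test.pdf')) {
    include 'vendor/autoload.php';
    $parser  = new \Smalot\PdfParser\Parser();
    $summary = summarizePdf($parser->parseFile('test.pdf'));

    print_r($summary['details']);
    foreach ($summary['pages'] as $pageNumber => $text) {
        echo "Page {$pageNumber}: " . strlen($text) . " bytes of text\n";
    }
}
```

Keeping the text per page can be useful when each page of the PDF should map to its own section on the website.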

Upload PDF File and Extract Text

This code snippet demonstrates how to upload a PDF file with PHP and extract the text from it. First, define the HTML form elements for the file upload.

<form action="parse.php" method="POST" enctype="multipart/form-data">
    <div class="pdf-input">
        <label for="pdf">PDF File</label>
        <input type="file" id="pdf" name="pdf" placeholder="Select a PDF file" required="">
    </div>
    <input type="submit" name="submit" class="btn btn-large" value="Submit">
</form>

When the form is submitted, the selected file is sent to the server-side script (parse.php) for further processing.
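
For each submitted file, PHP records a status code in the error element of $_FILES, which the server-side script can check before it does anything else. The following is a rough sketch of turning that code into a message. The function name uploadErrorMessage() and the message wording are assumptions, while the UPLOAD_ERR_* constants are built into PHP.

```php
<?php
// Hypothetical helper: maps a PHP upload error code to a message,
// or returns null when the upload succeeded.
function uploadErrorMessage(int $code): ?string
{
    switch ($code) {
        case UPLOAD_ERR_OK:
            return null;
        case UPLOAD_ERR_INI_SIZE:
        case UPLOAD_ERR_FORM_SIZE:
            return 'The file is too large.';
        case UPLOAD_ERR_PARTIAL:
            return 'The file was only partially uploaded.';
        case UPLOAD_ERR_NO_FILE:
            return 'Please select a file.';
        default:
            return 'The upload failed (error code ' . $code . ').';
    }
}

// Usage in parse.php: 'pdf' matches the name of the file input in the form.
$code    = $_FILES['pdf']['error'] ?? UPLOAD_ERR_NO_FILE;
$message = uploadErrorMessage($code);
echo $message === null ? 'Upload OK' : $message;
```

Checking the status first helps separate a failed upload from a file that uploaded correctly but has the wrong type.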

Server-side script (parse.php) to extract text from PDF File:

You can upload the file and extract the data from the PDF using the code below.

Use $_FILES to retrieve the uploaded file's name.
Use the pathinfo() function with the PATHINFO_EXTENSION flag to get the file extension.
Check the extension to make sure that only PDF files are accepted.
Get the path of the uploaded file from tmp_name in $_FILES.
Use the PDF Parser library to parse the uploaded PDF file and extract the text content.
Use PHP's nl2br() function to format the text by inserting line breaks (<br>) before newlines (\n).

<?php
$PDFContent = '';
if (isset($_POST['submit'])) {
    // Check that a file was selected
    if (!empty($_FILES["pdf"]["name"])) {
        // Get the file name and its extension in lowercase
        $PDFfileName = basename($_FILES["pdf"]["name"]);
        $PDFfileType = strtolower(pathinfo($PDFfileName, PATHINFO_EXTENSION));
        // Allow only the pdf extension
        $allowTypes = array('pdf');
        if (in_array($PDFfileType, $allowTypes)) {
            // Load the PDF Parser library
            include 'vendor/autoload.php';
            $parser = new \Smalot\PdfParser\Parser();
            // Source file: temporary path of the uploaded file
            $PDFfile = $_FILES["pdf"]["tmp_name"];
            // Parse the file and extract the text
            $PDF = $parser->parseFile($PDFfile);
            $fileText = $PDF->getText();
            // Convert newlines to line breaks
            $PDFContent = nl2br($fileText);
        } else {
            $PDFContent = '<p>Only PDF files are allowed.</p>';
        }
    } else {
        $PDFContent = '<p>Please select a file.</p>';
    }
}
// Display content
echo $PDFContent;
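
The extension check in the script relies on the file name sent by the browser, which the user controls. As a hedged sketch of a stricter check, the helper below tests whether the file content starts with the "%PDF-" signature. The name looksLikePdf() is hypothetical. Some PDF files carry a few bytes before the signature, so this check may reject files that lenient readers accept.

```php
<?php
// Hypothetical helper: checks that a file starts with the "%PDF-" signature.
function looksLikePdf(string $path): bool
{
    if (!is_file($path) || !is_readable($path)) {
        return false;
    }

    $handle = fopen($path, 'rb');
    if ($handle === false) {
        return false;
    }

    // Read only the first five bytes, where the signature is expected.
    $header = fread($handle, 5);
    fclose($handle);

    return $header === '%PDF-';
}

// Usage in parse.php (assumed placement: before calling parseFile()).
$tmpPath = $_FILES['pdf']['tmp_name'] ?? '';
if ($tmpPath !== '' && is_uploaded_file($tmpPath) && looksLikePdf($tmpPath)) {
    echo 'File content looks like a PDF.';
} else {
    echo 'Not a valid PDF upload.';
}
```

The is_uploaded_file() call additionally confirms that the path refers to a file received through an HTTP POST upload.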