Working with PDF documents programmatically can be a challenging task, especially when you need to extract and manipulate text content. However, with the right tools and libraries, you can efficiently convert PDF text to a structured JSON format.

Converting PDF to JSON programmatically offers flexibility and customization, especially in dynamic runtime environments where reliance on external tools may not be feasible. While free tools exist, they may not always cater to specific runtime requirements or integrate seamlessly into existing systems.

Consider scenarios like real-time data extraction from PDF reports generated by various sources. During runtime, integrating with a specific tool might not be viable due to constraints such as security policies, network connectivity, or the need for real-time processing. In such cases, a custom-coded solution allows for on-the-fly conversion tailored to the application’s needs.

For example:

  • E-commerce Invoice Processing: Extracting invoice details and converting them to JSON for real-time database updates.
  • Healthcare Records Management: Converting patient records to JSON for integration with EHR systems, ensuring HIPAA compliance.
  • Legal Document Analysis: Extracting specific clauses and dates from legal documents for analysis.

Free tools often fall short for real-time, automated, and secure PDF to JSON conversion. Coding your own solution gives you direct control over efficiency, scalability, and compliance.

In this blog, we’ll walk through a Java program that accomplishes this using the iTextPDF and Jackson libraries. Screenshots are included to illustrate the process.

Introduction to Converting PDF to JSON in Java

PDF documents are ubiquitous in the modern world, used for everything from reports and ebooks to invoices and forms. They provide a versatile way to share formatted text, images, and even interactive content. Despite their convenience, PDFs can be difficult to work with programmatically, especially when you need to extract specific information from them.

You often need to extract text content from PDFs for purposes such as:

  • Data Analysis: Extracting textual data for analysis, reporting, or further processing.
  • Indexing: Creating searchable indexes for large collections of PDF documents.
  • Transformation: Converting PDF content into different formats like JSON, XML, or CSV for interoperability with other systems.

JSON (JavaScript Object Notation) is a lightweight data interchange format that’s easy for humans to read and write, and easy for machines to parse and generate. It is widely used in web applications, APIs, and configuration files due to its simplicity and versatility.

In this guide, we will explore how to convert the text content of a PDF file into a JSON format using Java. We’ll leverage the iTextPDF library for PDF text extraction and the Jackson library for JSON processing. This approach will allow us to take advantage of the structured nature of JSON to organize the extracted text in a meaningful way.

Prerequisites for Converting PDF to JSON in Java

Before we dive into the code, ensure you have the following prerequisites installed and configured:

  1. Java Development Kit (JDK)
  2. Maven for managing dependencies
  3. iTextPDF library for handling PDF documents
  4. Jackson library for JSON processing
  5. TestNG library for running the conversion as a test method

Step-by-Step Installation and Setup for Converting PDF to JSON in Java

Install Java Development Kit (JDK)

The JDK is a software development environment used for developing Java applications. To install the JDK:

  • Visit the Oracle JDK download page.
  • Download the appropriate installer for your operating system (Windows, macOS, or Linux).
  • Follow the installation instructions provided on the website.

Verify the installation by opening a command prompt or terminal and typing:

java -version

You should see output indicating the version of Java installed.


Install Maven

Maven is a build automation tool used primarily for Java projects. It helps manage project dependencies and build processes. To install Maven:

  • Visit the Maven download page.
  • Download the appropriate archive file for your operating system.
  • Extract the archive to a directory of your choice.
  • Add the bin directory of the extracted Maven folder to your system’s PATH environment variable.

Verify the installation by opening a command prompt or terminal and typing:

mvn -version

You should see output indicating the version of Maven installed.

Download IntelliJ IDEA

  1. Visit the Official Website: Go to the JetBrains IntelliJ IDEA download page.
  2. Install IntelliJ IDEA: Run the downloaded installer (Windows).
  3. Start IntelliJ IDEA: Open it from the Start menu (Windows).
  4. Complete Initial Setup: Import settings or start fresh.
  5. Start a New Project: Begin a new project or open an existing one.

Open IntelliJ IDEA

Launch IntelliJ IDEA on your computer.

Create or Open a Project

  • If you already have a project, open it. Otherwise, create a new project by selecting File > New > Project….
  • Name your project and select the project location.
  • Choose Java as the language.
  • Choose Maven as the build system.
  • Select the project SDK (JDK) and click Next.
  • Choose the project template (if any) and click Next.
  • Then click Create.

Create a New Java Class

  • In the Project tool window (usually on the left side), right-click the src → test → java directory, or whichever of its subdirectories should contain the new class.
  • Select New > Java Class from the context menu.

Name Your Class

  • In the dialog that appears, enter the name of your new class. For example, you can name it PdfToJsonConversion.
  • Click OK or press Enter.

Add the iText, Jackson, and TestNG dependencies to your pom.xml file for converting PDF to JSON in Java.
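A minimal sketch of such a dependency block is shown here. The artifact coordinates are the ones published for iText 7, Jackson, and TestNG, but the version numbers are assumptions chosen for illustration; substitute the releases you actually intend to use. TestNG is given test scope because the class in this guide lives under src → test → java.

```xml
<dependencies>
    <!-- iText 7 kernel module: PdfDocument, PdfReader, PdfTextExtractor (version is an assumption) -->
    <dependency>
        <groupId>com.itextpdf</groupId>
        <artifactId>kernel</artifactId>
        <version>7.2.5</version>
    </dependency>

    <!-- Jackson databind: ObjectMapper, ArrayNode, ObjectNode (version is an assumption) -->
    <dependency>
        <groupId>com.fasterxml.jackson.core</groupId>
        <artifactId>jackson-databind</artifactId>
        <version>2.15.2</version>
    </dependency>

    <!-- TestNG: the @Test annotation (version is an assumption) -->
    <dependency>
        <groupId>org.testng</groupId>
        <artifactId>testng</artifactId>
        <version>7.8.0</version>
        <scope>test</scope>
    </dependency>
</dependencies>
```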


Write Your Code to Convert PDF to JSON in Java

  • IntelliJ IDEA will create a new .java file with the name you provided.
  • You can start writing your Java code inside this file. 

The Java Program to Convert PDF to JSON

Here is the complete Java program that converts a PDF file to JSON:
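The listing below is a minimal sketch reconstructed from the step-by-step explanation in this guide, not a verbatim copy of the author's code (the repository linked in the conclusion holds that). The class, method, and variable names follow the explanation; the input file name, the JSON key names "page" and "lines", and the line-break pattern used for splitting are assumptions.

```java
import com.fasterxml.jackson.databind.ObjectMapper;
import com.fasterxml.jackson.databind.SerializationFeature;
import com.fasterxml.jackson.databind.node.ArrayNode;
import com.fasterxml.jackson.databind.node.ObjectNode;
import com.itextpdf.kernel.pdf.PdfDocument;
import com.itextpdf.kernel.pdf.PdfPage;
import com.itextpdf.kernel.pdf.PdfReader;
import com.itextpdf.kernel.pdf.canvas.parser.PdfTextExtractor;
import org.testng.annotations.Test;

import java.io.File;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

public class PdfToJsonConversion {

    @Test
    public static void convertPdfFileToJson() {
        // Assumed file names; point these at your own input and output locations.
        String inputPdfPath = "What is Software Testing.pdf";
        String outputJsonPath = "What is Software Testing.json";

        // Step 1: extract the text of every page into contentList.
        List<String> contentList = new ArrayList<>();
        try (PdfDocument pdfDocument = new PdfDocument(new PdfReader(inputPdfPath))) {
            int numberOfPages = pdfDocument.getNumberOfPages();
            for (int i = 1; i <= numberOfPages; i++) { // iText pages are 1-based
                PdfPage page = pdfDocument.getPage(i);
                contentList.add(PdfTextExtractor.getTextFromPage(page));
            }
        } catch (IOException e) {
            e.printStackTrace();
        }

        // Step 2: prepare Jackson with pretty printing and an array for the pages.
        ObjectMapper objectMapper = new ObjectMapper();
        objectMapper.enable(SerializationFeature.INDENT_OUTPUT);
        ArrayNode pagesArray = objectMapper.createArrayNode();

        // Step 3: one object per page, holding the page number and numbered lines.
        for (int i = 0; i < contentList.size(); i++) {
            ObjectNode pageNode = objectMapper.createObjectNode();
            pageNode.put("page", i + 1); // "page" is an assumed key name

            ObjectNode linesObject = objectMapper.createObjectNode();
            String[] lines = contentList.get(i).split("\\r?\\n"); // assumed delimiter
            for (int j = 0; j < lines.length; j++) {
                linesObject.put(String.valueOf(j + 1), lines[j]);
            }
            pageNode.set("lines", linesObject); // "lines" is an assumed key name
            pagesArray.add(pageNode);
        }

        // Step 4: write the JSON file and report where it went.
        File outputFile = new File(outputJsonPath);
        try {
            objectMapper.writeValue(outputFile, pagesArray);
        } catch (IOException e) {
            e.printStackTrace();
        }
        System.out.println("Content stored in " + outputFile.getName());
    }
}
```

Opening the PdfDocument in a try-with-resources block closes the file even if extraction fails partway through.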


Explanation

Let’s break down the code step by step:

1. Dependencies

Jackson Library:

ObjectMapper, SerializationFeature, ArrayNode, ObjectNode: These are from the Jackson library, used for creating and manipulating JSON objects.

iText Library:

PdfDocument, PdfPage, PdfReader, PdfTextExtractor: These classes are from the iText library, used for reading and extracting text from PDF documents.

TestNG Library:

@Test: An annotation from the TestNG library, used for marking the convertPdfFileToJson method as a test method.

Java Standard Library:

File, IOException, ArrayList, List: Standard Java classes for file operations, handling exceptions, and working with lists.

2. Test Annotation

The class PdfToJsonConversion contains a static method convertPdfFileToJson which is annotated with @Test, making it a test method in a TestNG test class.

3. Method convertPdfFileToJson

This method handles the core functionality of reading a PDF and converting its content to JSON.

4. Input and Output Paths

inputPdfPath specifies the PDF file location, and outputJsonPath defines where the resulting JSON file will be saved.

5. PDF to Text Conversion

  • Create a PdfDocument object using a PdfReader for the input PDF file.
  • Get the number of pages in the PDF.
  • Loop through each page, extract text using PdfTextExtractor, and add the text to contentList.
  • Handle any IOException that may occur.

6. Creating JSON Objects

  • Create an ObjectMapper for JSON manipulation.
  • Enable pretty printing with SerializationFeature.INDENT_OUTPUT.
  • Create an ArrayNode to hold the content of each page.

7. Adding Page Content to JSON

  • Iterate over contentList to process each page’s content.
  • For each page, create an ObjectNode and set the page number.
  • Split the page content into lines, then create another ObjectNode to hold each line with its number as the key.
  • Add the linesObject to the pageNode and then add the pageNode to pagesArray.
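The line-numbering part of this step can be tried on its own. The sketch below uses a plain LinkedHashMap in place of Jackson's ObjectNode so that it runs without any dependencies; the class name LineNumbering, the method name numberLines, and the line-break pattern are hypothetical choices made for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class LineNumbering {

    // Splits page text on line breaks and keys each line by its 1-based number.
    // LinkedHashMap keeps the lines in their original order.
    static Map<String, String> numberLines(String pageContent) {
        Map<String, String> lines = new LinkedHashMap<>();
        String[] parts = pageContent.split("\\r?\\n"); // assumed delimiter: LF or CRLF
        for (int i = 0; i < parts.length; i++) {
            lines.put(String.valueOf(i + 1), parts[i]);
        }
        return lines;
    }

    public static void main(String[] args) {
        // Prints {1=Invoice 42, 2=Total: 10.00}
        System.out.println(numberLines("Invoice 42\nTotal: 10.00"));
    }
}
```

Using the line number as a string key mirrors how JSON object keys work, since they are always strings.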

8. Writing JSON to File

  • Create a File object for the output JSON file.
  • Use the ObjectMapper to write pagesArray to the JSON file, handling any IOException.
  • Print a confirmation message indicating the completion of the process.

9. Output

The program outputs the name of the JSON file once the conversion is complete.

Running the Program

To run this program, ensure you have the required libraries in your project’s classpath. You can run it through your IDE or using a build tool like Maven.

  1. Open your IDE and load the project.
  2. Ensure dependencies are correctly set in your pom.xml.
  3. Run the test method convertPdfFileToJson.

You should see output similar to this in your console: Content stored in What is Software Testing.json. The JSON file will be created in the specified output path.

JSON Output Example

Here’s a snippet of what the JSON output might look like.
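As a rough illustration of the structure described in the explanation (an array with one object per page, each holding the page number and an object of numbered lines), the output could be shaped like this. The key names "page" and "lines" are assumptions, the text values are placeholders rather than real extracted content, and Jackson's pretty printer may lay out whitespace differently.

```json
[
  {
    "page": 1,
    "lines": {
      "1": "<first line of text on page 1>",
      "2": "<second line of text on page 1>"
    }
  },
  {
    "page": 2,
    "lines": {
      "1": "<first line of text on page 2>"
    }
  }
]
```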


Conclusion

Converting PDF text content to JSON can greatly simplify data processing and integration tasks. With Java and the iTextPDF and Jackson libraries, this task becomes straightforward and efficient. This guide provides a complete example to help you get started with your own PDF to JSON conversion projects.
https://github.com/mangesh-31/PdfToJsonConversion
