Java iText Read PDF Metadata Example Tutorial

This post in the example tutorial series describes how to read metadata from a PDF document using the Java iText library. For readers who are new to the concept of metadata, a short definition is provided below to get started:
Metadata is a structured collection of data that describes the characteristics of a PDF document. In PDF terms, this includes Document Title, Author, Subject and so on.
Now, when you create a PDF document it is important to add this metadata. You may also find yourself in situations where you have to read the metadata of an existing PDF document for various reasons. This post deals with the case where you have to access existing metadata in a PDF file using Java. We will present a working code example that takes you through reading the metadata step by step. To start with, I have created a basic PDF document and added some document information, which can be seen in the screenshot below (in Adobe Reader you can access this using File -> Document Properties):
PDF Metadata as seen from Adobe Reader
So, let us discuss how to read this information from an existing PDF document using iText. We will use the PdfReader object for this.
Step-1: Use a PdfReader object to open the incoming PDF document. The code below shows how to do this:
          PdfReader ReadInputPDF;
          ReadInputPDF = new PdfReader("sample.pdf");
Step-2: Use the "getMetadata" method of the PdfReader object to read the XMP metadata of the PDF file into a byte array. In iText 5 this method returns null when the document has no XMP metadata stream, so check for that before using the result. Once you have this byte array, you can either print it on the screen or write it to an XML file. Both approaches are shown below:
          byte Document_MetaData[]=ReadInputPDF.getMetadata();
          /* dumping metadata on the screen; XMP in a PDF is normally UTF-8 encoded */
          /* (StandardCharsets needs import java.nio.charset.StandardCharsets) */
          String strFileContent = new String(Document_MetaData, StandardCharsets.UTF_8);
          System.out.println("File content : ");
          System.out.println(strFileContent);
          /* writing metadata into an xml file; try-with-resources closes the stream */
          try (FileOutputStream fos = new FileOutputStream("test.xml")) {
              fos.write(Document_MetaData);
          }
If you open this XML file, you will find all the metadata of the document wrapped inside it. The contents of the XML file in my case are shown below. The values will differ based on your PDF file, but the structure of the XML will be the same.
<?xpacket begin="" id="W5M0MpCehiHzrczkc9d"?>
<x:xmpmeta xmlns:x="adobe:ns:meta/" x:xmptk="Adobe XMP Core 4.2.1-c041 52.342996, 2008/05/07-20:48:00        ">
   <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
      <rdf:Description rdf:about=""
            xmlns:xmp="http://ns.adobe.com/xap/1.0/">
         <xmp:ModifyDate>2011-05-11T20:32:21+10:00</xmp:ModifyDate>
         <xmp:CreateDate>2011-05-11T20:29:10+10:00</xmp:CreateDate>
         <xmp:MetadataDate>2011-05-11T20:32:21+10:00</xmp:MetadataDate>
      </rdf:Description>
      <rdf:Description rdf:about=""
            xmlns:dc="http://purl.org/dc/elements/1.1/">
         <dc:format>application/pdf</dc:format>
         <dc:title>
            <rdf:Alt>
               <rdf:li xml:lang="x-default">Test_PDF_Metadata</rdf:li>
            </rdf:Alt>
         </dc:title>
         <dc:creator>
            <rdf:Bag>
               <rdf:li>Thinktibits</rdf:li>
            </rdf:Bag>
         </dc:creator>
         <dc:description>
            <rdf:Alt>
               <rdf:li xml:lang="x-default">PDF Metadata</rdf:li>
            </rdf:Alt>
         </dc:description>
         <dc:subject>
            <rdf:Bag>
               <rdf:li>PDF</rdf:li>
               <rdf:li>Read Metadata</rdf:li>
            </rdf:Bag>
         </dc:subject>
         <dc:rights>
            <rdf:Alt>
               <rdf:li xml:lang="x-default">NA</rdf:li>
            </rdf:Alt>
         </dc:rights>
      </rdf:Description>
      <rdf:Description rdf:about=""
            xmlns:xmpMM="http://ns.adobe.com/xap/1.0/mm/">
         <xmpMM:DocumentID>uuid:a3ecde-4e3b-82f5-919f13e589c4</xmpMM:DocumentID>
         <xmpMM:InstanceID>uuid:4ae-09c4-4a02-9c58-87b6382179bf</xmpMM:InstanceID>
      </rdf:Description>
      <rdf:Description rdf:about=""
            xmlns:pdf="http://ns.adobe.com/pdf/1.3/">
         <pdf:Producer>Acrobat Web Capture 9.0</pdf:Producer>
         <pdf:Keywords>PDF; Read Metadata</pdf:Keywords>
      </rdf:Description>
      <rdf:Description rdf:about=""
            xmlns:photoshop="http://ns.adobe.com/photoshop/1.0/">
         <photoshop:AuthorsPosition>Mr</photoshop:AuthorsPosition>
         <photoshop:CaptionWriter>Typist</photoshop:CaptionWriter>
      </rdf:Description>
      <rdf:Description rdf:about=""
            xmlns:xmpRights="http://ns.adobe.com/xap/1.0/rights/">
         <xmpRights:Marked>False</xmpRights:Marked>
         <xmpRights:WebStatement>NA</xmpRights:WebStatement>
      </rdf:Description>
      <rdf:Description rdf:about=""
            xmlns:illustrator="http://ns.adobe.com/illustrator/1.0/">
         <illustrator:StartupProfile>Print</illustrator:StartupProfile>
      </rdf:Description>
      <rdf:Description rdf:about=""
            xmlns:pdfx="http://ns.adobe.com/pdfx/1.3/">
         <pdfx:Custom_Metadata>Test1</pdfx:Custom_Metadata>
         <pdfx:Custom_Metadata2>Test2</pdfx:Custom_Metadata2>
      </rdf:Description>
   </rdf:RDF>
</x:xmpmeta>
<?xpacket end="w"?>
If you want to read a specific metadata tag, you can parse this XML into a DOM object using DOM4J or an equivalent library and look up the element you need. The information that Adobe Reader shows on the screen is picked up from this XML, which is stored as a component of your PDF file. The complete Java code that reads the metadata of a PDF file using Java iText is shown below:

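import java.io.FileOutputStream;
import java.nio.charset.StandardCharsets;
import com.itextpdf.text.pdf.PdfReader;
public class ReadPDFMetaData{
     public static void main(String[] args){
        try {
          PdfReader ReadInputPDF = new PdfReader("sample.pdf");
          /* getMetadata returns the raw XMP packet, or null when the PDF has none */
          byte Document_MetaData[]=ReadInputPDF.getMetadata();
          ReadInputPDF.close();
          if (Document_MetaData == null) {
              System.out.println("No XMP metadata found in sample.pdf");
              return;
          }
          /* dumping metadata on the screen; XMP in a PDF is normally UTF-8 encoded */
          String strFileContent = new String(Document_MetaData, StandardCharsets.UTF_8);
          System.out.println("File content : ");
          System.out.println(strFileContent);
          /* writing metadata into an xml file; try-with-resources closes the stream */
          try (FileOutputStream fos = new FileOutputStream("test.xml")) {
              fos.write(Document_MetaData);
          }
        }
        catch (Exception i)
        {
            System.out.println(i);
        }
    }
}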
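To pick out one specific tag from the XML, as mentioned earlier, any DOM parser will do. The following is a minimal sketch, assuming the JDK's built-in DOM parser (javax.xml.parsers) is used instead of DOM4J. The class name XmpTagReader and the method readValues are hypothetical names chosen for this illustration, and the sample packet in main is a trimmed copy of the XML listed above. In a real program you would pass the byte array returned by getMetadata instead.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.SAXException;

/* Hypothetical helper for this tutorial; not part of iText. */
class XmpTagReader {
    static final String DC_NS = "http://purl.org/dc/elements/1.1/";
    static final String PDF_NS = "http://ns.adobe.com/pdf/1.3/";
    static final String RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#";

    /* Returns the values of one XMP property. Properties stored as rdf:Alt or
       rdf:Bag yield one entry per rdf:li; simple properties yield their own text. */
    static List<String> readValues(byte[] xmp, String namespace, String localName) {
        Document doc;
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true); // required to look up dc:, pdf:, rdf: names
            doc = factory.newDocumentBuilder().parse(new ByteArrayInputStream(xmp));
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalArgumentException("Could not parse XMP packet", e);
        }
        List<String> values = new ArrayList<>();
        NodeList props = doc.getElementsByTagNameNS(namespace, localName);
        for (int i = 0; i < props.getLength(); i++) {
            Element prop = (Element) props.item(i);
            NodeList items = prop.getElementsByTagNameNS(RDF_NS, "li");
            if (items.getLength() == 0) {
                values.add(prop.getTextContent().trim());
            }
            for (int j = 0; j < items.getLength(); j++) {
                values.add(items.item(j).getTextContent().trim());
            }
        }
        return values;
    }

    public static void main(String[] args) {
        /* Trimmed copy of the packet shown in this post */
        String sample =
              "<?xpacket begin=\"\" id=\"W5M0MpCehiHzrczkc9d\"?>"
            + "<x:xmpmeta xmlns:x=\"adobe:ns:meta/\">"
            + "<rdf:RDF xmlns:rdf=\"" + RDF_NS + "\">"
            + "<rdf:Description rdf:about=\"\" xmlns:dc=\"" + DC_NS + "\">"
            + "<dc:title><rdf:Alt>"
            + "<rdf:li xml:lang=\"x-default\">Test_PDF_Metadata</rdf:li>"
            + "</rdf:Alt></dc:title>"
            + "<dc:subject><rdf:Bag>"
            + "<rdf:li>PDF</rdf:li><rdf:li>Read Metadata</rdf:li>"
            + "</rdf:Bag></dc:subject>"
            + "</rdf:Description>"
            + "</rdf:RDF></x:xmpmeta>"
            + "<?xpacket end=\"w\"?>";
        byte[] xmp = sample.getBytes(StandardCharsets.UTF_8);
        System.out.println(readValues(xmp, DC_NS, "title"));   // prints [Test_PDF_Metadata]
        System.out.println(readValues(xmp, DC_NS, "subject")); // prints [PDF, Read Metadata]
    }
}
```

The lookup is done by namespace URI and local name rather than by prefix, because the prefix (dc, pdf and so on) is only a convention and a different producer may bind another prefix to the same namespace.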
In the upcoming post, we will discuss how to add this metadata to PDF files using iText.
The metadata is stored using Adobe's XMP (Extensible Metadata Platform) technology. You can find out more about it on Adobe's website.
