unpdf
Utilities to work with PDFs, like extracting text
A collection of utilities to work with PDFs. Uses Mozilla's PDF.js under the hood and lazily initializes the library.
unpdf takes advantage of export conditions to circumvent build issues in serverless environments. For example, PDF.js depends on the optional `canvas` module, which doesn't work inside worker threads.
This library is also intended as a modern alternative to the unmaintained but still popular [pdf-parse](https://www.npmjs.com/package/pdf-parse).
Features
- 🏗️ Conditional exports for Node.js, worker and browser environments
- 💬 Extract text and images from PDFs
- 🧱 Opt-in to legacy PDF.js build
Installation
Run the following command to add unpdf to your project.
```bash
# pnpm
pnpm add unpdf

# npm
npm install unpdf

# yarn
yarn add unpdf
```
Usage
```ts
// readFile is only needed for the filesystem variant
import { readFile } from 'node:fs/promises'
import { extractPDFText } from 'unpdf'

// Fetch a PDF file from the web
const pdf = await fetch('https://www.w3.org/WAI/ER/tests/xhtml/testfiles/resources/pdf/dummy.pdf')
  .then(res => res.arrayBuffer())

// Or load it from the filesystem instead
// const pdf = await readFile('./dummy.pdf')

// Pass the PDF buffer to the relevant method
const { totalPages, text } = await extractPDFText(
  new Uint8Array(pdf), { mergePages: true }
)
```
Use Legacy Or Custom PDF.js Build
```ts
// Before using any other methods, define the PDF.js module
import { defineUnPDFConfig } from 'unpdf'

// Use the legacy build (the method returns a promise, so await it)
await defineUnPDFConfig({
  pdfjs: () => import('pdfjs-dist/legacy/build/pdf.js')
})

// Now, you can use the other methods
// …
```
Access the PDF.js Module
```ts
import { getResolvedPDFJS } from 'unpdf'

const { version } = await getResolvedPDFJS()
```
Config
```ts
interface UnPDFConfiguration {
  /**
   * By default, UnPDF will use the latest version of PDF.js. If you want to
   * use an older version or the legacy build, set a function that returns a
   * promise resolving to the PDF.js module.
   *
   * @example
   * () => import('pdfjs-dist/legacy/build/pdf.js')
   */
  pdfjs?: () => Promise<typeof PDFJS>
}
```
Methods
defineUnPDFConfig
Define a custom PDF.js module, like the legacy build. Make sure to call this method before using any other methods.
```ts
function defineUnPDFConfig(config: UnPDFConfiguration): Promise<void>
```
getResolvedPDFJS
Returns the resolved PDF.js module. If no build is defined, the latest version will be initialized.
```ts
function getResolvedPDFJS(): Promise<typeof import('pdfjs-dist')>
```
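The advice to call `defineUnPDFConfig` first makes sense for any lazily initialized module: once a loader has run, swapping it out afterwards typically has no effect. The sketch below illustrates that pattern in isolation. It is not unpdf's source code; `createLazyResolver`, `define`, and `resolve` are hypothetical names, and how unpdf itself treats a late configuration is not stated in this README.

```ts
// Hypothetical sketch of a lazily initialized, configurable module loader.
// None of these names come from unpdf; they only illustrate the pattern.
type Loader<T> = () => Promise<T>

function createLazyResolver<T>(defaultLoader: Loader<T>) {
  let loader = defaultLoader
  let cached: Promise<T> | undefined

  return {
    // Swap in a custom loader. Only effective before the first resolve().
    define(custom: Loader<T>): void {
      loader = custom
    },
    // Invoke the loader on first use and reuse the same promise afterwards.
    resolve(): Promise<T> {
      if (!cached) cached = loader()
      return cached
    },
  }
}

// Usage: nothing is loaded until resolve() is called.
let loads = 0
const resolver = createLazyResolver(() => {
  loads++
  return Promise.resolve({ version: 'default' })
})
resolver.define(() => {
  loads++
  return Promise.resolve({ version: 'legacy' })
})
const firstResolve = resolver.resolve()
const secondResolve = resolver.resolve()
```

In this sketch, only the custom loader runs, and it runs once; both calls to `resolve()` share the same promise.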
getPDFMeta
```ts
function getPDFMeta(
  data: BinaryData | PDFDocumentProxy
): Promise<{
  info: Record<string, any>
  metadata: Record<string, any>
}>
```
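As a sketch of how the result might be consumed, the helper below pulls a few common fields out of the `info` record. The keys `Title`, `Author`, and `Producer` are the standard names from the PDF document information dictionary; whether a given file contains them is not guaranteed, so every field is treated as optional. `summarizeInfo`, `asText`, and `InfoSummary` are hypothetical names, not part of unpdf.

```ts
interface InfoSummary {
  title?: string
  author?: string
  producer?: string
}

// Return the value only if it is a non-empty string.
function asText(value: unknown): string | undefined {
  return typeof value === 'string' && value.trim() !== '' ? value : undefined
}

// Pick a few well-known fields and drop anything missing or empty.
function summarizeInfo(info: Record<string, any>): InfoSummary {
  return {
    title: asText(info.Title),
    author: asText(info.Author),
    producer: asText(info.Producer),
  }
}

// With unpdf, `info` would come from: const { info } = await getPDFMeta(data)
// Here a hand-written record stands in for it.
const sampleSummary = summarizeInfo({ Title: 'Dummy PDF', Author: '', Pages: 1 })
```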
extractPDFText
```ts
function extractPDFText(
  data: BinaryData | PDFDocumentProxy,
  { mergePages }?: { mergePages?: boolean }
): Promise<{
  totalPages: number
  text: string | string[]
}>
```
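Because `text` is typed as `string | string[]`, callers that do not control the `mergePages` option have to narrow the type before using it. The following is a minimal sketch, assuming that `text` is a single string when `mergePages` is `true` and one string per page otherwise (the signature suggests this, but the README does not spell it out). `toPages` and `toMergedText` are hypothetical helpers, not part of unpdf.

```ts
// Normalize the union to an array with one entry per page
// (or a single entry if the text was already merged).
function toPages(text: string | string[]): string[] {
  return Array.isArray(text) ? text : [text]
}

// Normalize the union to one string, joining pages with a separator.
function toMergedText(text: string | string[], separator = '\n'): string {
  return Array.isArray(text) ? text.join(separator) : text
}

// With unpdf, `text` would come from: const { text } = await extractPDFText(data)
const pagesFromArray = toPages(['page one', 'page two'])
const pagesFromString = toPages('all pages')
const mergedText = toMergedText(['page one', 'page two'])
```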
getImagesFromPage
```ts
function getImagesFromPage(
  data: BinaryData | PDFDocumentProxy,
  pageNumber: number
): Promise<ArrayBuffer[]>
```
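`getImagesFromPage` works on one page at a time, so collecting images from a whole document needs a loop over page numbers. The sketch below assumes page numbers start at 1, as they do in PDF.js's own `getPage`; this README does not state it for unpdf. `pageNumbers`, `collectImages`, and `ImageLoader` are hypothetical names, and the image loader is passed in as a parameter so the loop does not depend on a particular data source.

```ts
type ImageLoader = (pageNumber: number) => Promise<ArrayBuffer[]>

// Build [1, 2, ..., totalPages] (assumes 1-based page numbers).
function pageNumbers(totalPages: number): number[] {
  const numbers: number[] = []
  for (let page = 1; page <= totalPages; page++) numbers.push(page)
  return numbers
}

// Load images page by page, keeping track of which page each set came from.
async function collectImages(totalPages: number, loadImages: ImageLoader) {
  const result: { page: number, images: ArrayBuffer[] }[] = []
  for (const page of pageNumbers(totalPages)) {
    result.push({ page, images: await loadImages(page) })
  }
  return result
}

// With unpdf, the loader could be: page => getImagesFromPage(data, page)
// and totalPages could come from the result of extractPDFText(data).
```

Loading pages sequentially keeps memory use predictable; for small documents, `Promise.all` over `pageNumbers(totalPages)` would be a reasonable alternative.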
License
MIT License © 2023-PRESENT Johann Schopplich
