Documentation ¶
Overview ¶
Package gxpdf provides a modern, enterprise-grade PDF library for Go.
GxPDF is designed to be the reference PDF library for Go applications, offering a simple API for common tasks while exposing full control for advanced use cases.
Quick Start ¶
Open a PDF and extract tables:
doc, err := gxpdf.Open("invoice.pdf")
if err != nil {
	log.Fatal(err)
}
defer doc.Close()
tables := doc.ExtractTables()
for _, table := range tables {
	fmt.Println(table.Rows())
}
Architecture ¶
The library follows modern Go best practices (2025+):
- Root package for core API (gxpdf.Open, gxpdf.Document, gxpdf.Table)
- Subpackages for specialized functionality (export/, creator/)
- Internal packages for implementation details
Features ¶
- PDF reading and parsing
- Table extraction with 4-Pass Hybrid detection (100% accuracy on bank statements)
- Text extraction with position information
- Export to CSV, JSON, Excel
- PDF creation (coming soon)
Thread Safety ¶
Document instances are safe for concurrent read operations. Write operations should be synchronized by the caller. For PDF creation, use the creator package; each Creator instance should be used from a single goroutine.
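The concurrency contract above can be sketched with standard library primitives. The pageStore type below is a hypothetical stand-in, not a gxpdf type; it only illustrates the pattern of lock-free concurrent reads combined with caller-side synchronization for shared writes.

```go
package main

import (
	"fmt"
	"strings"
	"sync"
)

// pageStore is a hypothetical stand-in for an opened document: its pages
// are never modified after construction, so concurrent reads are safe.
type pageStore struct {
	pages []string
}

// Text returns the text of page i (0-based); it only reads shared state.
func (s *pageStore) Text(i int) string { return s.pages[i] }

// countWords reads every page from its own goroutine and sums the word
// counts. The shared total is written by all goroutines, so the caller
// guards it with a mutex.
func countWords(s *pageStore) int {
	var (
		wg    sync.WaitGroup
		mu    sync.Mutex
		total int
	)
	for i := range s.pages {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			n := len(strings.Fields(s.Text(i))) // concurrent read, no lock
			mu.Lock()
			total += n // synchronized write
			mu.Unlock()
		}(i)
	}
	wg.Wait()
	return total
}

func main() {
	s := &pageStore{pages: []string{"a b c", "d e", "f"}}
	fmt.Println(countWords(s)) // 6
}
```

Running a sketch like this under the race detector (go run -race) is a cheap way to confirm that only the guarded variable is written concurrently.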
Index ¶
- Constants
- Variables
- func IsCorrupted(err error) bool
- func IsEncrypted(err error) bool
- type Document
- func (d *Document) Author() string
- func (d *Document) Close() error
- func (d *Document) Creator() string
- func (d *Document) ExtractTables() []*Table
- func (d *Document) ExtractTablesFromPage(pageNum int) []*Table
- func (d *Document) ExtractTablesWithOptions(opts *ExtractionOptions) ([]*Table, error)
- func (d *Document) ExtractTextFromPage(pageNum int) (string, error)
- func (d *Document) GetImages() []*Image
- func (d *Document) GetImagesWithError() ([]*Image, error)
- func (d *Document) Info() *DocumentInfo
- func (d *Document) IsEncrypted() bool
- func (d *Document) Keywords() string
- func (d *Document) Page(index int) *Page
- func (d *Document) PageCount() int
- func (d *Document) Pages() []*Page
- func (d *Document) Path() string
- func (d *Document) Producer() string
- func (d *Document) Subject() string
- func (d *Document) Title() string
- func (d *Document) Version() string
- type DocumentInfo
- type ExtractionMethod
- type ExtractionOptions
- type Image
- func (img *Image) BitsPerComponent() int
- func (img *Image) ColorSpace() string
- func (img *Image) Filter() string
- func (img *Image) Height() int
- func (img *Image) Name() string
- func (img *Image) SaveToFile(path string) error
- func (img *Image) String() string
- func (img *Image) ToGoImage() (image.Image, error)
- func (img *Image) Width() int
- type Page
- func (p *Page) ExtractTables() []*Table
- func (p *Page) ExtractTablesWithOptions(opts *ExtractionOptions) ([]*Table, error)
- func (p *Page) ExtractText() string
- func (p *Page) GetImages() []*Image
- func (p *Page) GetImagesWithError() ([]*Image, error)
- func (p *Page) Index() int
- func (p *Page) Number() int
- type Table
- func (t *Table) Cell(row, col int) string
- func (t *Table) ColumnCount() int
- func (t *Table) ExportCSV(w io.Writer) error
- func (t *Table) ExportExcel(w io.Writer) error
- func (t *Table) ExportJSON(w io.Writer) error
- func (t *Table) Internal() *internaltable.Table
- func (t *Table) IsEmpty() bool
- func (t *Table) Method() string
- func (t *Table) PageNumber() int
- func (t *Table) RowCount() int
- func (t *Table) Rows() [][]string
- func (t *Table) String() string
- func (t *Table) ToCSV() (string, error)
- func (t *Table) ToJSON() (string, error)
Constants ¶
const Version = "0.1.0-alpha"
Version is the current version of the gxpdf library.
Variables ¶
var (
	// ErrInvalidPDF is returned when the file is not a valid PDF.
	ErrInvalidPDF = errors.New("gxpdf: invalid PDF file")

	// ErrEncrypted is returned when the PDF is encrypted and no password was provided.
	ErrEncrypted = errors.New("gxpdf: PDF is encrypted")

	// ErrWrongPassword is returned when the provided password is incorrect.
	ErrWrongPassword = errors.New("gxpdf: wrong password")

	// ErrCorrupted is returned when the PDF structure is corrupted.
	ErrCorrupted = errors.New("gxpdf: PDF file is corrupted")

	// ErrPageNotFound is returned when the requested page does not exist.
	ErrPageNotFound = errors.New("gxpdf: page not found")

	// ErrNoTables is returned when no tables were found on the page.
	ErrNoTables = errors.New("gxpdf: no tables found")

	// ErrUnsupportedFeature is returned for PDF features not yet implemented.
	ErrUnsupportedFeature = errors.New("gxpdf: unsupported PDF feature")
)
Common errors returned by gxpdf functions.
Functions ¶
func IsCorrupted ¶
IsCorrupted returns true if the error indicates a corrupted PDF.
func IsEncrypted ¶
IsEncrypted returns true if the error indicates an encrypted PDF.
Types ¶
type Document ¶
type Document struct {
// contains filtered or unexported fields
}
Document represents an opened PDF document.
Document provides methods for reading document properties and extracting content. It must be closed after use to release resources.
Example:
doc, err := gxpdf.Open("document.pdf")
if err != nil {
	log.Fatal(err)
}
defer doc.Close()
fmt.Printf("Pages: %d\n", doc.PageCount())
tables := doc.ExtractTables()
fmt.Printf("Tables: %d\n", len(tables))
func MustOpen ¶
MustOpen opens a PDF file and panics on error.
This is useful for initialization in tests or when the file is known to exist.
Example:
doc := gxpdf.MustOpen("known-good.pdf")
defer doc.Close()
func Open ¶
Open opens a PDF file and returns a Document for reading.
This is the main entry point for reading PDF files. The returned Document must be closed after use.
Example:
doc, err := gxpdf.Open("document.pdf")
if err != nil {
	log.Fatal(err)
}
defer doc.Close()
fmt.Printf("Pages: %d\n", doc.PageCount())
func OpenWithContext ¶
OpenWithContext opens a PDF file with a custom context.
The context can be used for cancellation and timeouts.
Example:
ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
defer cancel()
doc, err := gxpdf.OpenWithContext(ctx, "large-document.pdf")
func (*Document) Close ¶
Close closes the document and releases resources.
It is safe to call Close multiple times.
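One common way to make Close safe to call repeatedly is sync.Once. The resource type below is hypothetical and is not how gxpdf is known to implement it; it only shows why combining an explicit Close with a deferred one is harmless under such a guarantee.

```go
package main

import (
	"fmt"
	"sync"
)

// resource is a hypothetical stand-in for a type that owns a file handle.
type resource struct {
	once     sync.Once
	closeErr error
	releases int // how many times the underlying handle was released
}

// Close releases the handle on the first call only. Later calls are no-ops
// that return the result of the first call.
func (r *resource) Close() error {
	r.once.Do(func() {
		r.releases++
		r.closeErr = nil // a real implementation would store the handle's close error
	})
	return r.closeErr
}

func main() {
	r := &resource{}
	defer r.Close() // deferred cleanup still works after an explicit Close
	if err := r.Close(); err != nil {
		fmt.Println("close failed:", err)
	}
	fmt.Println(r.releases) // 1
}
```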
func (*Document) ExtractTables ¶
ExtractTables extracts all tables from all pages.
This is the simplest way to extract tables. It uses automatic detection with the 4-Pass Hybrid algorithm for best accuracy.
Example:
tables := doc.ExtractTables()
for _, t := range tables {
	fmt.Printf("Table on page %d: %d rows x %d cols\n",
		t.PageNumber(), t.RowCount(), t.ColumnCount())
}
func (*Document) ExtractTablesFromPage ¶
ExtractTablesFromPage extracts tables from a specific page (1-based).
func (*Document) ExtractTablesWithOptions ¶
func (d *Document) ExtractTablesWithOptions(opts *ExtractionOptions) ([]*Table, error)
ExtractTablesWithOptions extracts tables with custom options.
Example:
opts := &gxpdf.ExtractionOptions{
	Method: gxpdf.MethodLattice,
	Pages:  []int{0, 1, 2},
}
tables, err := doc.ExtractTablesWithOptions(opts)
func (*Document) ExtractTextFromPage ¶
ExtractTextFromPage extracts text from a specific page (1-based).
func (*Document) GetImages ¶
GetImages extracts all images from all pages in the document.
This is the simplest way to extract images. It returns every image found across all pages.
Example:
images := doc.GetImages()
for i, img := range images {
	fmt.Printf("Image %d: %dx%d, %s\n", i, img.Width(), img.Height(), img.ColorSpace())
	if err := img.SaveToFile(fmt.Sprintf("image_%d.jpg", i)); err != nil {
		log.Printf("saving image %d: %v", i, err)
	}
}
func (*Document) GetImagesWithError ¶
GetImagesWithError extracts all images from all pages, returning any errors.
Use this when you need error handling for image extraction.
func (*Document) IsEncrypted ¶
IsEncrypted returns true if the document is encrypted.
func (*Document) Page ¶
Page returns the page at the given index (0-based).
Returns nil if the index is out of bounds.
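Because Page takes a 0-based index and returns nil when out of range, callers need a nil check before using the result. The sketch below uses hypothetical stand-in types, not gxpdf's own, and it assumes that Number is simply Index plus one, which the method names suggest but this documentation does not state.

```go
package main

import "fmt"

// page and document are hypothetical stand-ins used only to show the
// indexing rules; they are not gxpdf types.
type page struct{ index int }

// Index is the 0-based position; Number is assumed to be the 1-based page number.
func (p *page) Index() int  { return p.index }
func (p *page) Number() int { return p.index + 1 }

type document struct{ pages []*page }

// Page returns the page at a 0-based index, or nil when out of range.
func (d *document) Page(index int) *page {
	if index < 0 || index >= len(d.pages) {
		return nil
	}
	return d.pages[index]
}

func main() {
	d := &document{pages: []*page{{index: 0}, {index: 1}, {index: 2}}}

	// Always check for nil before calling methods on the result.
	if p := d.Page(2); p != nil {
		fmt.Println(p.Index(), p.Number()) // 2 3
	}
	fmt.Println(d.Page(3) == nil) // true
}
```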
func (*Document) Pages ¶
Pages returns all pages in the document as a slice.
Example:
for _, page := range doc.Pages() {
	text := page.ExtractText()
	fmt.Println(text)
}
type DocumentInfo ¶
type DocumentInfo struct {
	PageCount int
	Path      string
	Version   string
	Title     string
	Author    string
	Subject   string
	Keywords  string
	Creator   string
	Producer  string
	Encrypted bool
}
DocumentInfo contains metadata about a PDF document.
type ExtractionMethod ¶
type ExtractionMethod int
ExtractionMethod specifies the table detection algorithm.
const (
	// MethodAuto automatically selects the best method.
	// Uses Lattice if ruling lines are detected, otherwise Stream.
	MethodAuto ExtractionMethod = iota

	// MethodLattice uses ruling lines (borders) to detect tables.
	// Best for tables with visible borders.
	MethodLattice

	// MethodStream uses whitespace analysis to detect tables.
	// Best for tables without borders.
	MethodStream

	// MethodHybrid uses the 4-Pass Hybrid algorithm.
	// Best accuracy for complex tables like bank statements.
	MethodHybrid
)
func (ExtractionMethod) String ¶
func (m ExtractionMethod) String() string
String returns the name of the extraction method.
type ExtractionOptions ¶
type ExtractionOptions struct {
	// Method specifies the table detection algorithm.
	// Default: MethodAuto
	Method ExtractionMethod

	// Pages specifies which pages to process (0-based indices).
	// Empty slice means all pages.
	Pages []int

	// MinRowHeight is the minimum height for a row in points.
	// Rows shorter than this are merged with adjacent rows.
	// Default: 0 (auto-detect)
	MinRowHeight float64

	// MinColumnWidth is the minimum width for a column in points.
	// Default: 0 (auto-detect)
	MinColumnWidth float64

	// MergeMultilineRows merges cells that span multiple lines.
	// Default: true
	MergeMultilineRows bool
}
ExtractionOptions configures table extraction behavior.
func DefaultExtractionOptions ¶
func DefaultExtractionOptions() *ExtractionOptions
DefaultExtractionOptions returns the default extraction options.
func (*ExtractionOptions) WithMergeMultilineRows ¶
func (o *ExtractionOptions) WithMergeMultilineRows(merge bool) *ExtractionOptions
WithMergeMultilineRows enables or disables multiline row merging.
func (*ExtractionOptions) WithMethod ¶
func (o *ExtractionOptions) WithMethod(method ExtractionMethod) *ExtractionOptions
WithMethod sets the extraction method.
func (*ExtractionOptions) WithPages ¶
func (o *ExtractionOptions) WithPages(pages ...int) *ExtractionOptions
WithPages sets the pages to process.
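The With* methods return the receiver, which allows chained configuration. The sketch below redeclares a reduced version of the types locally so it compiles on its own; the method bodies are assumptions based on the signatures above, not the library's source.

```go
package main

import "fmt"

// Reduced local copies of the types documented above.
type ExtractionMethod int

const (
	MethodAuto ExtractionMethod = iota
	MethodLattice
	MethodStream
	MethodHybrid
)

type ExtractionOptions struct {
	Method             ExtractionMethod
	Pages              []int
	MergeMultilineRows bool
}

// DefaultExtractionOptions applies the documented defaults, including
// MergeMultilineRows = true, which a zero-value struct would not have.
func DefaultExtractionOptions() *ExtractionOptions {
	return &ExtractionOptions{Method: MethodAuto, MergeMultilineRows: true}
}

// Each setter mutates the receiver and returns it so calls can be chained.
func (o *ExtractionOptions) WithMethod(m ExtractionMethod) *ExtractionOptions {
	o.Method = m
	return o
}

func (o *ExtractionOptions) WithPages(pages ...int) *ExtractionOptions {
	o.Pages = pages
	return o
}

func (o *ExtractionOptions) WithMergeMultilineRows(merge bool) *ExtractionOptions {
	o.MergeMultilineRows = merge
	return o
}

func main() {
	opts := DefaultExtractionOptions().
		WithMethod(MethodLattice).
		WithPages(0, 1, 2).
		WithMergeMultilineRows(false)
	fmt.Printf("%d %v %t\n", opts.Method, opts.Pages, opts.MergeMultilineRows) // 1 [0 1 2] false
}
```

If the defaults are applied only by DefaultExtractionOptions, as assumed here, a plain struct literal leaves MergeMultilineRows at false, so starting from the constructor is the safer habit.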
type Image ¶
type Image struct {
// contains filtered or unexported fields
}
Image represents an image extracted from a PDF.
This is a thin wrapper around the internal Image value object, providing a clean public API.
Example:
images := doc.GetImages()
for i, img := range images {
fmt.Printf("Image %d: %dx%d\n", i, img.Width(), img.Height())
img.SaveToFile(fmt.Sprintf("image_%d.jpg", i))
}
func (*Image) BitsPerComponent ¶
BitsPerComponent returns bits per color component (typically 8).
func (*Image) ColorSpace ¶
ColorSpace returns the PDF color space name.
Common values: "DeviceRGB", "DeviceGray", "DeviceCMYK", "Indexed"
func (*Image) Filter ¶
Filter returns the original PDF filter used for compression.
Common values: "/DCTDecode" (JPEG), "/FlateDecode" (zlib)
func (*Image) SaveToFile ¶
SaveToFile saves the image to a file.
The file format is determined by the extension:
- .jpg, .jpeg: JPEG format (best for DCTDecode images)
- .png: PNG format (best for lossless images)
For DCTDecode (JPEG) images, the original data is saved directly without re-encoding, preserving quality.
Example:
err := img.SaveToFile("extracted_image.jpg")
if err != nil {
	log.Fatal(err)
}
func (*Image) ToGoImage ¶
ToGoImage converts the image to Go's standard image.Image.
This is useful for further processing with Go's image libraries.
Example:
goImg, err := img.ToGoImage()
if err != nil {
	log.Fatal(err)
}
// Process with Go image libraries. Here resize is a third-party package
// (for example github.com/nfnt/resize), not part of gxpdf.
resized := resize.Resize(100, 100, goImg, resize.Lanczos3)
type Page ¶
type Page struct {
// contains filtered or unexported fields
}
Page represents a single page in a PDF document.
func (*Page) ExtractTables ¶
ExtractTables extracts all tables from this page.
Example:
tables := page.ExtractTables()
for _, t := range tables {
	fmt.Println(t.Rows())
}
func (*Page) ExtractTablesWithOptions ¶
func (p *Page) ExtractTablesWithOptions(opts *ExtractionOptions) ([]*Table, error)
ExtractTablesWithOptions extracts tables with custom options.
func (*Page) ExtractText ¶
ExtractText extracts all text from the page.
Returns the text content as a single string.
Example:
text := page.ExtractText()
fmt.Println(text)
func (*Page) GetImages ¶
GetImages extracts all images from this page.
Returns all images found on the page as a slice.
Example:
images := page.GetImages()
for i, img := range images {
	fmt.Printf("Image %d: %dx%d\n", i, img.Width(), img.Height())
	if err := img.SaveToFile(fmt.Sprintf("page%d_image%d.jpg", page.Number(), i)); err != nil {
		log.Printf("saving image %d: %v", i, err)
	}
}
func (*Page) GetImagesWithError ¶
GetImagesWithError extracts all images from this page, returning any errors.
Use this when you need error handling for image extraction.
type Table ¶
type Table struct {
// contains filtered or unexported fields
}
Table represents an extracted table from a PDF document.
Table provides methods to access table data and export to various formats.
Example:
tables := doc.ExtractTables()
for _, t := range tables {
	rows := t.Rows()
	for _, row := range rows {
		fmt.Println(row)
	}
}
func (*Table) Cell ¶
Cell returns the text content of a cell at the given row and column.
Returns empty string if the position is out of bounds.
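The out-of-bounds behaviour described for Cell can be expressed on a plain 2D string slice. The cell function below is a hypothetical illustration of that contract, not the library's code; it also handles ragged rows, which is an assumption about how short rows would be treated.

```go
package main

import "fmt"

// cell returns rows[row][col], or "" when either index is out of range.
// Rows may have different lengths, so the column check is per row.
func cell(rows [][]string, row, col int) string {
	if row < 0 || row >= len(rows) {
		return ""
	}
	if col < 0 || col >= len(rows[row]) {
		return ""
	}
	return rows[row][col]
}

func main() {
	rows := [][]string{{"Date", "Amount"}, {"2024-01-05", "12.50"}}
	fmt.Printf("%q %q\n", cell(rows, 1, 1), cell(rows, 5, 0)) // "12.50" ""
}
```

Since an empty string is also a legitimate cell value, callers that must distinguish "empty" from "missing" can compare against RowCount and ColumnCount first.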
func (*Table) ColumnCount ¶
ColumnCount returns the number of columns in the table.
func (*Table) ExportExcel ¶
ExportExcel exports the table to Excel format.
func (*Table) ExportJSON ¶
ExportJSON exports the table to JSON format.
func (*Table) Internal ¶
func (t *Table) Internal() *internaltable.Table
Internal returns the internal table representation.
This is for advanced users who need access to cell bounds, alignment, and other detailed information.
func (*Table) Method ¶
Method returns the extraction method used ("Lattice", "Stream", or "Hybrid").
func (*Table) PageNumber ¶
PageNumber returns the page number where the table was found (0-based).
func (*Table) Rows ¶
Rows returns the table data as a 2D string slice.
This is the simplest way to access table data.
Directories ¶
| Path | Synopsis |
|---|---|
| cmd | |
| gxpdf (command) | Package main provides the gxpdf command-line interface. |
| gxpdf/commands | Package commands implements the gxpdf CLI commands. |
| creator | Package creator provides a high-level API for creating and modifying PDF documents. |
| forms | Package forms provides interactive form field support for PDF documents. |
| | Example: Creating a document with chapters and table of contents |
| acroform-buttons (command) | Package main demonstrates AcroForm checkbox and radio button creation. |
| acroform-textfields (command) | Package main demonstrates creating PDF forms with text fields. |
| bookmarks (command) | Package main demonstrates PDF bookmark/outline support. |
| complex_shapes (command) | Package main demonstrates the use of complex vector shapes in PDF creation. |
| compression (command) | Package main demonstrates PDF stream compression using FlateDecode. |
| creator-annotations (command) | |
| creator/cmyk_colors (command) | Package main demonstrates CMYK color support in the Creator API. |
| custom-font (command) | |
| encryption (command) | Package main demonstrates RC4 encryption support in gxpdf. |
| encryption-aes (command) | Package main demonstrates AES encryption for PDF documents. |
| gradients (command) | Package main demonstrates gradient fills in PDF creation. |
| image-extraction (command) | Package main demonstrates how to extract images from PDF documents. |
| image_embedding (command) | |
| list (command) | |
| reader (command) | Package main demonstrates how to use the PDF Reader to read and inspect PDF files. |
| rotation (command) | Package main demonstrates page rotation in PDF documents. |
| styled_paragraph (command) | |
| table-detection (command) | Package main demonstrates table detection from PDF documents. |
| table-extraction (command) | Package main demonstrates table extraction from PDFs using gxpdf. |
| text-extraction (command) | Package main demonstrates text extraction with positional information. |
| text-rendering (command) | |
| watermark (command) | Package main demonstrates watermark functionality in GxPDF. |
| export | Package export provides table export functionality. |
| internal | |
| document | Package document provides the domain model for PDF document creation. |
| encoding | Package encoding implements PDF stream encoding and decoding filters. |
| extractor | Package extractor implements PDF content extraction use cases. |
| fonts | Package fonts provides the Standard 14 Type 1 fonts that are built into all PDF readers. |
| models/content | Package content defines the domain model for PDF page content. |
| models/table | Package table provides domain entities for PDF table extraction. |
| models/types | Package valueobjects contains value objects used across the domain. |
| parser | Package parser implements PDF lexical analysis (tokenization) according to PDF 1.7 specification, Section 7.2 (Lexical Conventions). |
| reader | Package reader provides application-layer PDF reading functionality. |
| security | Package security provides PDF encryption and security features. |
| tabledetect | Package detector implements table detection algorithms. |
| writer | Package writer implements PDF writing infrastructure. |