Okay, so I wanted to mess around with PDFs in Python, specifically using the `tampapdf` library. I’d heard it was good for scraping data from tables in PDFs, and I had a project in mind. Here’s how it went down:
Getting Started
First, I made sure I had Python installed. Check! Then, I needed the `tampapdf` library itself. I opened up my command prompt (I’m on Windows) and typed:
pip install tampapdf
It downloaded and installed everything, so I figured I was good to go.
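If you want to double-check that the install actually landed in the Python you’re running, here’s a minimal sketch using only the standard library. It assumes the import name matches the package name (`tampapdf`), which is what my import line later on uses; the helper name `is_installed` is just something I made up.

```python
import importlib.util


def is_installed(module_name: str) -> bool:
    """Return True if Python can locate the named top-level module."""
    return importlib.util.find_spec(module_name) is not None


if __name__ == "__main__":
    name = "tampapdf"  # assumed import name, same as the pip package name
    print(f"{name} importable: {is_installed(name)}")
```

If this prints `False` right after a successful `pip install`, the usual culprit is `pip` pointing at a different Python than the one running the script.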
The PDF Struggle
Next, I needed a PDF to play with. I found a sample PDF online and grabbed it for testing.
I saved it to my computer in the same folder as my Python script, just to keep things simple. I named it “*”.
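Since everything depends on the PDF really being where I think it is, a quick sanity check saves some head-scratching. This is a small sketch with `pathlib`; the file name `sample.pdf` is a placeholder (not my actual file name), and it assumes the command prompt is opened in the project folder.

```python
from pathlib import Path


def looks_like_pdf(folder: Path, filename: str) -> bool:
    """Return True if `filename` exists in `folder` and has a .pdf extension."""
    path = folder / filename
    return path.is_file() and path.suffix.lower() == ".pdf"


if __name__ == "__main__":
    # "sample.pdf" is a placeholder name, swap in the real file name.
    print(looks_like_pdf(Path.cwd(), "sample.pdf"))
```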
Coding Time
I created a new Python file (I called it `pdf_*`). Then, I started writing some code. It was trial and error:
I started by importing the `tampapdf` library, so at the top of my code I wrote:
from tampapdf import cli
I read in the `tampapdf` help that I needed to call it using `*`, so I wrote:
if __name__ == '__main__':
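For anyone who hasn’t seen that guard before: the code under it only runs when the file is started as a script, not when it’s imported. Here’s a generic sketch of how a script can accept subcommands like `discover` and `extract` from the command line. It uses only `argparse` from the standard library; `build_parser`, `main`, and the print are stand-ins I made up for illustration, not `tampapdf`’s API.

```python
import argparse
import sys


def build_parser() -> argparse.ArgumentParser:
    """Build a parser with two subcommands that each take a PDF path."""
    parser = argparse.ArgumentParser(description="Stand-in PDF tool")
    sub = parser.add_subparsers(dest="command", required=True)
    for name in ("discover", "extract"):
        cmd = sub.add_parser(name)
        cmd.add_argument("pdf_path", help="path to the PDF file")
    return parser


def main(argv=None) -> None:
    """Parse the arguments and report what would be run."""
    parser = build_parser()
    argv = sys.argv[1:] if argv is None else argv
    if not argv:
        # No subcommand given: show the help text instead of an error.
        parser.print_help()
        return
    args = parser.parse_args(argv)
    # A real tool would do the actual work here; this only echoes the request.
    print(f"would run {args.command!r} on {args.pdf_path!r}")


if __name__ == "__main__":
    main()
```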
Running and Testing
Next, I did a test run. I opened the command prompt in my project folder and called the script from the command line:
python pdf_* discover *
I got a bunch of output on my screen showing the table-like parts of the PDF.
Then, I tried to extract the text from the file by running:
python pdf_* extract *
It worked! I got the extracted text back in JSON format.
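Reading JSON off the screen gets old fast, so the natural next step is to capture it in Python. Here’s a rough sketch with `subprocess` and `json`. It assumes the command prints nothing but JSON to standard output, which I haven’t verified for every case, and `run_and_parse` is just a helper name I picked.

```python
import json
import subprocess
import sys
from typing import Any, Sequence


def run_and_parse(cmd: Sequence[str]) -> Any:
    """Run a command and parse its standard output as JSON.

    Raises subprocess.CalledProcessError if the command exits with an error,
    and json.JSONDecodeError if the output is not valid JSON.
    """
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return json.loads(result.stdout)


# Usage (script and file names are placeholders, not my real ones):
# data = run_and_parse([sys.executable, "my_script.py", "extract", "my_file.pdf"])
```

Using `sys.executable` instead of a bare `python` makes sure the child process runs under the same interpreter where the library was installed.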
What I Learned
Install is easy: Getting `tampapdf` set up with `pip` was super straightforward.
Command-line usage: `tampapdf` is primarily used through the command line. The basic syntax I used was `python pdf_* discover *` or `python pdf_* extract *`, replacing “*” with my actual file name.
Output: The `extract` command prints a lot of JSON text.
Keep practicing: I’m definitely going to keep playing with this. I think it could be really useful for automating some tedious tasks I have.
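On the “lot of JSON text” point: before digging into the details, it helps to see the overall shape of the data. This is a small sketch that summarizes any parsed JSON value without assuming anything about its layout. The sample data in the usage lines is made up, not real output from my PDF.

```python
import json
from typing import Any


def describe(value: Any, max_depth: int = 2, _depth: int = 0) -> str:
    """Return a short description of a parsed JSON value's structure."""
    if isinstance(value, dict):
        if _depth >= max_depth:
            return f"dict({len(value)} keys)"
        inner = ", ".join(
            f"{key!r}: {describe(val, max_depth, _depth + 1)}"
            for key, val in value.items()
        )
        return "{" + inner + "}"
    if isinstance(value, list):
        if not value or _depth >= max_depth:
            return f"list({len(value)})"
        # Only the first element is inspected, assuming the list is uniform.
        return f"list({len(value)}) of {describe(value[0], max_depth, _depth + 1)}"
    return type(value).__name__


if __name__ == "__main__":
    sample = json.loads('{"a": [1, 2], "b": "x"}')  # made-up sample data
    print(describe(sample))
```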
This was just my first quick experiment. There’s a lot more to explore with `tampapdf`, but this was a good start!