Add PDFs or DOCX files to your dataset in Python
To add a PDF or DOCX file, you'll first need to read the file and convert its content to text. You can use the pdfplumber library for PDF files and the python-docx library for DOCX files.
First, install the required libraries:
pip install pdfplumber python-docx
Then, you can use the following code to read a PDF or DOCX file and add its content to your dataset:
import pdfplumber
from docx import Document
from embedbase_client.client import EmbedbaseClient
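def read_pdf(file_path):
    # extract_text() may return None for a page with no extractable text
    # (for example a scanned image), so fall back to an empty string.
    with pdfplumber.open(file_path) as pdf:
        pages = [page.extract_text() or "" for page in pdf.pages]
    # Join pages with a newline so words at page boundaries do not run together.
    return "\n".join(pages)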
def read_docx(file_path):
    doc = Document(file_path)
    content = ""
    for paragraph in doc.paragraphs:
        content += paragraph.text + "\n"
    return content
# Replace the second argument with your API key from https://app.embedbase.xyz/
embedbase = EmbedbaseClient('https://api.embedbase.xyz', '<grab me here https://app.embedbase.xyz/>')
pdf_file_path = "path/to/your/pdf_file.pdf"
docx_file_path = "path/to/your/docx_file.docx"
pdf_content = read_pdf(pdf_file_path)
docx_content = read_docx(docx_file_path)
dataset_id = 'document-content'
# Split both documents into chunks and add them to the dataset in batches
data = embedbase.dataset(dataset_id).chunk_and_batch_add([{'data': pdf_content}, {'data': docx_content}])
print(data)
Replace path/to/your/pdf_file.pdf and path/to/your/docx_file.docx with the actual file paths of your PDF and DOCX files.
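If you have more than two files, it may be easier to pick the reader by file extension and build the list of documents in one pass. The following is a minimal sketch, not part of the Embedbase client: build_documents and its readers argument are hypothetical names introduced here for illustration. It uses only the standard library and returns the same list of {'data': ...} dictionaries that is passed to chunk_and_batch_add above.

```python
from pathlib import Path
from typing import Callable, Dict, Iterable, List


def build_documents(
    paths: Iterable[str],
    readers: Dict[str, Callable[[str], str]],
) -> List[dict]:
    """Read each file with the reader registered for its extension.

    Hypothetical helper: `readers` maps a lowercase extension such as
    ".pdf" to a function that takes a file path and returns its text.
    Files with no registered reader, or with no text, are skipped.
    """
    documents = []
    for path in paths:
        suffix = Path(path).suffix.lower()
        reader = readers.get(suffix)
        if reader is None:
            continue  # unsupported file type
        text = reader(path).strip()
        if text:
            documents.append({"data": text})
    return documents
```

With the two functions defined earlier, you could call build_documents(paths, {".pdf": read_pdf, ".docx": read_docx}) and pass the result to chunk_and_batch_add.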