Convert PDF file to Base64 for indexing in Elasticsearch
I need to index PDF files in Elasticsearch. For this I need to convert the files to Base64. I will be using the Attachment Mapping.
I used the following Python code to convert the file to Base64 encoding:
from elasticsearch import Elasticsearch
import base64
import constants

def index_pdf(pdf_filename):
    encoded = ""
    with open(pdf_filename) as f:
        data = f.readlines()
        for line in data:
            encoded += base64.b64encode(f.readline())
    return encoded

if __name__ == "__main__":
    encoded_pdf = index_pdf("Test.pdf")
    INDEX_DSL = {
        "pdf_id": "1",
        "text": encoded_pdf
    }
    constants.ES_CLIENT.index(
        index=constants.INDEX_NAME,
        doc_type=constants.TYPE_NAME,
        body=INDEX_DSL,
        id="1"
    )
Index creation and document indexing both work fine. The only problem is that I don't think the file was encoded correctly. I tried encoding the same file with online tools and got a completely different result, which is larger than what I get with Python.
Here is the PDF file.
I tried to query for text data as suggested in the Plugin Documentation.
GET index_pdf/pdf/_search
{
  "query": {
    "match": {
      "text": "piece text"
    }
  }
}
It gives me zero hits. How should I go about it?
The encoding snippet is not correct: it opens the PDF file in text mode, but a PDF is binary data.
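There is a second problem in the question's loop: readlines() has already consumed the whole file, so every later f.readline() returns an empty value and nothing gets encoded. The sketch below illustrates this on a temporary file. The helper names and the sample bytes are made up for illustration, and the loop is opened in binary mode here only so that it also runs on Python 3.

```python
import base64
import os
import tempfile

def encode_like_question(path):
    # Same loop shape as in the question: readlines() reads to the end of
    # the file, so each f.readline() afterwards returns an empty value.
    encoded = b""
    with open(path, "rb") as f:
        data = f.readlines()
        for line in data:
            encoded += base64.b64encode(f.readline())
    return encoded

def encode_whole_file(path):
    # Read all bytes in binary mode and encode them in a single call.
    with open(path, "rb") as f:
        return base64.b64encode(f.read())

# Arbitrary sample bytes standing in for a real PDF (an assumption).
sample = b"%PDF-1.4\n\x00\x01\x02binary\r\ndata\xff\n"
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(sample)
    path = tmp.name
try:
    broken = encode_like_question(path)
    correct = encode_whole_file(path)
finally:
    os.remove(path)

print(len(broken), len(correct))            # the first length is 0
print(base64.b64decode(correct) == sample)  # the round trip restores the bytes
```

Decoding the result and comparing it with the original bytes is a quick way to check any encoder before sending its output to Elasticsearch.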
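Depending on the size of the file, you can simply open it in binary mode, read it in one go, and Base64-encode the bytes. In Python 2 the string method .encode("base64") does this; base64.b64encode works on both Python 2 and 3. Example:
import base64
def pdf_encode(pdf_filename):
    with open(pdf_filename, "rb") as f:
        return base64.b64encode(f.read()).decode("ascii")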
If the file is large, you may need to break the encoding into chunks. I haven't checked whether there is a module for that, but it could be as simple as the example below. Code:
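import base64

def chunk_24_read(pdf_filename):
    # Yield the file 3 bytes (24 bits) at a time. Each 3-byte group maps to
    # exactly 4 Base64 characters, so padding can only occur in the last chunk.
    with open(pdf_filename, "rb") as f:
        byte = f.read(3)
        while byte:
            yield byte
            byte = f.read(3)

def pdf_encode(pdf_filename):
    encoded = ""
    length = 0
    for data in chunk_24_read(pdf_filename):
        # decode() turns the Base64 bytes into text, so this also runs on Python 3.
        for char in base64.b64encode(data).decode("ascii"):
            # Start a new line after every 76 characters.
            if length and length % 76 == 0:
                encoded += "\n"
                length = 0
            encoded += char
            length += 1
    return encoded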
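For what it's worth, the standard library does include a streaming encoder: base64.encode(input, output) reads a binary file object in chunks and writes Base64 lines of at most 76 characters. Below is a minimal sketch, assuming Python 3. The function name pdf_encode_stream and the sample bytes are made up for illustration.

```python
import base64
import io
import os
import tempfile

def pdf_encode_stream(pdf_filename):
    # base64.encode() reads the input file object piece by piece and writes
    # newline-terminated Base64 lines to the output file object.
    out = io.BytesIO()
    with open(pdf_filename, "rb") as f:
        base64.encode(f, out)
    return out.getvalue().decode("ascii")

# Arbitrary binary sample standing in for a real PDF (an assumption).
sample = bytes(range(256)) * 4
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(sample)
    path = tmp.name
try:
    text = pdf_encode_stream(path)
finally:
    os.remove(path)

# b64decode() skips the newlines by default, so the round trip still works.
print(base64.b64decode(text) == sample)
```

If the line breaks are not wanted in the indexed field, text.replace("\n", "") removes them without affecting the encoded data.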