Convert PDF file to Base64 for indexing in Elasticsearch

Question

I need to index PDF files in Elasticsearch. For this I need to convert the files to Base64. I will be using the attachment type (mapper attachments plugin).

I used the following Python code to convert the file to Base64 encoding:

from elasticsearch import Elasticsearch
import base64
import constants

def index_pdf(pdf_filename):
    encoded = ""
    with open(pdf_filename) as f:
        data = f.readlines()
        for line in data:
            encoded += base64.b64encode(f.readline())
    return encoded

if __name__ == "__main__":
    encoded_pdf = index_pdf("Test.pdf")
    INDEX_DSL = {
        "pdf_id": "1",
        "text": encoded_pdf
    }
    constants.ES_CLIENT.index(
            index=constants.INDEX_NAME,
            doc_type=constants.TYPE_NAME,
            body=INDEX_DSL,
            id="1"
    )

Index creation and document indexing both work fine. The only problem is that I don't think the file was encoded correctly. I tried encoding the same file with online tools and got a completely different result, which is larger than what I get with Python.

Here is the PDF file.

I tried to query for text data as suggested in the Plugin Documentation.

GET index_pdf/pdf/_search
{
  "query": {
    "match": {
      "text": "piece text"
    }
  }
}

It gives me zero hits. How should I do this?

python pdf elasticsearch

Animesh pandey 08 jul. 15 at 21:34

1 answer

keety · Accepted Answer · 2015-07-09T18:09:33+0000

The encoding snippet is incorrect: it opens the PDF file in text mode, but a PDF is binary data and has to be read in binary mode.
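
The loop in the question has a second problem besides the file mode: `f.readlines()` already consumes the whole file, so every later `f.readline()` call returns an empty string and nothing is encoded. That would explain why the Python output is smaller than what the online tools produce. A minimal sketch of the effect, using a throwaway temporary file (the helper name `broken_encode` and the sample bytes are made up for illustration, and binary mode is used so it also runs on Python 3):

```python
import base64
import os
import tempfile


def broken_encode(path):
    # Mirrors the loop from the question, but with the file opened in binary mode.
    encoded = b""
    with open(path, "rb") as f:
        for _line in f.readlines():  # reads every line; the file is now at EOF
            encoded += base64.b64encode(f.readline())  # readline() returns b"" at EOF
    return encoded


# Hypothetical sample bytes standing in for a real PDF.
sample = b"%PDF-1.4\nline two\nline three\n"
fd, path = tempfile.mkstemp(suffix=".pdf")
with os.fdopen(fd, "wb") as tmp:
    tmp.write(sample)

result = broken_encode(path)
os.remove(path)
print(result)  # b''
```

Even with the double read fixed, encoding line by line and concatenating the pieces would put `=` padding in the middle of the output, so the file should be encoded as a whole or in chunks whose size is a multiple of 3 bytes.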

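If the file is small enough to fit in memory, you can simply open it in binary mode, read it in one go, and Base64-encode the bytes. On Python 2 the string method .encode("base64") is a shortcut for this, but base64.b64encode works on both Python 2 and 3. Example:

import base64

def pdf_encode(pdf_filename):
    # Binary mode ("rb") returns the raw bytes of the file unchanged.
    with open(pdf_filename, "rb") as f:
        return base64.b64encode(f.read()).decode("ascii")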

If the file is large, you may need to encode it in chunks. I haven't checked whether there is a module for that, but it could be as simple as the example below:

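import base64


def chunk_24_read(pdf_filename):
    # Yield the file 3 bytes (24 bits) at a time. 3 bytes map to exactly
    # 4 Base64 characters, so only the final chunk can need "=" padding.
    with open(pdf_filename, "rb") as f:
        chunk = f.read(3)
        while chunk:
            yield chunk
            chunk = f.read(3)


def pdf_encode(pdf_filename):
    encoded = []
    length = 0
    for data in chunk_24_read(pdf_filename):
        for char in base64.b64encode(data).decode("ascii"):
            # Start a new line after every 76 characters.
            if length and length % 76 == 0:
                encoded.append("\n")
                length = 0
            encoded.append(char)
            length += 1
    return "".join(encoded)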
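
On the question of whether a module already does this: the standard library's `base64` module has `base64.encode(input, output)`, which reads from a binary file object in small blocks and writes Base64 lines of at most 76 characters to another binary file object. A minimal sketch of using it for the chunked case follows; the helper name `pdf_encode_stream` is made up for illustration, and collecting the output in an in-memory buffer is an assumption that the encoded text still fits in memory.

```python
import base64
import io


def pdf_encode_stream(pdf_filename):
    # base64.encode reads the input piece by piece and writes
    # newline-terminated Base64 lines to the output file object.
    out = io.BytesIO()
    with open(pdf_filename, "rb") as f:
        base64.encode(f, out)
    # Base64 output is pure ASCII, so decoding to str is safe.
    return out.getvalue().decode("ascii")
```

Decoding the result with `base64.b64decode` should return the original bytes (it skips the line breaks by default), which is a quick way to confirm the encoding before indexing.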
