Serverless PDF to DOCX Converter

In this article we will create a serverless PDF to DOCX converter using the following tools:

  • Lambda

  • Docker

  • Python

  • S3

  • Serverless Application Repository

The app will upload PDF files to a source S3 bucket using an existing application available on the Serverless Application Repository. The upload event will trigger a Python Lambda function, packaged as a Docker image, which converts the file to DOCX format and saves it to a destination S3 bucket.
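
For orientation, the event notification the Lambda function receives looks roughly like this. This is an abridged sketch showing only the fields our code will read; the bucket and key values are illustrative:

      # Abridged S3 "object created" event (illustrative values)
      event = {
          "Records": [
              {
                  "s3": {
                      "bucket": {"name": "pdf2word-source"},
                      "object": {"key": "report.pdf"}
                  }
              }
          ]
      }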

Prerequisites:

  • Docker - See Link for guidance.

  • Python - See Link for guidance.

  • AWS CLI.

  • VS Code - Download Link.

  • Two S3 buckets, to act as a source and destination for the documents.

    1- Prepare the Docker image:

    • Create a working folder, create a new file named lambda-function.py, and paste in the code below:

    •       import json
            import os
            import urllib.parse

            import boto3
            from pdf2docx import Converter

            s3 = boto3.client('s3')

            def handler(event, context):
                bucket_name = event['Records'][0]['s3']['bucket']['name']
                # Object keys arrive URL-encoded in S3 event notifications
                file_key = urllib.parse.unquote_plus(event['Records'][0]['s3']['object']['key'])

                # Convert the PDF file to DOCX
                convert_pdf_to_docx(bucket_name, file_key)

                return {
                    'statusCode': 200,
                    'body': json.dumps('File conversion successful!')
                }

            def convert_pdf_to_docx(source_bucket, source_key):
                # Download the PDF file from the source bucket
                tmp_pdf_file = '/tmp/input.pdf'
                s3.download_file(source_bucket, source_key, tmp_pdf_file)

                # Work out where the converted file should go
                destination_bucket_name = '<S3-DESTINATION-BUCKET-NAME>'
                destination_object_key = source_key.replace('.pdf', '.docx')

                # Convert the PDF file to DOCX
                cv = Converter(tmp_pdf_file)
                cv.convert('/tmp/output.docx')
                cv.close()

                # Upload the DOCX file to the destination bucket
                s3.upload_file('/tmp/output.docx', destination_bucket_name, destination_object_key)

                # Clean up the temporary files
                os.remove(tmp_pdf_file)
                os.remove('/tmp/output.docx')

      Make sure to update the destination bucket name placeholder <S3-DESTINATION-BUCKET-NAME>.
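
      Before building the image, you can smoke-test the handler locally with a fabricated event. The sketch below assumes valid AWS credentials, that both buckets exist, and that an object named sample.pdf is already in the source bucket; all names are illustrative:

          # local_test.py - hypothetical local smoke test, run from the working folder
          import importlib

          # The module name contains a hyphen, so load it via importlib
          mod = importlib.import_module('lambda-function')

          fake_event = {
              'Records': [
                  {'s3': {'bucket': {'name': '<S3-SOURCE-BUCKET-NAME>'},
                          'object': {'key': 'sample.pdf'}}}
              ]
          }

          print(mod.handler(fake_event, None))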

    • Install the Python dependencies:

    •       pip install boto3
            pip install pdf2docx
      
    • Run the following commands to install pipreqs, which exports the Python dependencies to requirements.txt; Docker will use this file to install them inside the image.

    •     pip install pipreqs
          pipreqs .
      

      To let the build resolve compatible versions automatically, you can remove the pinned version numbers from requirements.txt, leaving just the package names (boto3 and pdf2docx).

    • Make sure the AWS CLI is configured, then run the command below to create a repository in ECR to host the image:

    •     aws ecr create-repository --repository-name pdf2word --region us-east-1
      
    • Now prepare the image by creating a Dockerfile in the same directory as requirements.txt and paste in the build instructions below. This Dockerfile uses the public AWS Lambda Python base image hosted on Amazon ECR Public.

    •     FROM public.ecr.aws/lambda/python:3.7.2024.04.17.17
          COPY requirements.txt ${LAMBDA_TASK_ROOT}
          RUN pip install -r requirements.txt
          COPY lambda-function.py ${LAMBDA_TASK_ROOT}
          CMD [ "lambda-function.handler" ]
      

      Build the image by running the command below, replacing ACCOUNT-NUMBER with your AWS account number:

    •     docker build -t ACCOUNT-NUMBER.dkr.ecr.us-east-1.amazonaws.com/pdf2word .
      
    • Tag the image:

    •     docker tag ACCOUNT-NUMBER.dkr.ecr.us-east-1.amazonaws.com/pdf2word ACCOUNT-NUMBER.dkr.ecr.us-east-1.amazonaws.com/pdf2word:latest
      
    • Log in to ECR by running:

    •     aws ecr get-login-password --region us-east-1 | docker login --username AWS --password-stdin ACCOUNT-NUMBER.dkr.ecr.us-east-1.amazonaws.com
      

      Push the image to ECR:

    •     docker push ACCOUNT-NUMBER.dkr.ecr.us-east-1.amazonaws.com/pdf2word:latest
      

      Now the image is pushed to the ECR repo and ready to be used in a Lambda function.
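
      If you want to confirm the push succeeded without opening the console, a quick boto3 check works too (a sketch, using the region and repository name from above):

          import boto3

          ecr = boto3.client('ecr', region_name='us-east-1')
          for image in ecr.describe_images(repositoryName='pdf2word')['imageDetails']:
              print(image.get('imageTags'), image['imagePushedAt'])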

    2- Create the Lambda function:

    • Create a new Lambda function, select Container image as the source, give it a name, then click Browse images and pick the image you pushed; a scripted alternative is sketched below.
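
      The console flow also creates an execution role for you automatically. If you prefer to script this step, you need an existing role; below is a minimal boto3 sketch in which the function name, role ARN, and image URI are placeholders to replace:

          import boto3

          lambda_client = boto3.client('lambda', region_name='us-east-1')

          response = lambda_client.create_function(
              FunctionName='pdf2wordConverter',
              PackageType='Image',
              Code={'ImageUri': 'ACCOUNT-NUMBER.dkr.ecr.us-east-1.amazonaws.com/pdf2word:latest'},
              Role='arn:aws:iam::ACCOUNT-NUMBER:role/pdf2wordConverter-role',
              Timeout=120,      # PDF conversion can easily exceed the 3-second default
              MemorySize=1024,  # pdf2docx needs headroom on larger documents
          )
          print(response['FunctionArn'])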

    • We need to grant the Lambda function permission to write to S3 and allow the destination bucket to accept those writes; both are done in the next steps.

    • Browse to the pdf2wordConverter-role that Lambda created automatically, go to Permissions, and click Add inline policy.

    • Select S3 as the service.

    • Browse to Write permissions and select PutObject.

    • Under Resources, add the ARN of your destination S3 bucket, allowing all objects in it (the object-level ARN, ending in /*), since PutObject applies to objects.

    • Give the policy a name and save it; the Lambda function now has write permission to your destination S3 bucket. (A scripted equivalent is sketched below.)
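
      The same inline policy can be attached with boto3; in this sketch the role name and policy name are assumptions, and the bucket placeholder must be replaced:

          import json
          import boto3

          iam = boto3.client('iam')

          policy = {
              "Version": "2012-10-17",
              "Statement": [{
                  "Effect": "Allow",
                  "Action": "s3:PutObject",
                  # Object-level ARN: PutObject acts on objects, hence the /*
                  "Resource": "arn:aws:s3:::<S3-DESTINATION-BUCKET-NAME>/*"
              }]
          }

          iam.put_role_policy(
              RoleName='pdf2wordConverter-role',
              PolicyName='AllowPutToDestinationBucket',
              PolicyDocument=json.dumps(policy),
          )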

    • On the destination bucket we also need a bucket policy that allows the Lambda role to put objects into it. Browse to the bucket, open the Permissions tab, click Edit under Bucket policy, and paste in the following, replacing LAMBDA-ROLE-ARN and the bucket name with your own values:

    •     {
              "Version": "2012-10-17",
              "Id": "Policy1714654891179",
              "Statement": [
                  {
                      "Sid": "Stmt1714654883800",
                      "Effect": "Allow",
                      "Principal": {
                          "AWS": [
                              "LAMBDA-ROLE-ARN"
                          ]
                      },
                      "Action": "s3:*",
                      "Resource": "arn:aws:s3:::pdf2word-destination"
                  }
              ]
          }
      
    • Now we need to configure the Lambda function to be triggered by upload events in the source bucket.

    • Browse to the Lambda function and click Add trigger.

    • Select S3 as the trigger, choose the source bucket with All object create events, and confirm the trigger. (A boto3 equivalent is sketched below.)
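
      For reference, the same trigger can be wired up with boto3: first grant S3 permission to invoke the function, then register the bucket notification. The function and bucket names below are placeholders:

          import boto3

          lambda_client = boto3.client('lambda')
          s3 = boto3.client('s3')

          function_arn = lambda_client.get_function(
              FunctionName='pdf2wordConverter')['Configuration']['FunctionArn']

          # Allow the source bucket to invoke the function
          lambda_client.add_permission(
              FunctionName='pdf2wordConverter',
              StatementId='AllowS3Invoke',
              Action='lambda:InvokeFunction',
              Principal='s3.amazonaws.com',
              SourceArn='arn:aws:s3:::<S3-SOURCE-BUCKET-NAME>',
          )

          # Fire the function on every object-created event
          s3.put_bucket_notification_configuration(
              Bucket='<S3-SOURCE-BUCKET-NAME>',
              NotificationConfiguration={
                  'LambdaFunctionConfigurations': [{
                      'LambdaFunctionArn': function_arn,
                      'Events': ['s3:ObjectCreated:*'],
                  }]
              },
          )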

    • Now our backend is ready and we can create the frontend from the Serverless Application Repository.

    3- Set up the frontend:

    • According to its author, this serverless application acts as a frontend for uploading files to S3. Behind the scenes it deploys another Lambda function that serves the UI, plus an API Gateway that handles user requests and proxies the uploaded file to that Lambda, which puts it into the selected S3 bucket. Deploy it from the Serverless Application Repository, pointing it at your source bucket.

    • Once the deployment completes, browse to the provided API endpoint and test the application by dragging and dropping a PDF file, then check the source and destination buckets for the uploaded and converted files. (A scripted test is sketched below.)
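
      You can also exercise the pipeline without the frontend by uploading a PDF straight to the source bucket and polling the destination bucket; in this sketch sample.pdf and the bucket names are illustrative:

          import time
          import boto3

          s3 = boto3.client('s3')

          # Upload a test PDF directly, bypassing the frontend
          s3.upload_file('sample.pdf', '<S3-SOURCE-BUCKET-NAME>', 'sample.pdf')

          # Give the function a moment to run, then look for the converted file
          time.sleep(30)
          response = s3.list_objects_v2(Bucket='<S3-DESTINATION-BUCKET-NAME>', Prefix='sample')
          for obj in response.get('Contents', []):
              print(obj['Key'])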

    4- Clean up:

    • Empty and delete the source and destination S3 buckets.

    • Delete the Docker image (or the whole pdf2word repository) from ECR.

    • Delete the Lambda function.

    • The uploader serverless app can be deleted from the Applications panel in Lambda, or by deleting its CloudFormation stack. (A boto3 sketch covering the first three steps follows.)
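
      Most of the cleanup can be scripted with boto3; this sketch covers the buckets, the image repository, and the function, using the names from this tutorial (adjust to yours):

          import boto3

          s3 = boto3.resource('s3')
          for name in ['<S3-SOURCE-BUCKET-NAME>', '<S3-DESTINATION-BUCKET-NAME>']:
              bucket = s3.Bucket(name)
              bucket.objects.all().delete()  # buckets must be empty before deletion
              # (for versioned buckets, use bucket.object_versions.all().delete())
              bucket.delete()

          boto3.client('lambda').delete_function(FunctionName='pdf2wordConverter')
          boto3.client('ecr').delete_repository(repositoryName='pdf2word', force=True)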

I hope you enjoyed going through this tutorial as much as I did creating it :)