Serverless PDF to DOCX Converter

In this article we will create a serverless PDF to DOCX converter using the following tools:

  • Lambda

  • Docker

  • Python

  • S3

  • Serverless Application Repository

The app will upload PDF files to a source S3 bucket using an existing application available on the Serverless Application Repository. The upload event will trigger a Python Lambda function, packaged as a Docker image, which converts the file to DOCX format and saves it to a destination S3 bucket.
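
For orientation, the event notification the Lambda function receives looks roughly like this. This is an abridged sketch showing only the fields our code will read; the bucket and key values are illustrative:

      # Abridged S3 "object created" event (illustrative values)
      event = {
          "Records": [
              {
                  "s3": {
                      "bucket": {"name": "pdf2word-source"},
                      "object": {"key": "report.pdf"}
                  }
              }
          ]
      }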

Prerequisites:

  • Docker - See Link for guidance.

  • Python - See Link for guidance.

  • AWS CLI.

  • VS Code - Download Link.

  • Two S3 buckets, to act as a source and destination for the documents.

    1- Prepare the Docker image:

    • Create a working folder, create a new file named lambda-function.py, and paste in the code below:

    •       import json
            import os
            import urllib.parse

            import boto3
            from pdf2docx import Converter

            s3 = boto3.client('s3')

            def handler(event, context):
                bucket_name = event['Records'][0]['s3']['bucket']['name']
                # Object keys arrive URL-encoded in S3 event notifications
                file_key = urllib.parse.unquote_plus(event['Records'][0]['s3']['object']['key'])

                # Convert the PDF file to DOCX
                convert_pdf_to_docx(bucket_name, file_key)

                return {
                    'statusCode': 200,
                    'body': json.dumps('File conversion successful!')
                }

            def convert_pdf_to_docx(source_bucket, source_key):
                # Download the PDF file from the source bucket
                tmp_pdf_file = '/tmp/input.pdf'
                s3.download_file(source_bucket, source_key, tmp_pdf_file)

                # Work out where the converted file should go
                destination_bucket_name = '<S3-DESTINATION-BUCKET-NAME>'
                destination_object_key = source_key.replace('.pdf', '.docx')

                # Convert the PDF file to DOCX
                cv = Converter(tmp_pdf_file)
                cv.convert('/tmp/output.docx')
                cv.close()

                # Upload the DOCX file to the destination bucket
                s3.upload_file('/tmp/output.docx', destination_bucket_name, destination_object_key)

                # Clean up the temporary files
                os.remove(tmp_pdf_file)
                os.remove('/tmp/output.docx')

      Make sure to update the destination bucket name placeholder <S3-DESTINATION-BUCKET-NAME>.
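
      Before building the image, you can smoke-test the handler locally with a fabricated event. The sketch below assumes valid AWS credentials, that both buckets exist, and that an object named sample.pdf is already in the source bucket; all names are illustrative:

          # local_test.py - hypothetical local smoke test, run from the working folder
          import importlib

          # The module name contains a hyphen, so load it via importlib
          mod = importlib.import_module('lambda-function')

          fake_event = {
              'Records': [
                  {'s3': {'bucket': {'name': '<S3-SOURCE-BUCKET-NAME>'},
                          'object': {'key': 'sample.pdf'}}}
              ]
          }

          print(mod.handler(fake_event, None))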

    • Install the Python dependencies:

    •       pip install boto3
            pip install pdf2docx
      
    • Run the following commands to install pipreqs, which exports the Python dependencies to requirements.txt; Docker will use this file to install them inside the image.

    •     pip install pipreqs
          pipreqs .
      

      To let the build resolve compatible versions automatically, you can remove the pinned version numbers from requirements.txt, leaving just the package names (boto3 and pdf2docx).

    • Make sure the AWS CLI is configured, then run the command below to create a repository in ECR to host the image:

    •     aws ecr create-repository --repository-name pdf2word --region us-east-1
      
    • Now prepare the image by creating a Dockerfile in the same directory as requirements.txt and paste in the build instructions below. This Dockerfile uses the public AWS Lambda Python base image hosted on Amazon ECR Public.

    •     FROM public.ecr.aws/lambda/python:3.7.2024.04.17.17
          COPY requirements.txt ${LAMBDA_TASK_ROOT}
          RUN pip install -r requirements.txt
          COPY lambda-function.py ${LAMBDA_TASK_ROOT}
          CMD [ "lambda-function.handler" ]
      

      Build the image by running the command below, replacing ACCOUNT-NUMBER with your AWS account number:

    •     docker build -t ACCOUNT-NUMBER.dkr.ecr.us-east-1.amazonaws.com/pdf2word .
      
    • Tag the image:

    •     docker tag ACCOUNT-NUMBER.dkr.ecr.us-east-1.amazonaws.com/pdf2word ACCOUNT-NUMBER.dkr.ecr.us-east-1.amazonaws.com/pdf2word:latest
      
    • Log in to ECR by running:

    •     aws ecr get-login-password --region us-east-1 | docker login --username AWS --password-stdin ACCOUNT-NUMBER.dkr.ecr.us-east-1.amazonaws.com
      

      Push the image to ECR:

    •     docker push ACCOUNT-NUMBER.dkr.ecr.us-east-1.amazonaws.com/pdf2word:latest
      

      Now the image is pushed to the ECR repo and ready to be used in a Lambda function.
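
      If you want to confirm the push succeeded without opening the console, a quick boto3 check works too (a sketch, using the region and repository name from above):

          import boto3

          ecr = boto3.client('ecr', region_name='us-east-1')
          for image in ecr.describe_images(repositoryName='pdf2word')['imageDetails']:
              print(image.get('imageTags'), image['imagePushedAt'])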

    2- Create the Lambda function:

    • Create a new Lambda function, select Container image as the source, give it a name, then click Browse images and pick the image you pushed; a scripted alternative is sketched below.
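
      The console flow also creates an execution role for you automatically. If you prefer to script this step, you need an existing role; below is a minimal boto3 sketch in which the function name, role ARN, and image URI are placeholders to replace:

          import boto3

          lambda_client = boto3.client('lambda', region_name='us-east-1')

          response = lambda_client.create_function(
              FunctionName='pdf2wordConverter',
              PackageType='Image',
              Code={'ImageUri': 'ACCOUNT-NUMBER.dkr.ecr.us-east-1.amazonaws.com/pdf2word:latest'},
              Role='arn:aws:iam::ACCOUNT-NUMBER:role/pdf2wordConverter-role',
              Timeout=120,      # PDF conversion can easily exceed the 3-second default
              MemorySize=1024,  # pdf2docx needs headroom on larger documents
          )
          print(response['FunctionArn'])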

    • We need to grant the Lambda function permission to write to S3 and allow the destination bucket to accept those writes; both are done in the next steps.

    • Browse to the pdf2wordConverter-role that Lambda created automatically, go to Permissions, and click Add inline policy.

    • Select S3 as the service.

    • Browse to Write permissions and select PutObject.

    • Under Resources, add the ARN of your destination S3 bucket, allowing all objects in it (the object-level ARN, ending in /*), since PutObject applies to objects.

    • Give the policy a name and save it; the Lambda function now has write permission to your destination S3 bucket. (A scripted equivalent is sketched below.)
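
      The same inline policy can be attached with boto3; in this sketch the role name and policy name are assumptions, and the bucket placeholder must be replaced:

          import json
          import boto3

          iam = boto3.client('iam')

          policy = {
              "Version": "2012-10-17",
              "Statement": [{
                  "Effect": "Allow",
                  "Action": "s3:PutObject",
                  # Object-level ARN: PutObject acts on objects, hence the /*
                  "Resource": "arn:aws:s3:::<S3-DESTINATION-BUCKET-NAME>/*"
              }]
          }

          iam.put_role_policy(
              RoleName='pdf2wordConverter-role',
              PolicyName='AllowPutToDestinationBucket',
              PolicyDocument=json.dumps(policy),
          )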

    • On the destination bucket we also need a bucket policy that allows the Lambda role to put objects into it. Browse to the bucket, open the Permissions tab, click Edit under Bucket policy, and paste in the following, replacing LAMBDA-ROLE-ARN and the bucket name with your own values:

    •     {
              "Version": "2012-10-17",
              "Id": "Policy1714654891179",
              "Statement": [
                  {
                      "Sid": "Stmt1714654883800",
                      "Effect": "Allow",
                      "Principal": {
                          "AWS": [
                              "LAMBDA-ROLE-ARN"
                          ]
                      },
                      "Action": "s3:*",
                      "Resource": "arn:aws:s3:::pdf2word-destination"
                  }
              ]
          }
      
    • Now we need to configure the Lambda function to be triggered by upload events in the source bucket.

    • Browse to the Lambda function and click Add trigger.

    • Select S3 as the trigger, choose the source bucket with All object create events, and confirm the trigger. (A boto3 equivalent is sketched below.)
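
      For reference, the same trigger can be wired up with boto3: first grant S3 permission to invoke the function, then register the bucket notification. The function and bucket names below are placeholders:

          import boto3

          lambda_client = boto3.client('lambda')
          s3 = boto3.client('s3')

          function_arn = lambda_client.get_function(
              FunctionName='pdf2wordConverter')['Configuration']['FunctionArn']

          # Allow the source bucket to invoke the function
          lambda_client.add_permission(
              FunctionName='pdf2wordConverter',
              StatementId='AllowS3Invoke',
              Action='lambda:InvokeFunction',
              Principal='s3.amazonaws.com',
              SourceArn='arn:aws:s3:::<S3-SOURCE-BUCKET-NAME>',
          )

          # Fire the function on every object-created event
          s3.put_bucket_notification_configuration(
              Bucket='<S3-SOURCE-BUCKET-NAME>',
              NotificationConfiguration={
                  'LambdaFunctionConfigurations': [{
                      'LambdaFunctionArn': function_arn,
                      'Events': ['s3:ObjectCreated:*'],
                  }]
              },
          )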

    • Now our backend is ready and we can create the frontend from the Serverless Application Repository.

    3- Set up the frontend:

    • According to its author, this serverless application acts as a frontend for uploading files to S3. Behind the scenes it deploys another Lambda function that serves the UI, plus an API Gateway that handles user requests and proxies the uploaded file to that Lambda, which puts it into the selected S3 bucket. Deploy it from the Serverless Application Repository, pointing it at your source bucket.

    • Once the deployment completes, browse to the provided API endpoint and test the application by dragging and dropping a PDF file, then check the source and destination buckets for the uploaded and converted files. (A scripted test is sketched below.)
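
      You can also exercise the pipeline without the frontend by uploading a PDF straight to the source bucket and polling the destination bucket; in this sketch sample.pdf and the bucket names are illustrative:

          import time
          import boto3

          s3 = boto3.client('s3')

          # Upload a test PDF directly, bypassing the frontend
          s3.upload_file('sample.pdf', '<S3-SOURCE-BUCKET-NAME>', 'sample.pdf')

          # Give the function a moment to run, then look for the converted file
          time.sleep(30)
          response = s3.list_objects_v2(Bucket='<S3-DESTINATION-BUCKET-NAME>', Prefix='sample')
          for obj in response.get('Contents', []):
              print(obj['Key'])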

    4- Clean up:

    • Empty and delete the source and destination S3 buckets.

    • Delete the Docker image (or the whole pdf2word repository) from ECR.

    • Delete the Lambda function.

    • The uploader serverless app can be deleted from the Applications panel in Lambda, or by deleting its CloudFormation stack. (A boto3 sketch covering the first three steps follows.)
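
      Most of the cleanup can be scripted with boto3; this sketch covers the buckets, the image repository, and the function, using the names from this tutorial (adjust to yours):

          import boto3

          s3 = boto3.resource('s3')
          for name in ['<S3-SOURCE-BUCKET-NAME>', '<S3-DESTINATION-BUCKET-NAME>']:
              bucket = s3.Bucket(name)
              bucket.objects.all().delete()  # buckets must be empty before deletion
              # (for versioned buckets, use bucket.object_versions.all().delete())
              bucket.delete()

          boto3.client('lambda').delete_function(FunctionName='pdf2wordConverter')
          boto3.client('ecr').delete_repository(repositoryName='pdf2word', force=True)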

I hope you enjoyed going through this tutorial as much as I did creating it :)