Welcome to my latest blog post! 🎉 Today, we’re going to dive into an exciting project that transforms unstructured data into structured data. If you’ve ever been overwhelmed by the amount of information in a multi-page PDF and wished there was an easier way to analyze it, this post is for you! 📚➡️🔍
We’ll be exploring how to extract text from PDF files and store it as key-value pairs in a CSV file using AWS services. This process will allow us to easily manipulate and analyze the data. We’ll also get our hands dirty by setting up an S3 bucket, understanding the role of various Lambda functions, and even provisioning AWS resources using a CloudFormation script. 🚀
Project Introduction 🎉
This project is designed to extract text from multi-page PDF files and store the extracted data as key-value pairs in a CSV file. The primary functionality is to convert unstructured data (text in PDF files) into structured data (CSV files), which can be easily manipulated and analyzed.
Getting Started and Getting Hands Dirty 🚀
Part 1: Creating an S3 Bucket and Uploading Source Code 📦
To get started, you'll need to create an S3 bucket to upload the source code which will be used during the automated provisioning of the AWS resources. You can find the zip files to upload in the resources folder of my GitHub repository. These zip files mainly include source code for my frontend application and source code for all lambda functions used in the application.
Please refer to the official AWS documentation on Creating a Bucket for detailed steps.
Part 2: Describing Each Lambda Function and Providing Source Code 📝
Text Detection Lambda
The overall purpose of this code is to extract text from pdf uploaded to S3 bucket using AWS Textract, and then notify an SNS topic about the job status. The extracted text is stored in the same S3 bucket.
import os
import json
import boto3
from urllib.parse import unquote_plus
# Environment variables
OUTPUT_BUCKET_NAME = os.environ["OUTPUT_BUCKET_NAME"]
OUTPUT_S3_PREFIX = os.environ["OUTPUT_S3_PREFIX"]
SNS_TOPIC_ARN = os.environ["SNS_TOPIC_ARN"]
SNS_ROLE_ARN = os.environ["SNS_ROLE_ARN"]
def lambda_handler(event, context):
"""Main Lambda handler triggered by S3 event."""
try:
# Initialize AWS clients
textract = boto3.client("textract")
sns = boto3.client("sns")
# Check if event is not empty
if event:
# Extract bucket name and file name from the event
file_obj = event["Records"][0]
bucket_name = str(file_obj["s3"]["bucket"]["name"])
file_name = unquote_plus(str(file_obj["s3"]["object"]["key"]))
print(f"Bucket: {bucket_name} ::: Key: {file_name}")
# Start Textract job
textract_response = textract.start_document_text_detection(
DocumentLocation={"S3Object": {"Bucket": bucket_name, "Name": file_name}},
OutputConfig={"S3Bucket": OUTPUT_BUCKET_NAME, "S3Prefix": OUTPUT_S3_PREFIX},
)
# Check if Textract job was started successfully
if textract_response["ResponseMetadata"]["HTTPStatusCode"] == 200:
print("Textract job created successfully!")
# Publish a message to the SNS topic with the Textract response
sns.publish(
TopicArn=SNS_TOPIC_ARN,
Message=json.dumps(textract_response),
Subject='Textract Job Created'
)
return {"statusCode": 200, "body": json.dumps("Job created and SNS message sent successfully!")}
else:
return {"statusCode": 500, "body": json.dumps("Job creation failed!")}
except Exception as e:
# Handle the exception
error_message = f"An error occurred: {str(e)}"
print(error_message)
return {"statusCode": 500, "body": json.dumps(error_message)}
Once the Textract job is completed, this lambda sends SNS to another lambda which is CSV generator lambda.
CSV Generator Lambda
The overall purpose of this function is to process the response of a Textract job, prepare a CSV file with the extracted text, and upload the CSV file to the S3 bucket.
import os
import json
import boto3
from urllib.parse import unquote_plus
# Environment variables
OUTPUT_BUCKET_NAME = os.environ["OUTPUT_BUCKET_NAME"]
OUTPUT_S3_PREFIX = os.environ["OUTPUT_S3_PREFIX"]
SNS_TOPIC_ARN = os.environ["SNS_TOPIC_ARN"]
SNS_ROLE_ARN = os.environ["SNS_ROLE_ARN"]
def lambda_handler(event, context):
"""Main Lambda handler triggered by S3 event."""
try:
# Initialize AWS clients
textract = boto3.client("textract")
sns = boto3.client("sns")
# Check if event is not empty
if event:
# Extract bucket name and file name from the event
file_obj = event["Records"][0]
bucket_name = str(file_obj["s3"]["bucket"]["name"])
file_name = unquote_plus(str(file_obj["s3"]["object"]["key"]))
print(f"Bucket: {bucket_name} ::: Key: {file_name}")
# Start Textract job
textract_response = textract.start_document_text_detection(
DocumentLocation={"S3Object": {"Bucket": bucket_name, "Name": file_name}},
OutputConfig={"S3Bucket": OUTPUT_BUCKET_NAME, "S3Prefix": OUTPUT_S3_PREFIX},
)
# Check if Textract job was started successfully
if textract_response["ResponseMetadata"]["HTTPStatusCode"] == 200:
print("Textract job created successfully!")
# Publish a message to the SNS topic with the Textract response
sns.publish(
TopicArn=SNS_TOPIC_ARN,
Message=json.dumps(textract_response),
Subject='Textract Job Created'
)
return {"statusCode": 200, "body": json.dumps("Job created and SNS message sent successfully!")}
else:
return {"statusCode": 500, "body": json.dumps("Job creation failed!")}
except Exception as e:
# Handle the exception
error_message = f"An error occurred: {str(e)}"
print(error_message)
return {"statusCode": 500, "body": json.dumps(error_message)}
CSV Download Lambda
The overall purpose of this code is to provide pre-signed URLs for downloading CSV files from an S3 bucket.
import boto3
import os
import json
def lambda_handler(event, context):
s3 = boto3.client('s3')
bucket_name = os.environ['OUTPUT_BUCKET_NAME']
folder_name = 'csv/'
try:
response = s3.list_objects_v2(Bucket=bucket_name, Prefix=folder_name)
if 'Contents' in response:
files = [file['Key'] for file in response['Contents'] if file['Key'].endswith('.csv')]
if files:
urls = [s3.generate_presigned_url('get_object', Params={'Bucket': bucket_name, 'Key': file}) for file in files]
return {
'isBase64Encoded': False,
'statusCode': 200,
'headers': { 'Content-Type': 'application/json' },
'body': json.dumps(urls)
}
else:
return {
'isBase64Encoded': False,
'statusCode': 200,
'headers': { 'Content-Type': 'application/json' },
'body': json.dumps('No CSV files found.')
}
else:
return {
'isBase64Encoded': False,
'statusCode': 200,
'headers': { 'Content-Type': 'application/json' },
'body': json.dumps('Download not ready yet.')
}
except Exception as e:
return {
'isBase64Encoded': False,
'statusCode': 500,
'headers': { 'Content-Type': 'application/json' },
'body': json.dumps('Error: ' + str(e))
}
Part 3: Explaining the CloudFormation Script and Provisioning on AWS ☁️
In this part, I will explain my CloudFormation script. You can find the source code for the file below:
AWSTemplateFormatVersion: '2010-09-09'
Description: Asynchronous Text detection from multi-page pdf via Textract by Bhishman Desai
Parameters:
RoleArn:
Description: Existing IAM role ARN
Type: String
Default: arn:aws:iam::712922208151:role/LabRole
Resources:
# Compute
TextDetectionLambda:
Type: AWS::Lambda::Function
Properties:
Handler: lambda_function.lambda_handler
Runtime: python3.9
Role: !Ref RoleArn
Environment:
Variables:
OUTPUT_BUCKET_NAME: !Sub '${AWS::StackName}-${AWS::Region}-${AWS::AccountId}'
OUTPUT_S3_PREFIX: 'output'
SNS_TOPIC_ARN: !Ref TextExtractSNSTopic
SNS_ROLE_ARN: !Ref RoleArn
Code:
S3Bucket: term-end-code
S3Key: response_generator.zip
CSVGeneratorLambda:
Type: AWS::Lambda::Function
Properties:
Handler: lambda_function.lambda_handler
Runtime: python3.9
Timeout: 900
Role: !Ref RoleArn
Layers:
- !Ref PandasLayer
Environment:
Variables:
BUCKET_NAME: !Sub '${AWS::StackName}-${AWS::Region}-${AWS::AccountId}'
PREFIX: 'csv'
Code:
S3Bucket: term-end-code
S3Key: csv_generator.zip
CsvDownloadLambda:
Type: AWS::Lambda::Function
Properties:
Handler: lambda_function.lambda_handler
Runtime: python3.9
Role: !Ref RoleArn
Environment:
Variables:
OUTPUT_BUCKET_NAME: !Sub '${AWS::StackName}-${AWS::Region}-${AWS::AccountId}'
Code:
S3Bucket: term-end-code
S3Key: download_csv.zip
# Permissions
S3InvokeTextDetectionLambdaPermission:
Type: AWS::Lambda::Permission
DependsOn: TextDetectionLambda
Properties:
Action: lambda:InvokeFunction
FunctionName: !Ref TextDetectionLambda
Principal: s3.amazonaws.com
SourceArn: !Sub
- arn:aws:s3:::${TextExtractBucket}
- { TextExtractBucket: !Join [ '-', [ !Ref AWS::StackName, !Ref AWS::Region, !Ref AWS::AccountId ] ] }
SNSInvokeCSVGeneratorLambdaPermission:
Type: AWS::Lambda::Permission
DependsOn: CSVGeneratorLambda
Properties:
Action: lambda:InvokeFunction
FunctionName: !GetAtt CSVGeneratorLambda.Arn
Principal: sns.amazonaws.com
LambdaInvokePermission:
Type: AWS::Lambda::Permission
Properties:
FunctionName: !Ref CsvDownloadLambda
Action: lambda:InvokeFunction
Principal: apigateway.amazonaws.com
SourceArn:
Fn::Sub:
- arn:aws:execute-api:${AWS::Region}:${AWS::AccountId}:${apiId}/*
- apiId: !Ref CsvApiGateway
# Layers
PandasLayer:
Type: AWS::Lambda::LayerVersion
Properties:
CompatibleRuntimes:
- python3.9
Content:
S3Bucket: term-end-code
S3Key: pandas.zip
Description: Pandas layer
Frontend:
Type: AWS::EC2::Instance
DependsOn:
- UploadUrlParameter
- EC2SecurityGroup
Properties:
ImageId: ami-0c101f26f147fa7fd
InstanceType: t2.micro
KeyName: term-end-ass
IamInstanceProfile: !Ref FrontendProfile
SecurityGroupIds:
- !Ref EC2SecurityGroup
UserData:
Fn::Base64: !Sub |
#!/bin/bash -xe
exec > >(tee /var/log/user-data.log|logger -t user-data -s 2>/dev/console) 2>&1
yum update -y
yum install -y aws-cli
NEXT_PUBLIC_UPLOAD_URL=$(aws ssm get-parameter --name UploadUrl --query 'Parameter.Value' --output text)
echo "export NEXT_PUBLIC_UPLOAD_URL=$NEXT_PUBLIC_UPLOAD_URL" >> /etc/environment
source /etc/environment
NEXT_PUBLIC_DOWNLOAD_URL=$(aws ssm get-parameter --name DownloadUrl --query 'Parameter.Value' --output text)
echo "export NEXT_PUBLIC_DOWNLOAD_URL=$NEXT_PUBLIC_DOWNLOAD_URL" >> /etc/environment
source /etc/environment
aws s3 cp s3://term-end-code/frontend.zip /home/ec2-user/
yum install -y unzip
unzip /home/ec2-user/frontend.zip -d /home/ec2-user/
curl -o- https://raw.githubusercontent.com/nvm-sh/nvm/v0.39.5/install.sh | bash
export NVM_DIR="$HOME/.nvm"
[ -s "$NVM_DIR/nvm.sh" ] && \. "$NVM_DIR/nvm.sh"
nvm install node
npm install -g pm2
cd /home/ec2-user/frontend
npm install
npm run build
pm2 start npm --name "next-app" -- start
# Security Group
EC2SecurityGroup:
Type: AWS::EC2::SecurityGroup
Properties:
GroupDescription: 'My security group for EC2 frontend'
SecurityGroupIngress:
- IpProtocol: '-1'
CidrIp: '0.0.0.0/0'
# Profile
FrontendProfile:
Type: AWS::IAM::InstanceProfile
Properties:
Roles:
- "LabRole"
# Storage
TextExtractBucket:
Type: AWS::S3::Bucket
DependsOn: TextDetectionLambda
Properties:
BucketName: !Sub '${AWS::StackName}-${AWS::Region}-${AWS::AccountId}'
NotificationConfiguration:
LambdaConfigurations:
- Event: s3:ObjectCreated:*
Function: !GetAtt TextDetectionLambda.Arn
Filter:
S3Key:
Rules:
- Name: prefix
Value: input/
- Name: suffix
Value: .pdf
# Network
TextExtractApiGateway:
Type: AWS::ApiGateway::RestApi
Properties:
Name: TextExtractApiGateway
Description: API Gateway with binary support for Text Extract
BinaryMediaTypes:
- '*/*'
CsvApiGateway:
Type: AWS::ApiGateway::RestApi
Properties:
Name: CsvApiGateway
Description: API Gateway for triggering the CsvDownloadLambda function
# Resource
UploadResource:
Type: AWS::ApiGateway::Resource
Properties:
RestApiId: !Ref TextExtractApiGateway
ParentId: !GetAtt
- TextExtractApiGateway
- RootResourceId
PathPart: upload
FilenameResource:
Type: AWS::ApiGateway::Resource
Properties:
RestApiId: !Ref TextExtractApiGateway
ParentId: !Ref UploadResource
PathPart: '{filename}'
CsvResource:
Type: AWS::ApiGateway::Resource
Properties:
RestApiId: !Ref CsvApiGateway
ParentId: !GetAtt
- CsvApiGateway
- RootResourceId
PathPart: download
# Method
UploadMethod:
Type: AWS::ApiGateway::Method
Properties:
RestApiId: !Ref TextExtractApiGateway
ResourceId: !Ref FilenameResource
HttpMethod: PUT
AuthorizationType: NONE
ApiKeyRequired: false
RequestParameters:
method.request.header.Content-Type: false
method.request.path.filename: true
Integration:
Type: AWS
IntegrationHttpMethod: PUT
Uri:
Fn::Sub:
- arn:aws:apigateway:${AWS::Region}:s3:path/${BucketName}/input/{filename}
- BucketName: !Ref TextExtractBucket
PassthroughBehavior: WHEN_NO_TEMPLATES
Credentials: !Ref RoleArn
RequestParameters:
integration.request.path.filename: 'method.request.path.filename'
IntegrationResponses:
- StatusCode: 200
ResponseParameters:
method.response.header.Access-Control-Allow-Origin: "'*'"
MethodResponses:
- StatusCode: 200
ResponseParameters:
method.response.header.Access-Control-Allow-Origin: true
CsvGetMethod:
Type: AWS::ApiGateway::Method
Properties:
RestApiId: !Ref CsvApiGateway
ResourceId: !Ref CsvResource
HttpMethod: 'GET'
AuthorizationType: 'NONE'
Integration:
Type: 'AWS_PROXY'
IntegrationHttpMethod: 'POST'
Uri:
Fn::Sub:
- arn:aws:apigateway:${AWS::Region}:lambda:path/2015-03-31/functions/${lambdaArn}/invocations
- lambdaArn: !GetAtt CsvDownloadLambda.Arn
# Deployment
TextExtractApiDeployment:
Type: AWS::ApiGateway::Deployment
DependsOn: UploadMethod
Properties:
RestApiId: !Ref TextExtractApiGateway
Description: Deployment for the PDF Upload API
StageName: prod
CsvApiDeployment:
Type: AWS::ApiGateway::Deployment
DependsOn: CsvGetMethod
Properties:
RestApiId: !Ref CsvApiGateway
Description: Deployment for the CSV Download API
StageName: prod
# General
TextExtractSNSTopic:
Type: AWS::SNS::Topic
Properties:
TopicName: AmazonTextractTopic
Subscription:
- Protocol: lambda
Endpoint: !GetAtt CSVGeneratorLambda.Arn
# SSM Parameter
UploadUrlParameter:
Type: AWS::SSM::Parameter
DependsOn: TextExtractApiDeployment
Properties:
Name: UploadUrl
Type: String
Value: !Sub "https://${TextExtractApiGateway}.execute-api.${AWS::Region}.amazonaws.com/prod"
DownloadUrlParameter:
Type: AWS::SSM::Parameter
DependsOn: CsvApiDeployment
Properties:
Name: DownloadUrl
Type: String
Value: !Sub "https://${CsvApiGateway}.execute-api.${AWS::Region}.amazonaws.com/prod/download"
# Output
Outputs:
TextExtractBucketName:
Description: The name of the S3 bucket
Value: !Ref TextExtractBucket
TextDetectionLambdaFunctionArn:
Description: The ARN of the TextDetectionLambda function
Value: !GetAtt TextDetectionLambda.Arn
CSVGeneratorLambdaFunctionArn:
Description: The ARN of the CSVGeneratorLambda function
Value: !GetAtt CSVGeneratorLambda.Arn
TextExtractApiUrl:
Description: The URL of the TextExtract API
Value: !Sub "https://${TextExtractApiGateway}.execute-api.${AWS::Region}.amazonaws.com/prod/upload/{filename}"
Export:
Name: TextExtractApiUrl
InstanceId:
Description: The Instance ID
Value: !Ref Frontend
CsvApiUrl:
Description: URL for the CSV Download API
Value: !Sub "https://${CsvApiGateway}.execute-api.${AWS::Region}.amazonaws.com/prod/download"
You can watch the process step by step on how to run cloudformation script on AWS in my YouTube video.
That’s all for this project! I hope you found this guide helpful and are now able to extract text from PDF files and store it in CSV files using AWS services. Remember, the key to mastering any technology is practice. So, don’t hesitate to get your hands dirty and experiment with the code.
If you have any questions or run into any issues, feel free to reach out to me. I’ll do my best to help. Stay tuned for more exciting projects and keep exploring the endless possibilities of cloud computing with AWS. Happy coding! 💻🚀
Link to the Source Code.