How to find the oldest file in an S3 bucket

This Python example searches for the oldest file in a bucket and prints its key and last-modified date.

Make sure boto3 is installed first (pip install boto3).

import boto3

# Specify the bucket name
bucket_name = 'my-bucket'

s3 = boto3.resource('s3')

bucket = s3.Bucket(bucket_name)

# Initialize oldest_file and oldest_date to None
oldest_file = None
oldest_date = None

# Iterate through all files in the bucket
for obj in bucket.objects.all():
    # If oldest_file is None, or this file was modified before the oldest file
    # then update oldest_file and oldest_date to this file's name and last_modified date
    if oldest_file is None or obj.last_modified < oldest_date:
        oldest_file = obj.key
        oldest_date = obj.last_modified

# Print the oldest file's name and its last_modified date
if oldest_file is not None:
    print('The oldest file is {0} and was last modified on {1}'.format(oldest_file, oldest_date))
else:
    print('No files in bucket')

Configuring JetBrains Rider to use AWS Lambda Test Tool

Developing serverless applications using AWS Lambda can be an exciting journey into the world of event-driven programming. However, testing and debugging these applications can be a slightly more intricate process than usual. Fortunately, AWS provides a valuable tool known as the 'AWS Lambda Test Tool' that simplifies local testing and debugging of your Lambda functions.

In this blog post, we will walk you through configuring JetBrains Rider IDE to work with the AWS Lambda Test Tool. While this guide focuses on one specific method of setup, it's important to note that there are multiple ways to achieve local testing of AWS Lambda functions. Rider itself has excellent Lambda tooling capabilities that work with Docker, providing a more integrated and managed environment for your development needs.

We'll be using the latest version of Rider (2023.1.2) for this guide.

Prerequisites

To follow this guide, you will need:

  1. JetBrains Rider installed on your machine. If you don't have it installed, you can download it from the JetBrains official website.
  2. The AWS Toolkit for Rider, which is a plugin that integrates AWS resources into your Rider development workflow.
  3. The .NET Core SDK.
  4. AWS Lambda Test Tool.

Steps to Configure

Step 1: Install AWS Lambda Test Tool

The AWS Lambda Test Tool is a .NET Core Global Tool. You can install it using the following command:

dotnet tool install -g Amazon.Lambda.TestTool-6.0

Step 2: Configure Rider

Once the AWS Lambda Test Tool is installed, we need to configure Rider to use it. This involves modifying the launch settings, which can be found within your project folder, typically at the following path:

YOUR_PROJECT_NAME/Properties/launchSettings.json

Open the launchSettings.json file in Rider, and replace the existing content with the following JSON:

{
  "profiles": {
    "Mock Lambda Test Tool Rider": {
      "commandName": "Executable",
      "commandLineArgs": "<YOUR_PATH_WILL_VARY>\\Amazon.Lambda.TestTool.BlazorTester.dll --port 5050",
      "workingDirectory": "$(ProjectDir)",
      "executablePath": "dotnet"
    }
  }
}

Please replace <YOUR_PATH_WILL_VARY> with the actual path to the Amazon.Lambda.TestTool.BlazorTester.dll on your system. Save your changes and close the file.
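Not sure where the DLL lives? .NET global tools are unpacked under the .store folder of the global tools directory, so on Windows the path usually has a shape like the following (user name and tool version will differ; treat this as a hint, not gospel):

C:\Users\<user>\.dotnet\tools\.store\amazon.lambda.testtool-6.0\<version>\amazon.lambda.testtool-6.0\<version>\tools\net6.0\any\Amazon.Lambda.TestTool.BlazorTester.dll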

Step 3: Test Your Setup

With all configurations set, it's time to test your setup. Go back to Rider and from the Run menu, select Edit Configurations... Here, you should be able to see the Mock Lambda Test Tool Rider profile you just added.

Select this profile and then click on the 'Run' button. This should launch the AWS Lambda Test Tool, and a new browser window should open at http://localhost:5050 where you can interact with your Lambda function locally.

Alternate Method: Using Docker with Rider

As previously mentioned, Rider has built-in support for Docker, which can be leveraged to create a more controlled environment for testing your AWS Lambda functions. This approach would involve using Lambda-compatible Docker images, such as AWS's official Lambda base images, to replicate the live AWS Lambda environment and debug your functions.

This approach offers several benefits, such as:

  1. Running your function in an environment that closely mimics the live AWS Lambda environment.
  2. Simulating event sources like Amazon S3, Amazon DynamoDB, and others.
  3. Easier management of dependencies and environment variables.

To use this approach, you'll need Docker installed on your machine, and you may also need to adjust your Rider settings to enable Docker support.

Conclusion

The AWS Lambda Test Tool greatly facilitates the local testing of serverless applications. Coupled with the powerful features of JetBrains Rider, integrating this testing process into your development workflow becomes a breeze. While we have covered one particular setup method in this guide, it's important to remember that there are multiple ways to achieve local testing of AWS Lambda functions, including leveraging Docker with Rider's built-in support. Happy coding!

Getting Started with AWS Glue: A Comprehensive Guide


Introduction

AWS Glue is a fully managed, serverless data integration service offered by Amazon Web Services (AWS) that simplifies the process of extracting, transforming, and loading (ETL) data for analytics purposes. With its scalable, pay-as-you-go model, and a wide range of built-in features, AWS Glue has become a popular choice for data engineers and analysts to streamline their data workflows. In this blog post, we'll walk you through the process of getting started with AWS Glue, from setting up the necessary components to running your first ETL job.

Understanding AWS Glue Components

Before diving into AWS Glue, it's essential to understand its core components:

a. AWS Glue Data Catalog - A central metadata repository that stores information about your data sources, transformations, and targets. The Data Catalog helps manage and discover data assets across various data stores.

b. AWS Glue Crawlers - Automated programs that connect to your data source, extract metadata, and store it in the Data Catalog.

c. AWS Glue ETL Jobs - Scripts that read data from a source, apply transformations, and write the output to a target. These jobs are written in either Python or Scala and run on AWS Glue's distributed, serverless Apache Spark environment.

d. AWS Glue Triggers - Event-driven mechanisms that can start, stop, or chain ETL jobs based on a schedule or the completion of another job.

Setting Up AWS Glue

To get started with AWS Glue, you'll need to perform the following steps:

a. Sign in to your AWS Management Console and navigate to the AWS Glue service.

b. Set up an AWS Identity and Access Management (IAM) role for AWS Glue. This role defines the permissions required to access the necessary resources, such as data stores and Amazon S3 buckets.

c. Create an Amazon S3 bucket to store your data, scripts, and output files. Make sure to configure appropriate access permissions.

Creating and Running a Crawler

A crawler connects to your data source, extracts metadata, and creates table definitions in the Data Catalog. To create a crawler:

a. In the AWS Glue Console, navigate to Crawlers and click "Add Crawler."

b. Provide a name, description, and choose the IAM role created earlier.

c. Configure the data store and connection settings, such as the data source type (e.g., S3, JDBC), path or connection URL, and any necessary authentication information.

d. Choose or create a database in the Data Catalog to store the table definitions.

e. Configure a schedule for the crawler to run (e.g., on-demand, hourly, daily).

f. Review the configuration and create the crawler. You can now run the crawler to populate the Data Catalog with table definitions.
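If you'd rather script this step than click through the console, the AWS SDK for .NET exposes the same operation. A minimal sketch, assuming the AWSSDK.Glue NuGet package and a made-up crawler name:

using Amazon.Glue;
using Amazon.Glue.Model;

var glue = new AmazonGlueClient();

// Ask Glue to run the crawler; Glue executes it asynchronously, populating
// the Data Catalog with table definitions as it goes.
await glue.StartCrawlerAsync(new StartCrawlerRequest { Name = "my-first-crawler" });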

Creating and Running an ETL Job

Now that your Data Catalog is populated, you can create an ETL job to process the data:

a. In the AWS Glue Console, navigate to Jobs and click "Add Job."

b. Provide a name, description, and select the IAM role created earlier.

c. Choose a data source and target from the Data Catalog.

d. Select an ETL language (Python or Scala) and configure the job properties, such as the number of data processing units (DPUs) and timeout.

e. Write or generate an ETL script to define the transformations. AWS Glue can auto-generate a script based on the selected source and target, but you may need to customize it to meet your requirements.

f. Save and run the job. Monitor the progress and view the output in the specified target location.
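Jobs, too, can be started from code rather than the console. A sketch using the same AWS SDK for .NET client (the job name is illustrative):

using Amazon.Glue;
using Amazon.Glue.Model;

var glue = new AmazonGlueClient();

// Start a run of the job and keep the run id for later status checks.
var run = await glue.StartJobRunAsync(new StartJobRunRequest { JobName = "my-first-etl-job" });
Console.WriteLine($"Started job run: {run.JobRunId}");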


Automating ETL Workflows with Triggers

To automate your ETL workflows, you can use triggers to start, stop, or chain jobs based on specific conditions:

a. In the AWS Glue Console, navigate to Triggers and click "Add Trigger."

b. Provide a name, description, and select a trigger type (schedule, job event, or on-demand).

c. If you choose a schedule-based trigger, configure the schedule (e.g., cron expression). For a job event-based trigger, select the parent job(s) that should trigger the current job upon completion.

d. Add the job(s) that you want to trigger, and set any conditions (e.g., run only if the parent job succeeds).

e. Review the configuration and create the trigger.
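For completeness, triggers can also be created through the SDK. This is a sketch only; the names and cron expression are examples, not part of this guide's setup:

using Amazon.Glue;
using Amazon.Glue.Model;

var glue = new AmazonGlueClient();

// Create a schedule-based trigger that runs the job daily at 02:00 UTC.
await glue.CreateTriggerAsync(new CreateTriggerRequest
{
    Name = "nightly-etl-trigger",
    Type = TriggerType.SCHEDULED,
    Schedule = "cron(0 2 * * ? *)",
    Actions = new List<Amazon.Glue.Model.Action> { new() { JobName = "my-first-etl-job" } },
    StartOnCreation = true
});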


Monitoring and Troubleshooting

AWS Glue provides various monitoring and troubleshooting features to help you manage your ETL jobs:

a. Use AWS Glue Console's job history and logs to track job progress, view runtime statistics, and analyze errors.

b. Enable Amazon CloudWatch metrics and alarms for monitoring job performance and sending notifications based on specific thresholds.

c. Access the underlying Apache Spark logs and UI for a more in-depth analysis of your ETL job execution.


Conclusion

In this blog post, we've introduced you to AWS Glue, its core components, and the process of setting up and running ETL jobs. By leveraging AWS Glue's serverless, pay-as-you-go model, you can streamline your data integration workflows and focus on deriving valuable insights from your data. Don't hesitate to explore AWS Glue further and dive deeper into its advanced features to make the most out of this powerful data integration service.


Disclaimer: Generated by GPT but checked by a Brian.

HTTP Header Propagation in ASP.NET 6 with HttpClientFactory


Today I'll show you how to add header propagation in ASP.NET.

“Header what?” I hear you say.


Essentially it’s a mechanism whereby, when you make HTTP requests via an HttpClient, you can automatically ‘forward’ headers that were issued to your endpoint.

For example, a HeaderX passed to an ASP.NET endpoint gets added to outgoing HTTP requests made via an HttpClient.


How

Manually:


If we were to do this manually, we would interrogate the incoming HTTP request, find the HeaderX, and add it to the outgoing HttpClient request headers.


Propagation:

Let’s get ASP.NET to do the heavy lifting.

1) Add the Microsoft.AspNetCore.HeaderPropagation package
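From the command line:

dotnet add package Microsoft.AspNetCore.HeaderPropagation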


2) Add Header propagation (in Startup.cs or program.cs (minimal api))
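The original screenshot is not reproduced here; in a minimal API’s Program.cs the registration looks like this (Header1 and Header2 being the headers this post configures):

builder.Services.AddHeaderPropagation(o =>
{
    o.Headers.Add("Header1");
    o.Headers.Add("Header2");
});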

3) Use Header Propagation (in Startup.cs or program.cs (minimal api))
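Again sans screenshot, this is the middleware registration:

app.UseHeaderPropagation();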


4) Create a service that takes HttpClient as a constructor arg
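A sketch of such a service, using the IAwesomeService/AwesomeService names from the next step (the method and URL are placeholders of mine):

public interface IAwesomeService
{
    Task<string> GetThingAsync();
}

public class AwesomeService : IAwesomeService
{
    private readonly HttpClient _client;

    public AwesomeService(HttpClient client) => _client = client;

    // Propagated headers are attached to this outgoing request automatically.
    public Task<string> GetThingAsync() => _client.GetStringAsync("https://downstream.example/thing");
}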

5) Add this AwesomeService to your DI config and set the delegated headers

builder.Services.AddTransient<IAwesomeService, AwesomeService>();
builder.Services.AddHttpClient<IAwesomeService, AwesomeService>(o => o.Timeout = TimeSpan.FromMinutes(1))
.AddHeaderPropagation(o => o.Headers.Add("Header1"));

Note: this step is only needed with HttpClientFactory. Notice that for IAwesomeService only “Header1” will be propagated, even though we configured both Header1 and Header2 in general.

6) Testing
One way of testing this is to use Fiddler.
a) Enable the proxy to be seen by .NET Core
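One way to do that, and this is my assumption rather than the post’s original screenshotted approach, is to point HttpClient at Fiddler’s default proxy endpoint (127.0.0.1:8888) at startup:

using System.Net;

// Route outgoing HttpClient traffic through Fiddler.
HttpClient.DefaultProxy = new WebProxy("http://127.0.0.1:8888");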


b) Open Fiddler
I find it’s easier to filter on certain hosts


c) Enter the composer and call (execute) your service with some headers

d) Inspect the captured request and confirm that Header1 was propagated


Enjoy!

Parallel Batch Request Handling

Picture this:

You find yourself with a big list of resource identifiers

You want to fetch information for each id via an HTTP request

You want to batch the resource identifiers rather than making individual HTTP calls

You want to make multiple parallel requests for these batches

You don’t want to manage multi-threaded access to a shared response collection

You have come to the right place!

There are multiple approaches I can think of, but here’s one that may fit your use case:

Get Method:

The code in the method itself is not that important!
What you should take from it is that it’s an ordinary async/await method that takes a list of ids and returns the results for those ids.

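The original post showed this method as an image; here’s a sketch of the shape it describes, with the type, route, and names being my own stand-ins:

using System.Net.Http.Json;

public record Item(int Id, string Name);

public class ItemClient
{
    private readonly HttpClient _httpClient;

    public ItemClient(HttpClient httpClient) => _httpClient = httpClient;

    // Fetches one batch of ids with a single HTTP request.
    public async Task<Item[]> GetByIdsAsync(int[] ids)
    {
        var url = $"https://api.example.com/items?ids={string.Join(',', ids)}";
        var response = await _httpClient.GetAsync(url);
        response.EnsureSuccessStatusCode();
        return await response.Content.ReadFromJsonAsync<Item[]>() ?? Array.Empty<Item>();
    }
}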
To Parallelism and beyond

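The parallelising code was also an image in the original; here’s a sketch of the pattern the next paragraphs describe, reusing the hypothetical GetByIdsAsync above:

// Break the ids into batches, fire one request per batch in parallel,
// then flatten the per-batch results into a single collection.
public async Task<Item[]> GetAllAsync(int[] ids, int batchSize)
{
    var tasks = ids.Chunk(batchSize).Select(batch => GetByIdsAsync(batch));
    var batchedResults = await Task.WhenAll(tasks);
    return batchedResults.SelectMany(items => items).ToArray();
}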
Let’s unravel what’s happening above.

Firstly, we are using .NET 6, where the framework gives us the Chunk method (gone are the days of writing this by hand, thankfully!).
Chunk takes the ids list and breaks it into a list of smaller lists, each of size ‘batchSize’ or smaller.

e.g.

If we chunked [1,2,3,4,5,6,7,8,9,10,11] with a batch size of 2, then we’d end up with

[[1,2],[3,4],[5,6],[7,8],[9,10],[11]]

Secondly, we pass these smaller arrays to the GetByIdsAsync call using a LINQ Select expression.

We await the results of all these selected tasks via Task.WhenAll.

Lastly, to combine all the individual batched responses, we use the LINQ SelectMany expression.

I like this approach: it is concise, and it propagates exceptions without wrapping them in an AggregateException.

Await forever – deadlocked so easily

Async/await simplifies asynchronous code; use it everywhere and life becomes simple, right?
While that’s largely true, I’ve seen situations where developers either chose to, or had to, mix async and non-async code and got themselves into a world of problems.

One problem I’ve seen time and again is in Windows desktop applications, where a simple blocking call on a Task can deadlock the application entirely. Here I demonstrate the problem with a contrived Windows Forms example.

The application simply downloads some HTML asynchronously and displays it in a web browser.

The implementation of the async function is shown below:

(Let’s ignore the urge to make the click handler async; imagine the async call was in the form constructor if you must.)
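The original appeared as a screenshot; here is a minimal reconstruction of the pattern being described, with control and method names of my own choosing:

private async Task<string> GetHtmlAsync()
{
    using var client = new HttpClient();
    // After this await completes, the continuation is posted back to the UI
    // thread via the captured SynchronizationContext.
    return await client.GetStringAsync("https://example.com");
}

private void downloadButton_Click(object sender, EventArgs e)
{
    // Blocking the UI thread on .Result means the continuation queued by the
    // await above can never be pumped, so this call never returns.
    webBrowser1.DocumentText = GetHtmlAsync().Result;
}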


How can such a simple bit of code deadlock the whole application?

Well, the problem occurs because of how async/await state machines work.
I’m really going to simplify this explanation as I want people to grasp it (so grit your teeth if you already know the detail).

The async keyword is simply a compiler instruction that doesn’t do much, so let’s ignore that and focus on the await call.

The await calls an async function and then waits on a callback; when the callback occurs, the code resumes at the next step…

OK, so far so good, this is what we expect… simple, right? Wrong!
 

A quick recap of Windows UI threads and messages

Before we continue, let’s have a quick recap of Windows UI threads and message loops.
A message loop is an obligatory section of code in every program that uses a graphical user interface under Microsoft Windows. Windows programs that have a GUI are event-driven: Windows maintains an individual message queue for each thread that has created a window.

Now, as anyone working on a Windows application knows, any code that updates the UI must run on the GUI thread; try it from any other thread and you’ll be presented with a cross-thread exception.

Windows Forms code can call Invoke/BeginInvoke on a control to execute code back on the GUI thread; in WPF we would use a Dispatcher; in UWP/WinRT it’s something else again.

Another approach is to use the SynchronizationContext construct, which knows whether it should Invoke, Dispatch, or do something else on our behalf.

Back to await

The callback I mentioned above is smart in that it tries to use the existing synchronization context if one exists, so when that await finally returns we’re back on the GUI thread and can update the UI without those pesky errors.
In our calling code above, however, we never left the GUI thread, as we were blocked on the Result of the task.

The crux of the problem is that the await callback puts a message on the Windows message queue to tell it to continue, but the message loop is blocked in that call to .Result on the task, so we’re well and truly deadlocked.

Solutions

I’ll avoid just telling you to embrace async/await everywhere and instead offer some alternative solutions…

1) ConfigureAwait(false)


This works because it tells the awaited task not to continue on the current synchronization context (which, let’s remember, is the GUI thread); another context is used to complete the async/await state-machine callback, allowing the task to complete.
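Applied to the hypothetical GetHtmlAsync from earlier, the fix is a small change:

private async Task<string> GetHtmlAsync()
{
    using var client = new HttpClient();
    // ConfigureAwait(false): don't resume on the captured (GUI) context; the
    // continuation runs on a thread-pool thread, so the task can complete
    // even while the UI thread is blocked in .Result.
    return await client.GetStringAsync("https://example.com").ConfigureAwait(false);
}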

2) Run on another synchronization context you create

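The screenshot for this option isn’t available; a closely related workaround is to hop onto the thread pool, where no UI synchronization context is captured (strictly a context-free variant rather than a context you create yourself):

// GetHtmlAsync's awaits now resume on thread-pool threads rather than the
// blocked UI thread, so the outer .Result can complete.
webBrowser1.DocumentText = Task.Run(() => GetHtmlAsync()).Result;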
3) Use Async/Await everywhere
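For the contrived example above, that just means making the click handler async after all:

private async void downloadButton_Click(object sender, EventArgs e)
{
    // async void is acceptable for event handlers; the await yields the UI
    // thread while the download runs, so there is nothing left to deadlock on.
    webBrowser1.DocumentText = await GetHtmlAsync();
}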

Summary

This is a very dumbed-down explanation of how you might encounter a deadlock with async/await.

If it happens, don’t panic: it is easily fixed once you know what’s happening. Best practice is almost always not to work around the problem, but to use async/await throughout.

The standard documentation is great, and there are lots of really good articles floating about, e.g. https://devblogs.microsoft.com/dotnet/configureawait-faq/