Amazon EMR is a cloud big data platform for processing vast amounts of data using open source tools such as Apache Spark, Apache Hive, Apache HBase, Apache Flink, Apache Hudi, and Presto. EMR makes it easy to set up, operate, and scale our big data environments by automating time-consuming tasks like provisioning capacity and tuning clusters. According to AWS, EMR lets us run petabyte-scale analysis at less than half the cost of traditional on-premises solutions and over 3x faster than standard Apache Spark. We can run workloads on Amazon EC2 instances, on Amazon Elastic Kubernetes Service (EKS) clusters, or on-premises using EMR on AWS Outposts.
Easy to use
Analysts, data engineers, and data scientists can use EMR Notebooks to collaborate and interactively explore, process, and visualize data. We only need to specify the versions of the EMR applications and the type of compute we want to use. EMR takes care of provisioning, configuring, and tuning clusters so that we can focus on running analytics.
EMR pricing is simple and predictable: We pay a per-instance rate for every second used, with a one-minute minimum charge. We can launch a 10-node EMR cluster for as little as $0.15 per hour. We can save 50-80% on the cost of the instances by selecting Amazon EC2 Spot for transient workloads and Reserved Instances for long-running workloads.
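The billing rule above (a per-second charge with a one-minute minimum) can be sketched as a small calculation. This is only an illustration: the rate of $0.015 per node-hour is a hypothetical value derived from the $0.15 per hour figure for 10 nodes, the function names are made up for this example, and rounding partial seconds up is an assumption. Actual rates depend on instance type and region.

```python
import math


def billed_seconds(runtime_seconds: float, minimum_seconds: int = 60) -> int:
    """Return the seconds billed: whole seconds, never below the minimum."""
    # Assumption: partial seconds round up to the next whole second.
    return max(minimum_seconds, math.ceil(runtime_seconds))


def cluster_cost(runtime_seconds: float, node_count: int,
                 hourly_rate_per_node: float) -> float:
    """Estimate the cost in dollars of one cluster run."""
    seconds = billed_seconds(runtime_seconds)
    return seconds * node_count * hourly_rate_per_node / 3600


# Hypothetical rate: $0.15 per hour across 10 nodes is $0.015 per node-hour.
RATE = 0.15 / 10

if __name__ == "__main__":
    # A 20-second job is still billed as 60 seconds (the one-minute minimum).
    print(f"20-second job: ${cluster_cost(20, 10, RATE):.4f}")
    print(f"one-hour job:  ${cluster_cost(3600, 10, RATE):.4f}")
```

The minimum charge only matters for very short runs; anything longer than a minute is billed for the seconds it actually used.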
Unlike the rigid infrastructure of on-premises clusters, EMR decouples compute and storage, giving us the ability to scale each independently and take advantage of the tiered storage of Amazon S3. With EMR, we can provision one, hundreds, or thousands of compute instances or containers to process data at any scale. The number of instances can be increased or decreased automatically using Auto Scaling (which manages cluster sizes based on utilization) and we only pay for what we use.
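One way to set the bounds for automatic scaling is an EMR managed scaling policy. The sketch below assumes that managed scaling is the feature we want (EMR also supports custom automatic scaling policies) and that boto3 is installed and configured; the helper names and the cluster ID are placeholders for illustration.

```python
def managed_scaling_policy(min_units: int, max_units: int,
                           unit_type: str = "Instances") -> dict:
    """Build the ManagedScalingPolicy structure with basic validation."""
    if unit_type not in {"Instances", "InstanceFleetUnits", "VCPU"}:
        raise ValueError(f"unsupported unit type: {unit_type}")
    if not 1 <= min_units <= max_units:
        raise ValueError("need 1 <= min_units <= max_units")
    return {
        "ComputeLimits": {
            "UnitType": unit_type,
            "MinimumCapacityUnits": min_units,
            "MaximumCapacityUnits": max_units,
        }
    }


def apply_policy(cluster_id: str, policy: dict) -> None:
    """Attach the policy to a running cluster (requires AWS credentials)."""
    import boto3  # imported lazily so the builder works without boto3

    emr = boto3.client("emr")
    emr.put_managed_scaling_policy(ClusterId=cluster_id,
                                   ManagedScalingPolicy=policy)


if __name__ == "__main__":
    # Let EMR scale the cluster between 2 and 20 instances.
    print(managed_scaling_policy(2, 20))
    # apply_policy("j-XXXXXXXXXXXXX", managed_scaling_policy(2, 20))  # placeholder ID
```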
We spend less time tuning and monitoring our clusters. EMR is tuned for the cloud and constantly monitors each cluster, retrying failed tasks and automatically replacing poorly performing instances. Clusters are highly available and fail over automatically in the event of a node failure. EMR provides the latest stable open source software releases, so we don't have to manage updates and bug fixes, which leads to fewer issues and less effort to maintain our environment.
EMR automatically configures EC2 firewall settings to control network access to instances, and it launches clusters in an Amazon Virtual Private Cloud (VPC). Server-side or client-side encryption can be used with the AWS Key Management Service or with our own customer-managed keys. EMR makes it easy to enable other encryption options, such as in-transit and at-rest encryption, as well as strong authentication with Kerberos. We can use AWS Lake Formation or Apache Ranger to apply fine-grained data access controls for databases, tables, and columns.
We have complete control over our EMR clusters and our individual EMR jobs. We can launch EMR clusters with custom Amazon Linux AMIs and easily configure the clusters using scripts to install additional third-party software packages. EMR enables us to reconfigure applications on running clusters on the fly without the need to relaunch clusters. We can also customize the execution environment for individual jobs by specifying the libraries and runtime dependencies in a Docker container and submitting it with the job.
Amazon EMR on Amazon EC2
We can deploy EMR on Amazon EC2 and take advantage of On-Demand, Reserved, and Spot Instances. EMR handles the provisioning, management, and scaling of the EC2 instances. AWS offers a wide range of instance types, allowing us to choose the one that gives us the best performance or cost for our workload.
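As a rough illustration of mixing purchasing options, the sketch below builds a request for the boto3 run_job_flow call with On-Demand primary and core nodes and Spot task nodes. It assumes boto3 is installed and configured; the cluster name, release label, instance type, and the default role names are placeholder values that may differ in a given account, and the helper functions are hypothetical.

```python
def build_cluster_request(name: str, task_count: int,
                          release_label: str = "emr-6.15.0",
                          instance_type: str = "m5.xlarge") -> dict:
    """Build keyword arguments for the EMR run_job_flow API call."""

    def group(group_name: str, role: str, market: str, count: int) -> dict:
        return {
            "Name": group_name,
            "InstanceRole": role,
            "Market": market,
            "InstanceType": instance_type,
            "InstanceCount": count,
        }

    return {
        "Name": name,
        "ReleaseLabel": release_label,
        "Applications": [{"Name": "Spark"}],
        "Instances": {
            "InstanceGroups": [
                group("Primary", "MASTER", "ON_DEMAND", 1),
                group("Core", "CORE", "ON_DEMAND", 2),
                # Task nodes hold no HDFS data, so losing a Spot
                # instance does not lose stored blocks.
                group("Task", "TASK", "SPOT", task_count),
            ],
            # Terminate the cluster once its steps finish (transient cluster).
            "KeepJobFlowAliveWhenNoSteps": False,
        },
        "JobFlowRole": "EMR_EC2_DefaultRole",  # assumed default instance profile
        "ServiceRole": "EMR_DefaultRole",      # assumed default service role
    }


def launch(request: dict) -> str:
    """Submit the request and return the new cluster ID (needs AWS credentials)."""
    import boto3  # imported lazily so the builder works without boto3

    return boto3.client("emr").run_job_flow(**request)["JobFlowId"]


if __name__ == "__main__":
    request = build_cluster_request("demo-cluster", task_count=4)
    print([g["Market"] for g in request["Instances"]["InstanceGroups"]])
```

Keeping the request builder separate from the launch call makes it possible to inspect or unit-test the configuration without touching an AWS account.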
Amazon EMR on Amazon EKS
We can use EMR to run Apache Spark jobs on demand on Amazon Elastic Kubernetes Service (EKS), without needing to provision EMR clusters, to improve resource utilization and simplify infrastructure management. Amazon EKS gives us the flexibility to start, run, and scale Kubernetes applications in the AWS Cloud or on-premises. With Amazon EMR on EKS, we can share compute and memory resources across all of our applications and use a single set of Kubernetes tools to centrally monitor and manage our infrastructure.
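Submitting a Spark job to EMR on EKS might look like the sketch below, which builds a request for the start_job_run call of the boto3 emr-containers client. It assumes a virtual cluster and an execution role already exist; the IDs, ARN, S3 path, and release label shown are placeholders, and the helper functions are hypothetical names used for illustration.

```python
def build_spark_job_request(name: str, virtual_cluster_id: str,
                            execution_role_arn: str, entry_point: str,
                            release_label: str = "emr-6.15.0-latest",
                            spark_params: str = "--conf spark.executor.instances=2") -> dict:
    """Build keyword arguments for the emr-containers start_job_run call."""
    return {
        "name": name,
        "virtualClusterId": virtual_cluster_id,
        "executionRoleArn": execution_role_arn,
        "releaseLabel": release_label,
        "jobDriver": {
            "sparkSubmitJobDriver": {
                "entryPoint": entry_point,
                "sparkSubmitParameters": spark_params,
            }
        },
    }


def submit(request: dict) -> str:
    """Start the job run and return its ID (needs AWS credentials)."""
    import boto3  # imported lazily so the builder works without boto3

    return boto3.client("emr-containers").start_job_run(**request)["id"]


if __name__ == "__main__":
    request = build_spark_job_request(
        name="demo-spark-job",
        virtual_cluster_id="abcdefghijklmnop",                          # placeholder
        execution_role_arn="arn:aws:iam::123456789012:role/demo-role",  # placeholder
        entry_point="s3://example-bucket/scripts/job.py",               # placeholder
    )
    print(request["jobDriver"]["sparkSubmitJobDriver"]["entryPoint"])
```

Note that no EMR cluster is created here: the job runs as pods on the existing EKS cluster behind the virtual cluster.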
Amazon EMR on AWS Outposts
Amazon EMR is available on AWS Outposts, allowing us to set up, deploy, manage, and scale EMR in our on-premises environments, just as we would in the cloud. AWS Outposts brings AWS services, infrastructure, and operating models to virtually any data center, co-location space, or on-premises facility.
We can use EMR's built-in machine learning tools, including Apache Spark MLlib, TensorFlow, and Apache MXNet, to run scalable machine learning algorithms. Custom AMIs and bootstrap actions let us add our preferred libraries and tools to build our own predictive analytics toolset.
Extract, transform, load (ETL)
EMR can be used to quickly and cost-effectively perform data transformation workloads (ETL) such as sort, aggregate, and join on large datasets.
Analyze clickstream data from Amazon S3 using Apache Spark and Apache Hive to segment users, understand user preferences, and deliver more effective ads.
Analyze events from Apache Kafka, Amazon Kinesis, or other streaming data sources in real time with Apache Spark Streaming and Apache Flink to create long-running, highly available, and fault-tolerant streaming data pipelines on EMR. Persist transformed data sets to S3 or HDFS, and send insights to Amazon Elasticsearch Service (since renamed Amazon OpenSearch Service).
EMR Notebooks provide a managed analytic environment based on open-source Jupyter that allows data scientists, analysts, and developers to prepare and visualize data, collaborate with peers, build applications, and perform interactive analyses.
EMR can be used to process vast amounts of genomic data and other large scientific data sets quickly and efficiently. Researchers can access genomic data hosted for free on AWS.
Thank you for reading this article; I really appreciate it. If you have any questions, feel free to leave a comment.