{"id":43955,"date":"2019-06-13T19:35:40","date_gmt":"2019-06-13T16:35:40","guid":{"rendered":"https:\/\/www.altoros.com\/blog\/?p=43955"},"modified":"2019-06-13T19:35:40","modified_gmt":"2019-06-13T16:35:40","slug":"optimizing-the-performance-of-apache-spark-queries","status":"publish","type":"post","link":"https:\/\/www.altoros.com\/blog\/optimizing-the-performance-of-apache-spark-queries\/","title":{"rendered":"Optimizing the Performance of Apache Spark Queries"},"content":{"rendered":"<div id=\"ez-toc-container\" class=\"ez-toc-v2_0_82_2 counter-hierarchy ez-toc-counter ez-toc-transparent ez-toc-container-direction\">\n<div class=\"ez-toc-title-container\">\n<p class=\"ez-toc-title\" style=\"cursor:inherit\">Table of Contents<\/p>\n<span class=\"ez-toc-title-toggle\"><a href=\"#\" class=\"ez-toc-pull-right ez-toc-btn ez-toc-btn-xs ez-toc-btn-default ez-toc-toggle\" aria-label=\"Toggle Table of Content\"><span class=\"ez-toc-js-icon-con\"><span class=\"\"><span class=\"eztoc-hide\" style=\"display:none;\">Toggle<\/span><span class=\"ez-toc-icon-toggle-span\"><svg style=\"fill: #999;color:#999\" xmlns=\"http:\/\/www.w3.org\/2000\/svg\" class=\"list-377408\" width=\"20px\" height=\"20px\" viewBox=\"0 0 24 24\" fill=\"none\"><path d=\"M6 6H4v2h2V6zm14 0H8v2h12V6zM4 11h2v2H4v-2zm16 0H8v2h12v-2zM4 16h2v2H4v-2zm16 0H8v2h12v-2z\" fill=\"currentColor\"><\/path><\/svg><svg style=\"fill: #999;color:#999\" class=\"arrow-unsorted-368013\" xmlns=\"http:\/\/www.w3.org\/2000\/svg\" width=\"10px\" height=\"10px\" viewBox=\"0 0 24 24\" version=\"1.2\" baseProfile=\"tiny\"><path d=\"M18.2 9.3l-6.2-6.3-6.2 6.3c-.2.2-.3.4-.3.7s.1.5.3.7c.2.2.4.3.7.3h11c.3 0 .5-.1.7-.3.2-.2.3-.5.3-.7s-.1-.5-.3-.7zM5.8 14.7l6.2 6.3 6.2-6.3c.2-.2.3-.5.3-.7s-.1-.5-.3-.7c-.2-.2-.4-.3-.7-.3h-11c-.3 0-.5.1-.7.3-.2.2-.3.5-.3.7s.1.5.3.7z\"\/><\/svg><\/span><\/span><\/span><\/a><\/span><\/div>\n<nav><ul class='ez-toc-list ez-toc-list-level-1 ' ><li class='ez-toc-page-1 ez-toc-heading-level-3'><a 
class=\"ez-toc-link ez-toc-heading-1\" href=\"https:\/\/www.altoros.com\/blog\/optimizing-the-performance-of-apache-spark-queries\/#A_testing_environment\" >A testing environment<\/a><\/li><li class='ez-toc-page-1 ez-toc-heading-level-3'><a class=\"ez-toc-link ez-toc-heading-2\" href=\"https:\/\/www.altoros.com\/blog\/optimizing-the-performance-of-apache-spark-queries\/#What_can_be_optimized\" >What can be optimized?<\/a><\/li><li class='ez-toc-page-1 ez-toc-heading-level-3'><a class=\"ez-toc-link ez-toc-heading-3\" href=\"https:\/\/www.altoros.com\/blog\/optimizing-the-performance-of-apache-spark-queries\/#Optimization_results\" >Optimization results<\/a><\/li><li class='ez-toc-page-1 ez-toc-heading-level-3'><a class=\"ez-toc-link ez-toc-heading-4\" href=\"https:\/\/www.altoros.com\/blog\/optimizing-the-performance-of-apache-spark-queries\/#Further_reading\" >Further reading<\/a><\/li><\/ul><\/nav><\/div>\n<h3><span class=\"ez-toc-section\" id=\"A_testing_environment\"><\/span>A testing environment<span class=\"ez-toc-section-end\"><\/span><\/h3>\n<p><a href=\"https:\/\/spark.apache.org\/\" rel=\"noopener noreferrer\" target=\"_blank\">Apache Spark<\/a> is a popular technology for processing, managing, and analyzing big data. It is a unified analytics engine with built-in modules for SQL, stream processing, machine learning, and graph processing.<\/p>\n<p>In this post, we will explore optimization techniques, which improve query run times for two particular modules of the technology: <a href=\"https:\/\/spark.apache.org\/documentation.html\" rel=\"noopener noreferrer\" target=\"_blank\">Spark Core<\/a> and <a href=\"https:\/\/spark.apache.org\/docs\/latest\/sql-programming-guide.html\" rel=\"noopener noreferrer\" target=\"_blank\">Spark SQL<\/a>. 
In its turn, Spark SQL comprises two components: <em>pure Spark SQL<\/em>, which will also be under investigation, and <em>DataFrame API<\/em>.<\/p>\n<p>As a testing architecture, we set up a Spark cluster of a master and three workers using the Spark Standalone mode. The Apache Spark application starts by submitting a JAR file to the master, which then assigns tasks to the workers. The application reads data from the Google Cloud Platform storage.<\/p>\n<p><center><a href=\"https:\/\/www.altoros.com\/blog\/wp-content\/uploads\/2019\/06\/architecture-of-apache-spark-cluster.png\"><img decoding=\"async\" src=\"https:\/\/www.altoros.com\/blog\/wp-content\/uploads\/2019\/06\/architecture-of-apache-spark-cluster.png\" alt=\"\" width=\"640\" class=\"aligncenter size-full wp-image-44117\" \/><\/a><small>An architecture of the Apache Spark cluster with a master and three workers<\/small><\/center><\/p>\n<p>To measure query run time, the following command was used (see the full source code in <a href=\"https:\/\/github.com\/Altoros\/Spark-Optimization-Tutorial\" rel=\"noopener noreferrer\" target=\"_blank\">this GitHub repo<\/a>):<\/p>\n<pre style=\"padding-left: 20px;\"><strong>SELECT<\/strong> u.id, <strong>count<\/strong>(distinct c.id) <strong>FROM<\/strong> users <strong>AS<\/strong> u <strong>INNER JOIN<\/strong> comments <strong>AS<\/strong> c <strong>ON<\/strong> u.id = c.user_id <strong>INNER JOIN<\/strong> posts <strong>AS<\/strong> p <strong>ON<\/strong> p.owner_user_id = u.id <strong>WHERE<\/strong> u.reputation > 1 <strong>AND<\/strong> c.post_id = p.id <strong>GROUP BY<\/strong> u.id <strong>ORDER BY<\/strong> <strong>count<\/strong>(<strong>distinct<\/strong> c.id) <strong>DESC<\/strong><\/pre>\n<p>&nbsp;<\/p>\n<h3><span class=\"ez-toc-section\" id=\"What_can_be_optimized\"><\/span>What can be optimized?<span class=\"ez-toc-section-end\"><\/span><\/h3>\n<p><b>Spark Core<\/b><\/p>\n<p>Using default settings, Spark Core has the slowest processing time among 
the three investigated components. This can be improved through changes to how the resilient distributed datasets (RDDs) are handled and to serialization.<\/p>\n<p>Since the <code style=\"color: #222222; background-color: #e6e6e6; padding: 1px 2px;\">User<\/code> RDD is small enough to fit in the memory of each worker, it can be transformed into a broadcast variable. This turns the entire operation into a so-called <code style=\"color: #222222; background-color: #e6e6e6; padding: 1px 2px;\">map-side join<\/code>, in which the large RDD doesn&#8217;t need to be shuffled. The <code style=\"color: #222222; background-color: #e6e6e6; padding: 1px 2px;\">User<\/code> RDD is converted into a regular <code style=\"color: #222222; background-color: #e6e6e6; padding: 1px 2px;\">Map<\/code> and broadcast to each worker node as a variable.<\/p>\n<p>The <code style=\"color: #222222; background-color: #e6e6e6; padding: 1px 2px;\">Post<\/code> RDD can be partitioned before joining the <code style=\"color: #222222; background-color: #e6e6e6; padding: 1px 2px;\">User<\/code> RDD via the <code style=\"color: #222222; background-color: #e6e6e6; padding: 1px 2px;\">partitionBy(new HashPartitioner(25))<\/code> method. This helps to reduce shuffling, as the partitioning will be predefined for the <code style=\"color: #222222; background-color: #e6e6e6; padding: 1px 2px;\">Post<\/code> RDD in future transformations and joins.<\/p>\n<p>Some of the RDD\u2019s methods use variables in the code. For example, the value used in the <code style=\"color: #222222; background-color: #e6e6e6; padding: 1px 2px;\">filter(user -> user.getReputation() > 1)<\/code> predicate should be broadcast, so that tasks read it from the local virtual machine instead of getting it from the driver. The driver then stores the broadcast filter variables on each worker node. 
In this case, each task no longer requests the value of the variable from the driver and reads it locally.<\/p>\n<p>Next, Apache Spark uses a Java serializer by default, which has mediocre performance. This can be replaced with the Kryo serializer by setting the following properties:<\/p>\n<ul>\n<li><code style=\"color: #222222; background-color: #e6e6e6; padding: 1px 2px;\">spark.serializer<\/code> equals <code style=\"color: #222222; background-color: #e6e6e6; padding: 1px 2px;\">org.apache.spark.serializer.KryoSerializer<\/code><\/li>\n<li><code style=\"color: #222222; background-color: #e6e6e6; padding: 1px 2px;\">spark.kryoserializer.buffer.max<\/code> equals 128 mebibytes<\/li>\n<li><code style=\"color: #222222; background-color: #e6e6e6; padding: 1px 2px;\">spark.kryoserializer.buffer<\/code> equals 64 mebibytes<\/li>\n<\/ul>\n<p>Additionally, the <code style=\"color: #222222; background-color: #e6e6e6; padding: 1px 2px;\">User<\/code> classes should be registered in <code style=\"color: #222222; background-color: #e6e6e6; padding: 1px 2px;\">registerKryoClasses<\/code>; otherwise, Kryo has to store the full class name with every serialized object, which erodes much of the benefit.<\/p>\n<p>&nbsp;<br \/>\n<b>Pure Spark SQL<\/b><\/p>\n<p>Before optimization, pure Spark SQL actually has decent performance. Still, there is room for improvement in two areas:<\/p>\n<ul>\n<li><code style=\"color: #222222; background-color: #e6e6e6; padding: 1px 2px;\">shuffle.partitions<\/code><\/li>\n<li><code style=\"color: #222222; background-color: #e6e6e6; padding: 1px 2px;\">BroadcastHashJoin<\/code><\/li>\n<\/ul>\n<p>First, pure Spark SQL has 200 <code style=\"color: #222222; background-color: #e6e6e6; padding: 1px 2px;\">shuffle.partitions<\/code> by default, meaning each shuffle stage runs 200 tasks, with each task processing a roughly equal share of the data. 
Since Apache Spark spends time executing extra operations for each task, such as serialization and deserialization, decreasing the number of <code style=\"color: #222222; background-color: #e6e6e6; padding: 1px 2px;\">shuffle.partitions<\/code> to 25 can significantly shorten query run times.<\/p>\n<p>Second, pure Spark SQL uses <code style=\"color: #222222; background-color: #e6e6e6; padding: 1px 2px;\">SortMergeJoin<\/code> for the <code style=\"color: #222222; background-color: #e6e6e6; padding: 1px 2px;\">JOIN<\/code> operation by default. Compared to <code style=\"color: #222222; background-color: #e6e6e6; padding: 1px 2px;\">BroadcastHashJoin<\/code>, <code style=\"color: #222222; background-color: #e6e6e6; padding: 1px 2px;\">SortMergeJoin<\/code> does not use a lot of RAM, but processing queries takes longer. If the amount of RAM available is enough for storing data, <code style=\"color: #222222; background-color: #e6e6e6; padding: 1px 2px;\">BroadcastHashJoin<\/code> becomes the optimal choice for faster data processing.<\/p>\n<p>To enable <code style=\"color: #222222; background-color: #e6e6e6; padding: 1px 2px;\">BroadcastHashJoin<\/code>, the value of <code style=\"color: #222222; background-color: #e6e6e6; padding: 1px 2px;\">autoBroadcastJoinThreshold<\/code> should be increased to match the size of the filtered data set being queried.<\/p>\n<p>&nbsp;<\/p>\n<h3><span class=\"ez-toc-section\" id=\"Optimization_results\"><\/span>Optimization results<span class=\"ez-toc-section-end\"><\/span><\/h3>\n<p>While using Spark Core, developers should be well aware of the Spark working principles. Otherwise, jobs can suffer from inefficient run times and system downtime. After the implementation of various optimization techniques, the job run time decreased by 33.3%.<\/p>\n<p>The investigation demonstrated that pure Spark SQL showed the best results out of the three modules before implementing any optimization techniques. 
By applying basic optimization, the results were improved by 13.3%.<\/p>\n<p>Although technologically similar, DataFrame API can&#8217;t boast of the same processing time as pure Spark SQL due to the amount of data aggregated. DataFrame API processes all the data from the tables, which significantly increases job run time. With optimization applied, we improved the running time by 54%, making it similar to pure Spark SQL.<\/p>\n<p><center><a href=\"https:\/\/www.altoros.com\/blog\/wp-content\/uploads\/2019\/06\/results-of-optimizing-apache-spark-modules.png\"><img decoding=\"async\" src=\"https:\/\/www.altoros.com\/blog\/wp-content\/uploads\/2019\/06\/results-of-optimizing-apache-spark-modules-1024x633.png\" alt=\"\" width=\"640\" class=\"aligncenter size-large wp-image-44212\" \/><\/a><small>The results of optimizing the three Apache Spark modules<\/small><\/center><\/p>\n<p>While default implementations of Apache Spark can be optimized to work faster, it is important to note that each Apache Spark\u2013based project is unique and requires a customized approach dependent on system requirements and parameters. 
In this regard, the values suggested above are based on our own tests with Apache Spark.<\/p>\n<p>To learn more about how we optimized our Apache Spark clusters, including DataFrame API, as well as what hardware configurations were used, check out the full <a href=\"https:\/\/www.altoros.com\/research-papers\/essential-optimization-methods-to-make-apache-spark-work-faster\/\">research paper<\/a>.<\/p>\n<p>&nbsp;<\/p>\n<h3><span class=\"ez-toc-section\" id=\"Further_reading\"><\/span>Further reading<span class=\"ez-toc-section-end\"><\/span><\/h3>\n<ul>\n<li><a href=\"https:\/\/www.altoros.com\/blog\/multi-cluster-deployment-options-for-apache-kafka-pros-and-cons\/\">Multi-Cluster Deployment Options for Apache Kafka: Pros and Cons<\/a><\/li>\n<li><a href=\"https:\/\/www.altoros.com\/blog\/using-multi-threading-to-build-neural-networks-with-tensorflow-and-apache-spark\/\">Using Multi-Threading to Build Neural Networks with TensorFlow and Apache Spark<\/a><\/li>\n<\/ul>\n<hr \/>\n<p><center><small>This blog post was written by <a href=\"https:\/\/github.com\/ayudovin\/\" rel=\"noopener noreferrer\" target=\"_blank\">Artsiom Yudovin<\/a> and <a href=\"https:\/\/www.altoros.com\/blog\/author\/carlo\/\" rel=\"noopener noreferrer\" target=\"_blank\">Carlo Gutierrez<\/a>,<br \/>\nedited by <a href=\"https:\/\/www.altoros.com\/blog\/author\/sophie.turol\/\" rel=\"noopener noreferrer\" target=\"_blank\">Sophia Turol<\/a> and <a href=\"https:\/\/www.altoros.com\/blog\/author\/alex\/\">Alex Khizhniak<\/a>.<\/small><\/center><\/p>\n","protected":false},"excerpt":{"rendered":"<p>A testing environment<\/p>\n<p>Apache Spark is a popular technology for processing, managing, and analyzing big data. 
It is a unified analytics engine with built-in modules for SQL, stream processing, machine learning, and graph processing.<\/p>\n<p>In this post, we will explore optimization techniques, which improve query run times for two particular modules of [&#8230;]<\/p>\n","protected":false},"author":32,"featured_media":44219,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_acf_changed":false,"content-type":"","footnotes":"","_links_to":"","_links_to_target":""},"categories":[214],"tags":[894,895],"class_list":["post-43955","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-tutorials","tag-benchmarking","tag-research-and-development"],"acf":[],"yoast_head":"<!-- This site is optimized with the Yoast SEO plugin v27.4 - https:\/\/yoast.com\/product\/yoast-seo-wordpress\/ -->\n<title>Optimizing the Performance of Apache Spark Queries | Altoros<\/title>\n<meta name=\"description\" content=\"Learn how the run times of Spark Core and Spark SQL queries can be improved by speeding up slow processes and optimizing serialization tasks.\" \/>\n<meta name=\"robots\" content=\"index, follow, max-snippet:-1, max-image-preview:large, max-video-preview:-1\" \/>\n<link rel=\"canonical\" href=\"https:\/\/www.altoros.com\/blog\/optimizing-the-performance-of-apache-spark-queries\/\" \/>\n<meta property=\"og:locale\" content=\"en_US\" \/>\n<meta property=\"og:type\" content=\"article\" \/>\n<meta property=\"og:title\" content=\"Optimizing the Performance of Apache Spark Queries | Altoros\" \/>\n<meta property=\"og:description\" content=\"A testing environment Apache Spark is a popular technology for processing, managing, and analyzing big data. It is a unified analytics engine with built-in modules for SQL, stream processing, machine learning, and graph processing. 
In this post, we will explore optimization techniques, which improve query run times for two particular modules of [...]\" \/>\n<meta property=\"og:url\" content=\"https:\/\/www.altoros.com\/blog\/optimizing-the-performance-of-apache-spark-queries\/\" \/>\n<meta property=\"og:site_name\" content=\"Altoros\" \/>\n<meta property=\"article:published_time\" content=\"2019-06-13T16:35:40+00:00\" \/>\n<meta property=\"og:image\" content=\"https:\/\/www.altoros.com\/blog\/wp-content\/uploads\/2019\/06\/optimizing-apache-spark.gif\" \/>\n\t<meta property=\"og:image:width\" content=\"640\" \/>\n\t<meta property=\"og:image:height\" content=\"396\" \/>\n\t<meta property=\"og:image:type\" content=\"image\/gif\" \/>\n<meta name=\"author\" content=\"Carlo Gutierrez\" \/>\n<meta name=\"twitter:label1\" content=\"Written by\" \/>\n\t<meta name=\"twitter:data1\" content=\"Carlo Gutierrez\" \/>\n\t<meta name=\"twitter:label2\" content=\"Est. reading time\" \/>\n\t<meta name=\"twitter:data2\" content=\"4 minutes\" \/>\n<script type=\"application\/ld+json\" class=\"yoast-schema-graph\">{\"@context\":\"https:\\\/\\\/schema.org\",\"@graph\":[{\"@type\":\"Article\",\"@id\":\"https:\\\/\\\/www.altoros.com\\\/blog\\\/optimizing-the-performance-of-apache-spark-queries\\\/#article\",\"isPartOf\":{\"@id\":\"https:\\\/\\\/www.altoros.com\\\/blog\\\/optimizing-the-performance-of-apache-spark-queries\\\/\"},\"author\":{\"name\":\"Carlo Gutierrez\",\"@id\":\"https:\\\/\\\/www.altoros.com\\\/blog\\\/#\\\/schema\\\/person\\\/833e109f77de753b2b472dca0236b442\"},\"headline\":\"Optimizing the Performance of Apache Spark 
Queries\",\"datePublished\":\"2019-06-13T16:35:40+00:00\",\"mainEntityOfPage\":{\"@id\":\"https:\\\/\\\/www.altoros.com\\\/blog\\\/optimizing-the-performance-of-apache-spark-queries\\\/\"},\"wordCount\":805,\"commentCount\":0,\"image\":{\"@id\":\"https:\\\/\\\/www.altoros.com\\\/blog\\\/optimizing-the-performance-of-apache-spark-queries\\\/#primaryimage\"},\"thumbnailUrl\":\"https:\\\/\\\/www.altoros.com\\\/blog\\\/wp-content\\\/uploads\\\/2019\\\/06\\\/optimizing-apache-spark.gif\",\"keywords\":[\"Benchmarking\",\"Research and Development\"],\"articleSection\":[\"Tutorials\"],\"inLanguage\":\"en-US\",\"potentialAction\":[{\"@type\":\"CommentAction\",\"name\":\"Comment\",\"target\":[\"https:\\\/\\\/www.altoros.com\\\/blog\\\/optimizing-the-performance-of-apache-spark-queries\\\/#respond\"]}]},{\"@type\":\"WebPage\",\"@id\":\"https:\\\/\\\/www.altoros.com\\\/blog\\\/optimizing-the-performance-of-apache-spark-queries\\\/\",\"url\":\"https:\\\/\\\/www.altoros.com\\\/blog\\\/optimizing-the-performance-of-apache-spark-queries\\\/\",\"name\":\"Optimizing the Performance of Apache Spark Queries | 
Altoros\",\"isPartOf\":{\"@id\":\"https:\\\/\\\/www.altoros.com\\\/blog\\\/#website\"},\"primaryImageOfPage\":{\"@id\":\"https:\\\/\\\/www.altoros.com\\\/blog\\\/optimizing-the-performance-of-apache-spark-queries\\\/#primaryimage\"},\"image\":{\"@id\":\"https:\\\/\\\/www.altoros.com\\\/blog\\\/optimizing-the-performance-of-apache-spark-queries\\\/#primaryimage\"},\"thumbnailUrl\":\"https:\\\/\\\/www.altoros.com\\\/blog\\\/wp-content\\\/uploads\\\/2019\\\/06\\\/optimizing-apache-spark.gif\",\"datePublished\":\"2019-06-13T16:35:40+00:00\",\"author\":{\"@id\":\"https:\\\/\\\/www.altoros.com\\\/blog\\\/#\\\/schema\\\/person\\\/833e109f77de753b2b472dca0236b442\"},\"breadcrumb\":{\"@id\":\"https:\\\/\\\/www.altoros.com\\\/blog\\\/optimizing-the-performance-of-apache-spark-queries\\\/#breadcrumb\"},\"inLanguage\":\"en-US\",\"potentialAction\":[{\"@type\":\"ReadAction\",\"target\":[\"https:\\\/\\\/www.altoros.com\\\/blog\\\/optimizing-the-performance-of-apache-spark-queries\\\/\"]}]},{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\\\/\\\/www.altoros.com\\\/blog\\\/optimizing-the-performance-of-apache-spark-queries\\\/#primaryimage\",\"url\":\"https:\\\/\\\/www.altoros.com\\\/blog\\\/wp-content\\\/uploads\\\/2019\\\/06\\\/optimizing-apache-spark.gif\",\"contentUrl\":\"https:\\\/\\\/www.altoros.com\\\/blog\\\/wp-content\\\/uploads\\\/2019\\\/06\\\/optimizing-apache-spark.gif\",\"width\":640,\"height\":396},{\"@type\":\"BreadcrumbList\",\"@id\":\"https:\\\/\\\/www.altoros.com\\\/blog\\\/optimizing-the-performance-of-apache-spark-queries\\\/#breadcrumb\",\"itemListElement\":[{\"@type\":\"ListItem\",\"position\":1,\"name\":\"Home\",\"item\":\"https:\\\/\\\/www.altoros.com\\\/blog\\\/\"},{\"@type\":\"ListItem\",\"position\":2,\"name\":\"Optimizing the Performance of Apache Spark 
Queries\"}]},{\"@type\":\"WebSite\",\"@id\":\"https:\\\/\\\/www.altoros.com\\\/blog\\\/#website\",\"url\":\"https:\\\/\\\/www.altoros.com\\\/blog\\\/\",\"name\":\"Altoros\",\"description\":\"Insight\",\"potentialAction\":[{\"@type\":\"SearchAction\",\"target\":{\"@type\":\"EntryPoint\",\"urlTemplate\":\"https:\\\/\\\/www.altoros.com\\\/blog\\\/?s={search_term_string}\"},\"query-input\":{\"@type\":\"PropertyValueSpecification\",\"valueRequired\":true,\"valueName\":\"search_term_string\"}}],\"inLanguage\":\"en-US\"},{\"@type\":\"Person\",\"@id\":\"https:\\\/\\\/www.altoros.com\\\/blog\\\/#\\\/schema\\\/person\\\/833e109f77de753b2b472dca0236b442\",\"name\":\"Carlo Gutierrez\",\"image\":{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\\\/\\\/www.altoros.com\\\/blog\\\/wp-content\\\/uploads\\\/2021\\\/02\\\/CG_portrait-2-96x96.jpg\",\"url\":\"https:\\\/\\\/www.altoros.com\\\/blog\\\/wp-content\\\/uploads\\\/2021\\\/02\\\/CG_portrait-2-96x96.jpg\",\"contentUrl\":\"https:\\\/\\\/www.altoros.com\\\/blog\\\/wp-content\\\/uploads\\\/2021\\\/02\\\/CG_portrait-2-96x96.jpg\",\"caption\":\"Carlo Gutierrez\"},\"description\":\"Carlo Gutierrez is a Technical Writer at Altoros. As part of the editorial team, his focus has been on emerging technologies such as Cloud Foundry, Kubernetes, blockchain, and the Internet of Things. Prior to Altoros, he primarily wrote about enterprise and consumer technology. Carlo has over 12 years of experience in the publishing industry. Previously, he served as an Editor for PC World Philippines and Questex Asia, as well as a Designer for Tropa Entertainment.\",\"url\":\"https:\\\/\\\/www.altoros.com\\\/blog\\\/author\\\/carlo\\\/\"}]}<\/script>\n<!-- \/ Yoast SEO plugin. 
-->","yoast_head_json":{"title":"Optimizing the Performance of Apache Spark Queries | Altoros","description":"Learn how the run times of Spark Core and Spark SQL queries can be improved by speeding up slow processes and optimizing serialization tasks.","robots":{"index":"index","follow":"follow","max-snippet":"max-snippet:-1","max-image-preview":"max-image-preview:large","max-video-preview":"max-video-preview:-1"},"canonical":"https:\/\/www.altoros.com\/blog\/optimizing-the-performance-of-apache-spark-queries\/","og_locale":"en_US","og_type":"article","og_title":"Optimizing the Performance of Apache Spark Queries | Altoros","og_description":"A testing environment Apache Spark is a popular technology for processing, managing, and analyzing big data. It is a unified analytics engine with built-in modules for SQL, stream processing, machine learning, and graph processing. In this post, we will explore optimization techniques, which improve query run times for two particular modules of [...]","og_url":"https:\/\/www.altoros.com\/blog\/optimizing-the-performance-of-apache-spark-queries\/","og_site_name":"Altoros","article_published_time":"2019-06-13T16:35:40+00:00","og_image":[{"width":640,"height":396,"url":"https:\/\/www.altoros.com\/blog\/wp-content\/uploads\/2019\/06\/optimizing-apache-spark.gif","type":"image\/gif"}],"author":"Carlo Gutierrez","twitter_misc":{"Written by":"Carlo Gutierrez","Est. 
reading time":"4 minutes"},"schema":{"@context":"https:\/\/schema.org","@graph":[{"@type":"Article","@id":"https:\/\/www.altoros.com\/blog\/optimizing-the-performance-of-apache-spark-queries\/#article","isPartOf":{"@id":"https:\/\/www.altoros.com\/blog\/optimizing-the-performance-of-apache-spark-queries\/"},"author":{"name":"Carlo Gutierrez","@id":"https:\/\/www.altoros.com\/blog\/#\/schema\/person\/833e109f77de753b2b472dca0236b442"},"headline":"Optimizing the Performance of Apache Spark Queries","datePublished":"2019-06-13T16:35:40+00:00","mainEntityOfPage":{"@id":"https:\/\/www.altoros.com\/blog\/optimizing-the-performance-of-apache-spark-queries\/"},"wordCount":805,"commentCount":0,"image":{"@id":"https:\/\/www.altoros.com\/blog\/optimizing-the-performance-of-apache-spark-queries\/#primaryimage"},"thumbnailUrl":"https:\/\/www.altoros.com\/blog\/wp-content\/uploads\/2019\/06\/optimizing-apache-spark.gif","keywords":["Benchmarking","Research and Development"],"articleSection":["Tutorials"],"inLanguage":"en-US","potentialAction":[{"@type":"CommentAction","name":"Comment","target":["https:\/\/www.altoros.com\/blog\/optimizing-the-performance-of-apache-spark-queries\/#respond"]}]},{"@type":"WebPage","@id":"https:\/\/www.altoros.com\/blog\/optimizing-the-performance-of-apache-spark-queries\/","url":"https:\/\/www.altoros.com\/blog\/optimizing-the-performance-of-apache-spark-queries\/","name":"Optimizing the Performance of Apache Spark Queries | 
Altoros","isPartOf":{"@id":"https:\/\/www.altoros.com\/blog\/#website"},"primaryImageOfPage":{"@id":"https:\/\/www.altoros.com\/blog\/optimizing-the-performance-of-apache-spark-queries\/#primaryimage"},"image":{"@id":"https:\/\/www.altoros.com\/blog\/optimizing-the-performance-of-apache-spark-queries\/#primaryimage"},"thumbnailUrl":"https:\/\/www.altoros.com\/blog\/wp-content\/uploads\/2019\/06\/optimizing-apache-spark.gif","datePublished":"2019-06-13T16:35:40+00:00","author":{"@id":"https:\/\/www.altoros.com\/blog\/#\/schema\/person\/833e109f77de753b2b472dca0236b442"},"breadcrumb":{"@id":"https:\/\/www.altoros.com\/blog\/optimizing-the-performance-of-apache-spark-queries\/#breadcrumb"},"inLanguage":"en-US","potentialAction":[{"@type":"ReadAction","target":["https:\/\/www.altoros.com\/blog\/optimizing-the-performance-of-apache-spark-queries\/"]}]},{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/www.altoros.com\/blog\/optimizing-the-performance-of-apache-spark-queries\/#primaryimage","url":"https:\/\/www.altoros.com\/blog\/wp-content\/uploads\/2019\/06\/optimizing-apache-spark.gif","contentUrl":"https:\/\/www.altoros.com\/blog\/wp-content\/uploads\/2019\/06\/optimizing-apache-spark.gif","width":640,"height":396},{"@type":"BreadcrumbList","@id":"https:\/\/www.altoros.com\/blog\/optimizing-the-performance-of-apache-spark-queries\/#breadcrumb","itemListElement":[{"@type":"ListItem","position":1,"name":"Home","item":"https:\/\/www.altoros.com\/blog\/"},{"@type":"ListItem","position":2,"name":"Optimizing the Performance of Apache Spark 
Queries"}]},{"@type":"WebSite","@id":"https:\/\/www.altoros.com\/blog\/#website","url":"https:\/\/www.altoros.com\/blog\/","name":"Altoros","description":"Insight","potentialAction":[{"@type":"SearchAction","target":{"@type":"EntryPoint","urlTemplate":"https:\/\/www.altoros.com\/blog\/?s={search_term_string}"},"query-input":{"@type":"PropertyValueSpecification","valueRequired":true,"valueName":"search_term_string"}}],"inLanguage":"en-US"},{"@type":"Person","@id":"https:\/\/www.altoros.com\/blog\/#\/schema\/person\/833e109f77de753b2b472dca0236b442","name":"Carlo Gutierrez","image":{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/www.altoros.com\/blog\/wp-content\/uploads\/2021\/02\/CG_portrait-2-96x96.jpg","url":"https:\/\/www.altoros.com\/blog\/wp-content\/uploads\/2021\/02\/CG_portrait-2-96x96.jpg","contentUrl":"https:\/\/www.altoros.com\/blog\/wp-content\/uploads\/2021\/02\/CG_portrait-2-96x96.jpg","caption":"Carlo Gutierrez"},"description":"Carlo Gutierrez is a Technical Writer at Altoros. As part of the editorial team, his focus has been on emerging technologies such as Cloud Foundry, Kubernetes, blockchain, and the Internet of Things. Prior to Altoros, he primarily wrote about enterprise and consumer technology. Carlo has over 12 years of experience in the publishing industry. 
Previously, he served as an Editor for PC World Philippines and Questex Asia, as well as a Designer for Tropa Entertainment.","url":"https:\/\/www.altoros.com\/blog\/author\/carlo\/"}]}},"_links":{"self":[{"href":"https:\/\/www.altoros.com\/blog\/wp-json\/wp\/v2\/posts\/43955","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.altoros.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.altoros.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.altoros.com\/blog\/wp-json\/wp\/v2\/users\/32"}],"replies":[{"embeddable":true,"href":"https:\/\/www.altoros.com\/blog\/wp-json\/wp\/v2\/comments?post=43955"}],"version-history":[{"count":82,"href":"https:\/\/www.altoros.com\/blog\/wp-json\/wp\/v2\/posts\/43955\/revisions"}],"predecessor-version":[{"id":44253,"href":"https:\/\/www.altoros.com\/blog\/wp-json\/wp\/v2\/posts\/43955\/revisions\/44253"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.altoros.com\/blog\/wp-json\/wp\/v2\/media\/44219"}],"wp:attachment":[{"href":"https:\/\/www.altoros.com\/blog\/wp-json\/wp\/v2\/media?parent=43955"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.altoros.com\/blog\/wp-json\/wp\/v2\/categories?post=43955"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.altoros.com\/blog\/wp-json\/wp\/v2\/tags?post=43955"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}