Five quick links: Managing big data in the cloud
Cloud computing can streamline big data management and analysis. But, before diving in, be sure to know the latest in cloud and big data technology.
Big data can be a big problem for traditional IT systems, because processing massive amounts of structured and unstructured data can tie up compute and storage resources. However, cloud computing allows IT to manage big data sets without monopolizing on-premises systems.
To effectively manage big data in the cloud, it's important to know the latest tools and services. Hadoop, for example, is a common Apache framework for big data. Additionally, many major cloud providers have their own big data services, such as Amazon Web Services' Elastic MapReduce, Google's BigQuery and Pivotal's Big Data Suite.
Here are five quick links for exploring big data in the cloud -- from the basics to helpful tools and services.
- Which cloud model is best for your big data project?
Public, private and hybrid clouds have their benefits. Public cloud provides elasticity and scalability in a pay-per-use structure. Private cloud, based on on-premises infrastructure, offers enterprise control. And hybrid cloud is the mixture of private and public cloud services with orchestration between the two. But, when it comes to choosing the right cloud model for big data, it's important to dig deeper into each.
While tighter control makes it enticing, private cloud's on-premises nature isn't ideal for big data. Public cloud, instead, is a good fit for on-demand big data tasks. However, potential bandwidth limitations and data transfer costs could be a cause for concern.
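To get a feel for the bandwidth concern, a back-of-the-envelope calculation helps. The sketch below is only arithmetic: it estimates how long a data set takes to move over a link at a given sustained rate. The function name, the `efficiency` parameter and the sample sizes are illustrative assumptions, not figures from any provider.

```python
def transfer_hours(data_terabytes: float, link_gbps: float, efficiency: float = 1.0) -> float:
    """Hours needed to move a data set over a link at a sustained rate.

    efficiency is a hypothetical fudge factor (0-1) for protocol overhead
    and contention; 1.0 means the link is fully used the whole time.
    """
    bits = data_terabytes * 1e12 * 8           # decimal terabytes -> bits
    usable_bps = link_gbps * 1e9 * efficiency  # sustained throughput, bits/s
    return bits / usable_bps / 3600            # seconds -> hours


if __name__ == "__main__":
    # Illustrative data set sizes only.
    for tb in (1, 10, 100):
        print(f"{tb:>4} TB over 1 Gbps: {transfer_hours(tb, 1.0):.1f} hours")
```

Even under the optimistic assumption of a fully used 1 Gbps link, 10 TB takes roughly 22 hours to move, which is why transfer time belongs in the planning for any cloud-based big data project.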
- Entry-level big data analysis with Google BigQuery
A big data project is a significant undertaking for any organization. To be successful, it's crucial to find the right service for your data needs. While Hadoop is a common big data option, it's not for everyone. One alternative -- especially for developers who prefer SQL to MapReduce -- is Google BigQuery, says cloud expert Dan Sullivan.
BigQuery makes it easier to get started with big data analysis, but it comes with some tradeoffs. Sullivan details how to use BigQuery and what enterprises can expect from this big data analytics service.
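The SQL-versus-MapReduce preference is easier to see side by side. The following is a minimal sketch that computes the same per-region totals twice: once in a hand-written map/reduce style and once with a single SQL `GROUP BY`. It uses Python's built-in `sqlite3` purely as a local stand-in; it is not BigQuery, and the `sales` table and its rows are hypothetical.

```python
import sqlite3
from collections import defaultdict

# Hypothetical sample data: (region, amount) pairs.
rows = [("us", 120), ("eu", 80), ("us", 30), ("apac", 50), ("eu", 20)]


def map_phase(records):
    """Emit one (key, value) pair per input record."""
    for region, amount in records:
        yield region, amount


def reduce_phase(pairs):
    """Sum the values for each key."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)


def sql_totals(records):
    """Same aggregation expressed as one declarative SQL query."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE sales (region TEXT, amount INTEGER)")
    conn.executemany("INSERT INTO sales VALUES (?, ?)", records)
    result = dict(conn.execute(
        "SELECT region, SUM(amount) FROM sales GROUP BY region"
    ))
    conn.close()
    return result


if __name__ == "__main__":
    print(reduce_phase(map_phase(rows)))
    print(sql_totals(rows))
```

Both paths produce the same totals. The difference is in who writes the plumbing: in the SQL version, the engine decides how to group and sum.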
- Analytics as a service aims to solve big data's big problems
Big data workloads can take a toll on traditional IT systems, as large data sets hog resources and are often expensive to run. That's where public cloud comes in, with its scalability and pay-per-use pricing model. With public cloud pricing, organizations pay only for what they use, rather than paying a one-size-fits-all price for big data projects. Additionally, public cloud allows resources to be spun up or down based on a workload's needs.
But there's a flip side to using public cloud for big data. While software as a service can cut costs, security and latency concerns remain. Alex Barrett, editor in chief of Modern Infrastructure, pieces together this big data in the cloud puzzle.
- Apache Sqoop a crucial link to big data analytics in the cloud
Apache Hadoop is an increasingly common distributed computing framework for processing big data. As cloud providers capitalize on the framework and more users move data sets between Hadoop and relational databases, tools that help transfer data become important. Apache Sqoop, which transfers bulk data between Hadoop and relational databases, is one such tool.
Although Sqoop has its advantages, default parallelism can be a concern. Sullivan discusses what to watch for and how to ensure optimal performance with Sqoop.
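One way parallelism can disappoint is when the rows are unevenly spread across the column used to divide the work. Sqoop runs four parallel map tasks by default (adjustable with `--num-mappers`) and divides the range of a split column (`--split-by`) among them. The sketch below is not Sqoop's code; it is a simplified model, with hypothetical function names and key values, of what equal-width range splitting does to skewed keys.

```python
def split_ranges(lo, hi, num_splits=4):
    """Divide [lo, hi] into num_splits contiguous, equal-width half-open ranges."""
    width = (hi - lo + 1) / num_splits
    bounds = [lo + round(i * width) for i in range(num_splits)] + [hi + 1]
    return [(bounds[i], bounds[i + 1]) for i in range(num_splits)]


def rows_per_split(keys, num_splits=4):
    """Count how many keys fall into each equal-width range."""
    ranges = split_ranges(min(keys), max(keys), num_splits)
    return [sum(1 for k in keys if start <= k < end) for start, end in ranges]


if __name__ == "__main__":
    even_keys = list(range(1, 101))                      # IDs 1..100, no gaps
    skewed_keys = list(range(1, 98)) + [1000, 2000, 4000]  # a few large outliers
    print("even:  ", rows_per_split(even_keys))
    print("skewed:", rows_per_split(skewed_keys))
```

In this model, the evenly distributed keys give each of the four tasks 25 rows, while the skewed keys leave one task with 98 of the 100 rows. Under those assumptions, adding more tasks would not help much; choosing a more evenly distributed split column would.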
- Amazon DynamoDB, Accumulo access controls secure big data in the cloud
While cloud computing is a solid option for big data projects, security is a hurdle for some organizations. But, according to Sullivan, there are three options for making NoSQL databases more secure: Accumulo, Amazon Web Services' DynamoDB and MarkLogic.
Apache Accumulo, a distributed key-value data store, offers cell-based access controls to determine who can access an organization's big data. AWS' key-value data store, DynamoDB, addresses access control through Identity and Access Management (IAM) policies. MarkLogic, a document-based NoSQL database, provides role-based access control over both data access and execution.
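Cell-based access control is easiest to picture with a small model. The sketch below is a deliberately simplified illustration, not Accumulo's API: each cell carries a set of labels, and a reader sees the cell only when holding all of them. Accumulo's real visibility expressions are richer (they also allow OR and grouping), and the `Cell` class, labels and records here are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Cell:
    row: str
    column: str
    value: str
    required_labels: frozenset  # simplified: ALL labels must be held (AND only)


def visible_cells(cells, user_authorizations):
    """Return only the cells whose required labels the user fully holds."""
    auths = frozenset(user_authorizations)
    return [c for c in cells if c.required_labels <= auths]


# Hypothetical records: two cells in the same row with different sensitivity.
cells = [
    Cell("patient1", "name", "A. Smith", frozenset({"staff"})),
    Cell("patient1", "diagnosis", "confidential", frozenset({"staff", "medical"})),
    Cell("patient1", "room", "12B", frozenset()),
]

if __name__ == "__main__":
    for user, auths in {"clerk": {"staff"}, "doctor": {"staff", "medical"}}.items():
        print(user, [c.column for c in visible_cells(cells, auths)])
```

The point of the model is that access is decided per cell rather than per table or per row: the clerk and the doctor read the same row but see different columns.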
Nicholas Rando is assistant site editor for SearchCloudComputing.