Amazon Neptune arms analytics teams with graph databases
Turn to a graph database to put aside table joins for good. With Amazon Neptune, a developer can assess and query relationships between data sets.
As data sets continue to grow in size and complexity, IT leaders need more efficient storage options, which leads them, in some cases, to graph databases.
A graph database design enables developers to process and correlate the relationships between data points. Amazon Neptune is one example of a managed graph database service, and it can provide an ideal platform for scientific, social and behavioral types of analytical tasks.
Let's take a closer look at the specifics of graph databases and what Amazon Neptune offers for users.
What is a graph database?
A typical relational database is a highly structured entity in which specific data types are recorded in clearly defined formats, such as rows and columns. A user can then query the database to find specific information related to the data. This kind of search works well in a single table, but indirect queries, or relationships that span dissimilar tables, are harder to express and often require foreign keys or table joins.
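To make the join problem concrete, here is a minimal sketch that uses Python's built-in sqlite3 module. The person and friendship tables, their columns and the sample names are hypothetical, chosen only for illustration. Even a simple "friends of friends" question already needs three joins.

```python
import sqlite3

# In-memory database with two hypothetical tables: people and friendships.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE person (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE friendship (
        person_id INTEGER REFERENCES person(id),
        friend_id INTEGER REFERENCES person(id)
    );
    INSERT INTO person VALUES (1, 'Alice'), (2, 'Bob'), (3, 'Carol');
    INSERT INTO friendship VALUES (1, 2), (2, 3);
""")

# Friends of Alice's friends: one join per hop, plus a join to read the name.
friends_of_friends = conn.execute("""
    SELECT p2.name
    FROM person p1
    JOIN friendship f1 ON f1.person_id = p1.id
    JOIN friendship f2 ON f2.person_id = f1.friend_id
    JOIN person p2 ON p2.id = f2.friend_id
    WHERE p1.name = 'Alice'
""").fetchall()

print(friends_of_friends)
```

Each additional hop in the relationship adds another join to the query, which is the overhead a graph model is meant to avoid.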
A graph database specializes in data relationships. Each record that resides in a graph database is a node, also called a vertex. Each node contains information, or properties, and nodes are connected by relationships called edges. Organizations primarily use graph databases when they are interested in the relationships between data. These databases perform relationship queries well and make it easy to add new node types and property compositions.
For example, consider a social networking platform that uses a graph database design. Each person is a node, and the properties of each node include information, such as name, address and age. A query might ask to find relationships between these nodes by locating common friends, relatives and employers. These types of applications typically represent those relationships visually, such as a web of friends.
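The social network example can be sketched in a few lines of plain Python. This is an in-memory toy model, not how Neptune stores data; the class names, the "friend" label and the sample people are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Person:
    """A node: one person plus free-form properties."""
    name: str
    properties: dict = field(default_factory=dict)


class SocialGraph:
    def __init__(self):
        self.nodes = {}  # name -> Person
        self.edges = {}  # (name, label) -> set of connected names

    def add_person(self, name, **properties):
        self.nodes[name] = Person(name, properties)

    def connect(self, a, label, b):
        # Store the relationship in both directions (friendship is mutual).
        self.edges.setdefault((a, label), set()).add(b)
        self.edges.setdefault((b, label), set()).add(a)

    def neighbors(self, name, label):
        return self.edges.get((name, label), set())

    def common(self, a, b, label):
        # Nodes related to both a and b through the same kind of edge.
        return self.neighbors(a, label) & self.neighbors(b, label)


graph = SocialGraph()
graph.add_person("Alice", age=34, address="12 Elm St")
graph.add_person("Bob", age=41)
graph.add_person("Carol", age=29)
graph.add_person("Dave", age=52)
graph.connect("Alice", "friend", "Bob")
graph.connect("Carol", "friend", "Bob")
graph.connect("Alice", "friend", "Dave")

common_friends = graph.common("Alice", "Carol", "friend")
print(common_friends)
```

Finding common friends here is a set intersection over edges rather than a chain of table joins, and the same `common` method would work for other edge labels, such as relatives or employers.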
Dive into Amazon Neptune
Amazon Neptune is a graph database service that specializes in processing highly connected data sets, with high-performance search capabilities for relationships among billions of nodes.
Neptune works with open graph APIs, which support various graph model and query language combinations, such as Apache TinkerPop with Gremlin and the World Wide Web Consortium's (W3C) Resource Description Framework (RDF) with SPARQL. These tool combinations enable developers to create and access graph database designs on AWS. A user can load data from Amazon S3 directly into Amazon Neptune. Property graph data, loaded for use with Apache TinkerPop, must be in CSV format, while RDF data can use serializations such as Turtle, N-Triples, N-Quads and RDF/XML.
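As a rough illustration of the CSV route, the sketch below uses Python's csv module to build one vertex file and one edge file. The header names (`~id`, `~label`, `~from`, `~to` and typed property columns such as `name:String`) follow the Gremlin load format as I understand it, so check them against the current Neptune documentation before relying on them. The sample rows are invented, and uploading the files to S3 and starting the load are left out.

```python
import csv
import io


def to_csv(header, rows):
    """Render a header and rows as CSV text, one line per record."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(header)
    writer.writerows(rows)
    return buffer.getvalue()


# Vertex file: an ID, a label, then typed property columns (assumed format).
vertex_csv = to_csv(
    ["~id", "~label", "name:String", "age:Int"],
    [["p1", "person", "Alice", 34], ["p2", "person", "Bob", 41]],
)

# Edge file: an ID, source and target vertex IDs, a label, then properties.
edge_csv = to_csv(
    ["~id", "~from", "~to", "~label", "weight:Double"],
    [["e1", "p1", "p2", "knows", 0.5]],
)

print(vertex_csv)
print(edge_csv)
```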
Amazon Neptune can perform more than 100,000 graph queries per second, and Auto Scaling can help provision more resources for Neptune as data sets and query needs change. In addition to high performance and scalability, the graph database service also touts 99.99% availability, as it automatically replicates six copies of data across three availability zones and performs continuous data backups to S3. AWS secures Neptune through Amazon Virtual Private Cloud (VPC) network isolation and data encryption.
Developers can select from several database instance sizes: db.r4.large, db.r4.xlarge, db.r4.2xlarge, db.r4.4xlarge and db.r4.8xlarge. An account can run up to 40 database instances, although AWS can increase some limits upon request.
A developer can access Neptune instances only through a VPC and can manage them through the AWS Management Console. AWS also limits accounts to 20 clusters, 50 database subnet groups, 100 database snapshots and 25 database security groups. Neptune requires a minimum of 10 GB of storage, which automatically scales in 10 GB increments up to 64 TB.
How to use Amazon Neptune
Once a developer launches an Amazon Neptune database, they can use either the Gremlin or the SPARQL endpoint to process requests. For example, many users first launch the Gremlin Console and then connect it to the Neptune database instance, which makes the database available from the console. Be sure to use remote mode so that Gremlin queries are sent over the remote connection.
A developer might add a vertex, or multiple vertices, to start. Each vertex carries one or more labels and contains information, or properties, such as names. Next, the developer can add edges, which establish the connections or relationships between vertices and can carry properties of their own, such as the weight of a relationship. Developers can manage vertices, properties and edges through the Gremlin Console.
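The sketch below shows roughly what those Gremlin statements look like. It is a small Python helper that only builds query strings; it does not connect to Neptune or run anything. The helper names, vertex IDs, labels and property values are hypothetical, and the quoting is deliberately simple, so treat the output as something to paste into the Gremlin Console for experimentation rather than as production code.

```python
def quote(value):
    """Render a Python value as a Gremlin literal (simple cases only)."""
    if isinstance(value, bool):
        return "true" if value else "false"
    if isinstance(value, (int, float)):
        return repr(value)
    escaped = str(value).replace("\\", "\\\\").replace("'", "\\'")
    return f"'{escaped}'"


def add_vertex(vertex_id, label, **properties):
    """Build a statement that adds one labeled vertex with properties."""
    query = f"g.addV({quote(label)}).property(id, {quote(vertex_id)})"
    for key, value in properties.items():
        query += f".property({quote(key)}, {quote(value)})"
    return query


def add_edge(from_id, label, to_id, weight):
    """Build a statement that adds a weighted edge between two vertices."""
    return (
        f"g.V({quote(from_id)}).addE({quote(label)})"
        f".to(__.V({quote(to_id)})).property('weight', {quote(weight)})"
    )


statements = [
    add_vertex("p1", "person", name="Alice"),
    add_vertex("p2", "person", name="Bob"),
    add_edge("p1", "knows", "p2", 0.5),
]
for statement in statements:
    print(statement)
```

The vertices are created first because the edge statement looks both of them up by ID.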
Now, the developer can query the graph with what is called a traversal. For example, a traversal can list vertices with specific labels or check social relationships. A traversal can chain multiple steps, in which each step refines the results of the one before it, and only the final step returns the desired information.
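The chained refinement described above can be imitated in plain Python. The class below is a toy stand-in for a Gremlin traversal, not the TinkerPop API: the method names only loosely mirror Gremlin steps such as hasLabel, has, out and values, and the sample vertices and edges are invented.

```python
# Hypothetical sample data: vertex ID -> properties, plus directed edges.
VERTICES = {
    "p1": {"label": "person", "name": "Alice"},
    "p2": {"label": "person", "name": "Bob"},
    "p3": {"label": "person", "name": "Carol"},
    "c1": {"label": "company", "name": "Acme"},
}
EDGES = [
    ("p1", "knows", "p2"),
    ("p2", "knows", "p3"),
    ("p1", "worksAt", "c1"),
]


class Traversal:
    """Holds the current set of vertices; each step returns a narrower one."""

    def __init__(self, vertex_ids):
        self.vertex_ids = list(vertex_ids)

    def has_label(self, label):
        return Traversal(
            v for v in self.vertex_ids if VERTICES[v]["label"] == label
        )

    def has(self, key, value):
        return Traversal(
            v for v in self.vertex_ids if VERTICES[v].get(key) == value
        )

    def out(self, edge_label):
        # Follow outgoing edges with the given label to the next vertices.
        return Traversal(
            dst
            for v in self.vertex_ids
            for (src, label, dst) in EDGES
            if src == v and label == edge_label
        )

    def values(self, key):
        # Final step: return property values instead of another traversal.
        return [VERTICES[v][key] for v in self.vertex_ids]


def V():
    """Start a traversal from every vertex in the graph."""
    return Traversal(VERTICES)


people = V().has_label("person").values("name")
alice_friends = V().has("name", "Alice").out("knows").values("name")
friends_of_friends = (
    V().has("name", "Alice").out("knows").out("knows").values("name")
)
print(people, alice_friends, friends_of_friends)
```

Only the last call in each chain produces data; every earlier call just narrows the set of vertices under consideration.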
Amazon Neptune pricing
It can be tricky to assess Amazon Neptune pricing, as you need other AWS resources to facilitate the service. For example, costs include per-hour fees for memory-optimized database instances, which range from $0.348 per hour for db.r4.large instances to $5.568 per hour for db.r4.8xlarge instances. Amazon Neptune imposes additional monthly storage costs per gigabyte, while data movement also carries fees per million requests. S3 backup storage provides a free tier, but it can impose costs for additional capacity, as well as data movement requests, such as puts and gets.
Consequently, it is difficult to predict precise and consistent costs for an Amazon Neptune deployment. AWS shows an example in the range of $300 per month for a single small database instance, plus storage and modest data transfer activities. However, larger or multiple database instance deployments impose considerably higher costs.
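A back-of-the-envelope estimate helps show where the roughly $300 figure comes from. The sketch below uses only the two hourly rates quoted above and assumes 730 hours in an average month. The storage and request rates are left as parameters because the actual prices vary by region and over time; the values passed in the usage example are placeholders, not quoted prices.

```python
HOURS_PER_MONTH = 730  # assumption: average number of hours in a month

# Hourly instance rates quoted in this article (USD).
HOURLY_RATES_USD = {
    "db.r4.large": 0.348,
    "db.r4.8xlarge": 5.568,
}


def monthly_estimate(
    instance_type,
    instance_count=1,
    storage_gb=0,
    storage_rate_per_gb=0.0,
    request_millions=0,
    rate_per_million_requests=0.0,
):
    """Rough monthly cost in USD: instances plus storage plus requests."""
    compute = HOURLY_RATES_USD[instance_type] * HOURS_PER_MONTH * instance_count
    storage = storage_gb * storage_rate_per_gb
    requests = request_millions * rate_per_million_requests
    return round(compute + storage + requests, 2)


# Instance cost alone for the smallest and largest sizes.
small = monthly_estimate("db.r4.large")
large = monthly_estimate("db.r4.8xlarge")

# Placeholder storage and request rates; substitute current AWS prices.
small_with_extras = monthly_estimate(
    "db.r4.large",
    storage_gb=100,
    storage_rate_per_gb=0.10,
    request_millions=50,
    rate_per_million_requests=0.20,
)
print(small, large, small_with_extras)
```

With these inputs, a single db.r4.large instance costs about $254 per month before storage and request charges, which is consistent with an all-in figure in the range of $300, while a single db.r4.8xlarge instance costs about $4,065 per month for compute alone.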
Amazon Neptune competes with other graph database services, including Neo4j Graph Platform and IBM Graph, which specializes in property graphs. Azure also offers Cosmos DB, which is designed to support graphs and traversals.