Thursday, October 17, 2013
Hadoop
What Hadoop is.
Simply put, it's an Apache open source project that allows users to perform highly intensive data analytics on structured and unstructured data across hundreds or thousands of nodes that are local or geographically dispersed. Hadoop was designed to ingest mammoth amounts of unstructured data and, through its distributed file system (the Hadoop Distributed File System, or HDFS), distribute workloads across a vast network of independent compute nodes to rapidly map, sort and categorize data to facilitate big data analytical queries.
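To make that map-and-sort workflow concrete, here is a minimal sketch of the classic word-count job written against Hadoop's MapReduce Java API. The class name and the input/output paths (taken from the command line) are illustrative placeholders of my own, not something from the article.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map step: runs on the nodes holding the input splits and emits (word, 1) pairs.
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reduce step: the framework has already sorted and grouped the pairs by word.
  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);   // local pre-aggregation on each node
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));    // input directory in HDFS
    FileOutputFormat.setOutputPath(job, new Path(args[1]));  // output directory in HDFS
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

A job like this is packaged into a JAR and submitted with the hadoop jar command; the framework handles splitting the input across nodes, shuffling and sorting the mapper output, and running the reducers.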
Hadoop was also designed to work natively with the internal disk in each of the independent compute nodes within its clustered framework, both for cost efficiency and to ensure that data is always available locally to the processing node. Jobs are managed and delegated across the cluster, where data is parsed, classified and stored on local disk. Each block of data is written to the local disk, and two additional copies are replicated to other nodes for redundancy. All of the replicas are readable, so any copy can be used for processing tasks.
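As a rough illustration of that replication behavior (my own sketch, not from the article), the snippet below uses the HDFS FileSystem Java API to write a small file with the default replication factor of three and then print which hosts hold the replicas of each block. The /data/sample.txt path is hypothetical, and the code assumes it runs with access to a configured cluster.

```java
import java.util.Arrays;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationDemo {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    conf.set("dfs.replication", "3");  // one copy plus two replicas (the HDFS default)

    FileSystem fs = FileSystem.get(conf);
    Path file = new Path("/data/sample.txt");  // hypothetical path for illustration

    // Write a small file; when the client runs on a DataNode, the first replica
    // is placed on that node's local disk and the other two on different nodes.
    try (FSDataOutputStream out = fs.create(file)) {
      out.writeUTF("sample record\n");
    }

    // Every replica is readable, so any of these hosts can serve a processing task.
    FileStatus status = fs.getFileStatus(file);
    for (BlockLocation block : fs.getFileBlockLocations(status, 0, status.getLen())) {
      System.out.println("Block replicas on: " + Arrays.toString(block.getHosts()));
    }
  }
}
```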
In determining whether SAN or NAS storage can be used with Hadoop, we have to consider that Hadoop is disk-agnostic; that means SAN and NAS resources can be used as the primary storage layer to service Hadoop workloads. But a natural follow-up question is this: Will SAN and NAS storage be the most cost-effective way to deploy Hadoop? If a Hadoop implementation will be confined to one or two locations, managing a centralized storage resource may well make sense, especially if an existing array has already been depreciated.
Hadoop implementation
Disaster Recovery and Virtualization
- Virtual disaster recovery FAQ
- Virtualization and disaster recovery planning: A guide on storage and server virtualization
- What you need to know about VMware disaster recovery and virtual DR
- Implementing snapshot technology in concert with backup
- Disaster recovery methods for virtual servers
- VM data protection: When snapshots just aren’t enough
- Snapshot technology: The role of snapshots in today's backup environments
- Relying on snapshots and replication in a backup and recovery strategy
- Use a Virtual Storage Appliance for Remote Site Data Protection
- NetApp Brings Private Storage to Amazon Web Services