Thursday, October 17, 2013

Hadoop

What Hadoop is
Simply put, it's an Apache open source project that lets users run highly intensive analytics on structured and unstructured data across hundreds or thousands of nodes, whether local or geographically dispersed. Hadoop was designed to ingest mammoth amounts of unstructured data and, through its distributed file system (the Hadoop Distributed File System, or HDFS), spread workloads across a vast network of independent compute nodes so data can be rapidly mapped, sorted and categorized for big data analytical queries.
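That map/sort/categorize pattern is the MapReduce model. As a rough sketch of what such a job looks like in Java against the Hadoop MapReduce API, here is the canonical word-count example (not taken from this article; input and output paths are whatever HDFS directories you pass on the command line):

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map step: each node scans its local block of input and emits (word, 1) pairs.
  public static class TokenizerMapper
       extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }

  // Reduce step: the framework sorts and groups by key, then each reducer sums the counts.
  public static class IntSumReducer
       extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // HDFS input directory
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // HDFS output directory
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

The mappers run next to the data on each node; only the much smaller intermediate results are shuffled across the network to the reducers.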
Hadoop was also designed to work natively with the internal disk in each independent compute node within its clustered framework, both for cost efficiency and to ensure that data is always available locally to the processing node. Jobs are managed and delegated across the cluster, where data is parsed, classified and stored on local disk. By default, one copy of each block is written to local disk and two more replicas are written to other nodes for redundancy. All replicas are readable, so any copy can be used for processing tasks.
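As a minimal sketch of that replication behavior, a client writing a file through the HDFS Java API might look like the following. This assumes a reachable cluster; namenode.example.com is a placeholder for your NameNode address, and dfs.replication is the standard property whose default value is 3:

import java.net.URI;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationDemo {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // dfs.replication defaults to 3: one block copy plus two replicas on other nodes.
    conf.set("dfs.replication", "3");

    // "namenode.example.com" is a placeholder for the cluster's NameNode address.
    FileSystem fs = FileSystem.get(URI.create("hdfs://namenode.example.com:8020"), conf);

    Path file = new Path("/data/ingest/sample.txt");
    try (FSDataOutputStream out = fs.create(file)) {
      out.write("unstructured data goes here".getBytes(StandardCharsets.UTF_8));
    }

    // The NameNode reports the replication factor recorded for the file.
    FileStatus status = fs.getFileStatus(file);
    System.out.println("Replication factor for " + file + ": " + status.getReplication());

    fs.close();
  }
}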
In determining whether SAN and NAS storage can serve Hadoop, we have to consider that Hadoop is disk-agnostic; that means SAN and NAS resources can be used as the primary storage layer to service Hadoop workloads. But a natural follow-up question is this: will SAN and NAS storage be the most cost-effective way to deploy Hadoop? If a Hadoop implementation will be confined to one or two locations, the benefits of managing a centralized storage resource can make sense, especially if an existing array has already been depreciated.
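To make the "disk-agnostic" point concrete: the DataNode simply writes blocks to whatever paths it is given, so pointing those paths at mounted SAN LUNs or NAS exports is a configuration decision, not a code change. A hedged sketch follows; the mount points are hypothetical, and dfs.datanode.data.dir is the Hadoop 2.x property normally set in hdfs-site.xml rather than in code:

import org.apache.hadoop.conf.Configuration;

public class DataDirConfig {
  public static void main(String[] args) {
    Configuration conf = new Configuration();
    // Each comma-separated path is a directory where the DataNode stores blocks.
    // /mnt/san-lun0 and /mnt/nas-export0 are hypothetical mount points for
    // SAN LUNs or NAS exports presented to the node as local file systems.
    conf.set("dfs.datanode.data.dir", "/mnt/san-lun0/hdfs/data,/mnt/nas-export0/hdfs/data");
    System.out.println("DataNode directories: " + conf.get("dfs.datanode.data.dir"));
  }
}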
[Figure: Hadoop implementation]
