Kang Xiao's Github Pages

blogging for distributed systems.

Hadoop HDFS Access API Comparison

comparison

--------------------------------------------------------------------------------------------------------------------------------------------------------
|API        |Language |Access Type            |Advantage                              |Disadvantage                                                    |
--------------------------------------------------------------------------------------------------------------------------------------------------------
|java       |Java     |first-class native API |supports all HDFS operations, reliable |not usable from other languages                                 |
--------------------------------------------------------------------------------------------------------------------------------------------------------
|libhdfs    |C/C++    |calls Java through JNI |C/C++ support                          |memory leaks                                                    |
--------------------------------------------------------------------------------------------------------------------------------------------------------
|fuse       |any      |mount through FUSE     |usable as a local file system          |mounts must be maintained                                       |
--------------------------------------------------------------------------------------------------------------------------------------------------------
|h(s)ftp    |any      |HTTP(S)                |simple HTTP(S) access                  |read-only access                                                |
--------------------------------------------------------------------------------------------------------------------------------------------------------
|webhdfs    |any      |REST                   |full REST API, no HDFS client needed   |REST overhead                                                   |
--------------------------------------------------------------------------------------------------------------------------------------------------------
|hdfsproxy  |any      |REST                   |REST API, no HDFS client needed        |proxy server bottleneck                                         |
--------------------------------------------------------------------------------------------------------------------------------------------------------
|hoop       |any      |REST                   |full REST API, no HDFS client needed   |proxy server bottleneck                                         |
--------------------------------------------------------------------------------------------------------------------------------------------------------
|thriftfs   |any      |Thrift                 |Thrift interface                       |Thrift server bottleneck, text files only, no longer maintained |
--------------------------------------------------------------------------------------------------------------------------------------------------------

conclusion

  1. use the native java api if java is your language
  2. use libhdfs if c/c++ is your language and memory leaks are not a serious problem for you (such as in a non-daemon process)
  3. use fuse if the cost of maintaining mounts is low for you (such as with only one or two machines)
  4. use webhdfs on an internal network if you use another language and do not want to deploy an hdfs client
  5. otherwise, use hoop on an external network
  6. thriftfs is not recommended; use hoop or webhdfs instead
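
To make the "no hdfs client needed" point for webhdfs concrete, here is a minimal sketch of how a request URL is put together. It only builds the URL and does not contact a cluster. The helper name `webhdfs_url`, the host `namenode.example.com`, the port 50070 (the NameNode HTTP port depends on your Hadoop version and configuration), the path, and the user name are all assumptions made up for illustration.

```python
from urllib.parse import quote, urlencode


def webhdfs_url(host, port, path, op, params=None):
    """Build a URL of the form http://<host>:<port>/webhdfs/v1/<path>?op=<OP>.

    `webhdfs_url` is a hypothetical helper for this post, not part of Hadoop.
    """
    if not path.startswith("/"):
        raise ValueError("HDFS path must be absolute")
    # The operation name always comes first; extra query parameters follow it.
    query = [("op", op.upper())] + sorted((params or {}).items())
    return "http://{}:{}/webhdfs/v1{}?{}".format(
        host, port, quote(path), urlencode(query)
    )


# Made-up NameNode address and file path, for illustration only.
read_url = webhdfs_url("namenode.example.com", 50070, "/user/alice/data.txt", "open")
list_url = webhdfs_url(
    "namenode.example.com", 50070, "/user/alice", "liststatus",
    params={"user.name": "alice"},
)
print(read_url)
print(list_url)
```

Any language with an HTTP library can issue a GET against such a URL, which is why the table lists the language for webhdfs as "any".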

links