Today I helped a colleague with a problem: connecting to a CarbonData store from Java. I know nothing about the big-data field, but I still wanted to give it a try. So what exactly is CarbonData?
Apache CarbonData is a new big-data file format that uses advanced columnar storage, indexing, compression, and encoding techniques to deliver faster interactive queries and better compute efficiency, helping accelerate queries over petabyte-scale data. For query performance comparisons, see the CarbonData benchmark report; for installation details, see the CarbonData installation documentation.
It is a data format Huawei contributed to Apache. Judging by the stars on GitHub, it looks fairly average; but never mind that, let's get hands-on first. Here I assume Hadoop and Spark are already installed as a distributed environment on a server cluster, while my development machine has neither installed.
References:
1. CarbonData usage example (Java): https://blog.csdn.net/u013181284/article/details/77574094
2. Installing Spark + Hadoop, building a Spark/Hadoop distributed cluster (built it myself!!): https://blog.csdn.net/u014552678/article/details/78584998
3. apache/carbondata: https://github.com/apache/carbondata/tree/master/build
4. Building and installing Spark 1.6.1 + CarbonData 1.1.0: https://ttnews.xyz/a/5b2513dd9033c003a9a07ec0
5. wso2/carbon-kernel: https://github.com/wso2/carbon-kernel
6. SDK Guide: https://carbondata.apache.org/sdk-guide.html
7. https://mvnrepository.com/artifact/org.apache.carbondata/carbondata-parent/1.6.1
I downloaded the source code ( https://github.com/apache/carbondata.git ), which contains examples. In a let's-see-what-happens spirit, I loaded the Spark example code into IDEA. Then came the long wait: a whole afternoon and the dependencies still had not resolved. It wasn't stuck, it just kept spinning and spinning, downloading and downloading, until I no longer had any idea what it was even fetching. Keep waiting, then; since it wasn't finished, I didn't dare say anything and didn't dare ask.
After several hours of struggle, the dependencies were finally resolved and the main function no longer showed errors. The pom is as follows:
<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>

    <groupId>com.proheng</groupId>
    <artifactId>CarbonTest</artifactId>
    <version>1.0-SNAPSHOT</version>
    <name>CarbonTest</name>

    <properties>
        <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
        <maven.compiler.source>1.7</maven.compiler.source>
        <maven.compiler.target>1.7</maven.compiler.target>
    </properties>

    <dependencies>
        <dependency>
            <groupId>org.apache.carbondata</groupId>
            <artifactId>carbondata-parent</artifactId>
            <version>1.6.1</version>
            <type>pom</type>
        </dependency>
        <dependency>
            <groupId>org.apache.carbondata</groupId>
            <artifactId>carbondata-store-sdk</artifactId>
            <version>1.6.1</version>
        </dependency>
        <dependency>
            <groupId>org.apache.carbondata</groupId>
            <artifactId>carbondata-core</artifactId>
            <version>1.6.1</version>
        </dependency>
        <dependency>
            <groupId>org.apache.carbondata</groupId>
            <artifactId>carbondata-spark2</artifactId>
            <version>1.6.1</version>
        </dependency>
        <!-- the Scala suffix (_2.12) must match the Scala version the CarbonData jars were built against -->
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-sql_2.12</artifactId>
            <version>2.4.4</version>
        </dependency>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-sql-kafka-0-10_2.12</artifactId>
            <version>2.4.4</version>
            <scope>provided</scope>
        </dependency>
    </dependencies>

    <build>
    </build>
</project>
The code of the main class (App.java) is as follows:
package com.proheng;

import java.io.IOException;

import org.apache.carbondata.core.constants.CarbonCommonConstants;
import org.apache.carbondata.core.util.CarbonProperties;
import org.apache.spark.sql.CarbonSession;
import org.apache.spark.sql.SparkSession;

public class App {

    public static void main(String[] args) throws IOException {
        CarbonProperties.getInstance()
            .addProperty(CarbonCommonConstants.CARBON_TIMESTAMP_FORMAT, "yyyy/MM/dd HH:mm:ss")
            .addProperty(CarbonCommonConstants.CARBON_DATE_FORMAT, "yyyy/MM/dd");

        SparkSession.Builder builder = SparkSession.builder()
            .master("local")
            .appName("App")
            .config("spark.driver.host", "http://192.168.1.33:7077");

        SparkSession carbon = new CarbonSession.CarbonBuilder(builder)
            .getOrCreateCarbonSession();

        exampleBody(carbon);
        carbon.close();
    }

    public static void exampleBody(SparkSession carbon) throws IOException {
        carbon.sql("SELECT * "
                + "FROM source "
                + "WHERE stringField = 'spark' and floatField > 2.8")
            .show();
    }
}
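The two CarbonProperties entries above tell CarbonData how to parse timestamp and date strings during data load, and the patterns follow the standard Java date-format syntax. A quick standalone sanity check of those patterns (plain JDK, no CarbonData required; the sample values are mine):

```java
import java.time.LocalDate;
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;

// Standalone check of the patterns passed to CARBON_TIMESTAMP_FORMAT and
// CARBON_DATE_FORMAT: input values in the source data must match them exactly.
public class FormatCheck {
    static final DateTimeFormatter TS = DateTimeFormatter.ofPattern("yyyy/MM/dd HH:mm:ss");
    static final DateTimeFormatter D  = DateTimeFormatter.ofPattern("yyyy/MM/dd");

    public static LocalDateTime parseTimestamp(String s) {
        return LocalDateTime.parse(s, TS);   // throws DateTimeParseException on mismatch
    }

    public static LocalDate parseDate(String s) {
        return LocalDate.parse(s, D);
    }

    public static void main(String[] args) {
        System.out.println(parseTimestamp("2019/11/05 10:30:00"));
        System.out.println(parseDate("2019/11/05"));
    }
}
```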
Problems:
(1) The resulting error was: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(GIS); groups with view permissions: Set(); users with modify permissions: Set(GIS); groups with modify permissions: It seemed to be some kind of permissions problem.
Despite making zero progress, one must maintain an air of mysterious composure.
(2) Failed to locate the winutils binary in the hadoop binary path
This problem and the previous one are really the same issue; the fix is to add the following environment variables on the Windows development machine. Download Hadoop (on Windows you also need the matching winutils.exe in its bin directory) and put it wherever you like. Then add a system environment variable HADOOP_HOME whose value is the Hadoop path, and add a variable HADOOP_USER_NAME with the value root.
Also add %HADOOP_HOME%\bin to the Path variable.
Note: after adding the environment variables, restart for them to take effect. That resolves this error, though of course a new problem follows: the remote Hadoop cannot be reached.
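On Windows, Hadoop locates winutils.exe through the hadoop.home.dir system property or the HADOOP_HOME environment variable, expecting the binary under bin. The sketch below is my own helper that roughly mirrors that lookup; it is handy for verifying the setup from plain Java before launching Spark:

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Roughly mimics how Hadoop resolves winutils.exe on Windows:
// hadoop.home.dir system property first, then the HADOOP_HOME environment
// variable, with bin\winutils.exe expected under that directory.
public class WinutilsCheck {
    public static Path findWinutils() {
        String home = System.getProperty("hadoop.home.dir");
        if (home == null) home = System.getenv("HADOOP_HOME");
        if (home == null) return null;                 // neither setting is present
        Path exe = Paths.get(home, "bin", "winutils.exe");
        return Files.isRegularFile(exe) ? exe : null;  // binary missing -> null
    }

    public static void main(String[] args) {
        Path exe = findWinutils();
        System.out.println(exe != null
                ? "found: " + exe
                : "winutils.exe not found; check HADOOP_HOME");
    }
}
```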
Reference:
1. Running a Spark program on Windows fails with "Failed to locate the winutils binary in the hadoop binary path": https://blog.csdn.net/iamlihongwei/article/details/79876947
(3) StandaloneAppClient$ClientEndpoint:87 - Failed to connect to master
References:
1. Hadoop DataNode cannot connect to the NameNode host: https://blog.csdn.net/a1055186977/article/details/72852112
2. Fixing Spark's "cannot connect to server" error (Failed to connect to master master_hostname:7077): https://blog.csdn.net/ybdesire/article/details/70666544
3. Remote Spark debugging fails with StandaloneAppClient$ClientEndpoint: Failed to connect to master 192.168.126.128:7077: https://blog.csdn.net/cyssxt/article/details/73477754
4. Spark and Java: Exception thrown in awaitResult: https://stackoverflow.com/questions/40439652/spark-and-java-exception-thrown-in-awaitresult
5. Where do you specify the master URL in Spark?: https://stackoverflow.com/questions/40439652/spark-and-java-exception-thrown-in-awaitresult
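A recurring cause behind "Failed to connect to master" is simply a malformed master URL: the standalone master's RPC endpoint has the form spark://host:7077, while http://host:8080 is only the web UI and will not work as a master URL (note the http:// address used in the code earlier). Below is a small validator sketching the accepted shapes; it is my own helper and not a Spark API:

```java
import java.util.regex.Pattern;

// Rough shape of the master URLs Spark accepts: local, local[N], local[*],
// spark://host:port, yarn, and mesos://host:port. The standalone master
// listens on spark://host:7077 by default; an http:// URL is never valid here.
public class MasterUrl {
    private static final Pattern OK = Pattern.compile(
            "local(\\[(\\d+|\\*)])?"
            + "|spark://[^\\s:]+:\\d+"
            + "|yarn"
            + "|mesos://[^\\s:]+:\\d+");

    public static boolean isValid(String url) {
        return url != null && OK.matcher(url).matches();
    }

    public static void main(String[] args) {
        System.out.println(isValid("spark://192.168.1.33:7077")); // valid RPC endpoint
        System.out.println(isValid("http://192.168.1.33:7077"));  // web-UI style, invalid
    }
}
```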
Below is a new pom.xml; the relevant classes can still be found with it, but the connection still cannot be established.
<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>

    <groupId>com.proheng</groupId>
    <artifactId>CarbonTest</artifactId>
    <version>1.0-SNAPSHOT</version>
    <name>CarbonTest</name>

    <properties>
        <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
        <maven.compiler.source>1.7</maven.compiler.source>
        <maven.compiler.target>1.7</maven.compiler.target>
    </properties>

    <dependencies>
        <dependency>
            <groupId>org.apache.carbondata</groupId>
            <artifactId>carbondata-parent</artifactId>
            <version>1.6.1</version>
            <type>pom</type>
        </dependency>
        <dependency>
            <groupId>org.apache.carbondata</groupId>
            <artifactId>carbondata-store-sdk</artifactId>
            <version>1.6.1</version>
        </dependency>
        <!-- note: 1.1.0 here does not match the 1.6.1 CarbonData artifacts above; mixing versions is likely part of the problem -->
        <dependency>
            <groupId>org.apache.carbondata</groupId>
            <artifactId>carbondata-core</artifactId>
            <version>1.1.0</version>
        </dependency>
        <dependency>
            <groupId>org.apache.carbondata</groupId>
            <artifactId>carbondata-spark2</artifactId>
            <version>1.6.1</version>
        </dependency>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-sql_2.12</artifactId>
            <version>2.4.4</version>
        </dependency>
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-client</artifactId>
            <version>2.7.1</version>
        </dependency>
    </dependencies>

    <build>
    </build>
</project>
The result was still an error:
References:
1. Deploying Spark 2.1.0 with CarbonData 1.0.0 in cluster mode, a getting-started guide: https://www.iteblog.com/archives/2078.html
2. Spark 2.0 series: SparkSession explained: http://www.raincent.com/content-85-7196-1.html
3. Introducing Spark 2.0: SparkSession creation and related APIs: https://www.iteblog.com/archives/1673.html