Cloudera 5.4.2. Avro block size is invalid or too large when using Flume and Twitter streams
I ran into a small problem while trying Cloudera 5.4.2. I was following this article:
Apache Flume - Fetching Twitter Data http://www.tutorialspoint.com/apache_flume/fetching_twitter_data.htm
It fetches tweets with Flume and the Twitter streaming API so the data can be analyzed. Everything goes smoothly: create a Twitter app, create a directory on HDFS, configure Flume, start fetching data, then create a schema on top of the tweets.
Then here is the problem. The Twitter source converts tweets to Avro format and sends Avro events to the downstream HDFS sink. When the Avro-backed Hive table loads the data, I get the error "Avro block size is invalid or too large".
What is an Avro block, and what is the block size limit? Can I change it? What does this message mean? Is the file corrupt, or are some of the records bad? If the Twitter source had hit bad data, it should have failed. And if all the tweets were converted to Avro successfully, the Avro data should be readable, right?
I also tried avro-tools-1.7.7.jar:
java -jar avro-tools-1.7.7.jar tojson FlumeData.1458090051232
{"id":"710300089206611968","user_friends_count":{"int":1527},"user_location":{"string":"1633"},"user_description":{"string":"Steady Building an Empire..... #UGA"},"user_statuses_count":{"int":44471},"user_followers_count":{"int":2170},"user_name":{"string":"Esquire Shakur"},"user_screen_name":{"string":"Esquire_Bowtie"},"created_at":{"string":"2016-03-16T23:01:52Z"},"text":{"string":"RT @ugaunion: .@ugasga is hosting a debate between the three SGA executive tickets. Learn more about their plans to serve you https://t.co/…"},"retweet_count":{"long":0},"retweeted":{"boolean":true},"in_reply_to_user_id":{"long":-1},"source":{"string":"<a href=\"http://twitter.com/download/iphone\" rel=\"nofollow\">Twitter for iPhone</a>"},"in_reply_to_status_id":{"long":-1},"media_url_https":null,"expanded_url":null}
{"id":"710300089198088196","user_friends_count":{"int":100},"user_location":{"string":"DMιζΎγγ¦γΎγ(`ο½₯Οο½₯Β΄)"},"user_description":{"string":"Exception in thread "main" org.apache.avro.AvroRuntimeException: java.io.IOException: Block size invalid or too large for this implementation: -40
at org.apache.avro.file.DataFileStream.hasNextBlock(DataFileStream.java:275)
at org.apache.avro.file.DataFileStream.hasNext(DataFileStream.java:197)
at org.apache.avro.tool.DataFileReadTool.run(DataFileReadTool.java:77)
at org.apache.avro.tool.Main.run(Main.java:84)
at org.apache.avro.tool.Main.main(Main.java:73)
Caused by: java.io.IOException: Block size invalid or too large for this implementation: -40
at org.apache.avro.file.DataFileStream.hasNextBlock(DataFileStream.java:266)
... 4 more
Same problem. I have searched a lot and found no answers.
Can anyone give me a solution if you have hit this problem too? Or could someone give me a clue if you fully understand Avro or the Twitter stream?
This is a really exciting challenge. I'm thinking about it.
Use the Cloudera TwitterSource.
Otherwise you will run into this problem:
Can't load twitter avro data correctly into hive table
The article uses the Apache TwitterSource:
TwitterAgent.sources.Twitter.type = org.apache.flume.source.twitter.TwitterSource
Twitter 1% Firehose Source
This source is highly experimental. It connects to the 1% sample Twitter Firehose using streaming API and continuously downloads tweets, converts them to Avro format, and sends Avro events to a downstream Flume sink.
But it should be the Cloudera TwitterSource:
https://blog.cloudera.com/blog/2012/09/analyzing-twitter-data-with-hadoop/
TwitterAgent.sources.Twitter.type = com.cloudera.flume.source.TwitterSource
And don't just download the prebuilt jar, because our Cloudera version is 5.4.2. If you do, you will get this error:
Unable to start Flume due to JAR conflict
You have to compile it yourself with Maven:
https://github.com/cloudera/cdh-twitter-example
Download and compile flume-sources-1.0-SNAPSHOT.jar. This jar contains the Cloudera TwitterSource implementation.
Steps:
wget https://github.com/cloudera/cdh-twitter-example/archive/master.zip
sudo yum install apache-maven
mvn package
Place in your plugins directory:
/var/lib/flume-ng/plugins.d/twitter-streaming/lib/flume-sources-1.0-SNAPSHOT.jar
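Put together, the steps might look roughly like the sketch below. The unzip step, the cdh-twitter-example-master/flume-sources directory, the target/ output path, and the mkdir/cp commands are assumptions that are not spelled out in the steps; the run and build_twitter_source helpers are made up for illustration. By default the script only prints the commands, so the plan can be checked before anything is installed.

```shell
#!/usr/bin/env bash
# Sketch of the build steps. Prints each command by default;
# run with DRY_RUN=0 to actually execute them.
set -e

# Assumed names (not given in the steps): where master.zip unpacks to,
# and where Maven writes the jar.
SRC_DIR="cdh-twitter-example-master/flume-sources"
JAR="target/flume-sources-1.0-SNAPSHOT.jar"
# Plugin directory taken from the steps.
PLUGIN_LIB="/var/lib/flume-ng/plugins.d/twitter-streaming/lib"

# Hypothetical helper: echo the command in dry-run mode, otherwise run it.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

build_twitter_source() {
  run wget https://github.com/cloudera/cdh-twitter-example/archive/master.zip
  run unzip master.zip
  run sudo yum install -y apache-maven
  run cd "$SRC_DIR"
  run mvn package
  run sudo mkdir -p "$PLUGIN_LIB"
  run sudo cp "$JAR" "$PLUGIN_LIB/"
}

build_twitter_source
```

After copying the jar, Flume probably has to be restarted before the com.cloudera.flume.source.TwitterSource type can be used.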
Note: Update Yum to the latest version, otherwise compilation (mvn package) will fail due to some security issue.
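As for what the -40 in the stack trace means: in an Avro data file, every block starts with two longs, the number of records and the size of the block in bytes, both written as zigzag varints. As far as I know, the reader rejects a size that is negative or does not fit in an int. A negative size therefore suggests the reader was decoding bytes that were not a real block header. The sketch below only shows how Avro's long encoding is decoded; decode_avro_long is a made-up helper, not part of avro-tools, and which bytes were actually misread in this file is not known.

```shell
#!/usr/bin/env bash
# Decode one Avro-encoded long (zigzag varint) from hex bytes given as arguments.
# decode_avro_long is a hypothetical helper for illustration only.
decode_avro_long() {
  local n=0 bits=0 b
  for b in "$@"; do
    b=$((16#$b))                        # hex byte -> integer
    n=$(( n | ((b & 0x7f) << bits) ))   # low 7 bits carry data, least significant group first
    bits=$((bits + 7))
    if (( (b & 0x80) == 0 )); then      # high bit clear: last byte of this number
      break
    fi
  done
  # Undo zigzag: even values map to n/2, odd values to -(n+1)/2.
  # (Arithmetic shift is fine for the small values shown here.)
  echo $(( (n >> 1) ^ -(n & 1) ))
}

decode_avro_long 50      # prints 40
decode_avro_long 4f      # prints -40
decode_avro_long ac 02   # prints 150
```

A single byte 0x4f decodes to -40, so one stray byte where a block header was expected would be enough to produce this message. That points to malformed data in the file rather than to a size limit you can raise.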