Cloudera 5.4.2. Avro block size is invalid or too large when using Flume and Twitter streams

Question

I ran into a small problem while trying out Cloudera 5.4.2. It is based on this article:

Apache Flume - Fetching Twitter Data http://www.tutorialspoint.com/apache_flume/fetching_twitter_data.htm

It fetches tweets using Flume and Twitter streaming for data analysis. Everything goes smoothly at first: create a Twitter app, create a directory on HDFS, configure Flume, start fetching data, and create a schema on top of the tweets.

Then here's the problem. The Twitter source converts tweets to Avro format and sends the Avro events to the downstream HDFS sink. When the Avro-backed Hive table loads the data, I get the error "Avro block size is invalid or too large".

What is an Avro block, and what is the limit on its size? Can I change it? What does this message actually mean? Is the whole file at fault, or only some of the records? If the Twitter source had run into bad data, it should have crashed. And if the tweets were converted to Avro format without any problem, then the Avro data should be readable again, right?

I also tried avro-tools-1.7.7.jar:

java -jar avro-tools-1.7.7.jar tojson FlumeData.1458090051232

{"id":"710300089206611968","user_friends_count":{"int":1527},"user_location":{"string":"1633"},"user_description":{"string":"Steady Building an Empire..... #UGA"},"user_statuses_count":{"int":44471},"user_followers_count":{"int":2170},"user_name":{"string":"Esquire Shakur"},"user_screen_name":{"string":"Esquire_Bowtie"},"created_at":{"string":"2016-03-16T23:01:52Z"},"text":{"string":"RT @ugaunion: .@ugasga is hosting a debate between the three SGA executive tickets. Learn more about their plans to serve you https://t.co/…"},"retweet_count":{"long":0},"retweeted":{"boolean":true},"in_reply_to_user_id":{"long":-1},"source":{"string":"<a href=\"http://twitter.com/download/iphone\" rel=\"nofollow\">Twitter for iPhone</a>"},"in_reply_to_status_id":{"long":-1},"media_url_https":null,"expanded_url":null}

{"id":"710300089198088196","user_friends_count":{"int":100},"user_location":{"string":"DM開放してます(`･ω･´)"},"user_description":{"string":"Exception in thread "main" org.apache.avro.AvroRuntimeException: java.io.IOException: Block size invalid or too large for this implementation: -40
    at org.apache.avro.file.DataFileStream.hasNextBlock(DataFileStream.java:275)
    at org.apache.avro.file.DataFileStream.hasNext(DataFileStream.java:197)
    at org.apache.avro.tool.DataFileReadTool.run(DataFileReadTool.java:77)
    at org.apache.avro.tool.Main.run(Main.java:84)
    at org.apache.avro.tool.Main.main(Main.java:73)
Caused by: java.io.IOException: Block size invalid or too large for this implementation: -40
    at org.apache.avro.file.DataFileStream.hasNextBlock(DataFileStream.java:266)
    ... 4 more
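
For context on what "block size" means here: in the Avro object container file format, each data block starts with two Avro longs, the number of records in the block and the size in bytes of the serialized records. An Avro long is stored as a zigzag-coded, variable-length integer. A size of -40 is therefore probably not a limit that can be raised; it suggests the reader hit bytes that are not a real block header. The sketch below only illustrates that encoding. It is not part of avro-tools, and the function name avro_long is made up for this example.

```shell
#!/usr/bin/env bash
# Illustration only: decode one Avro "long" from a list of hex bytes.
# avro_long is a hypothetical helper name, not an avro-tools command.
avro_long() {
  local value=0 bits=0 byte hex
  for hex in "$@"; do
    byte=$(( 0x$hex ))
    # The low 7 bits of each byte carry data, least significant group first.
    value=$(( value | ((byte & 0x7f) << bits) ))
    bits=$(( bits + 7 ))
    # A clear high bit marks the last byte of the number.
    if [ $(( byte & 0x80 )) -eq 0 ]; then
      break
    fi
  done
  # Undo the zigzag mapping: even values are >= 0, odd values are negative.
  echo $(( (value >> 1) ^ -(value & 1) ))
}

avro_long 50   # prints 40, a plausible block size
avro_long 4f   # prints -40, the value from the stack trace
```

So a single stray byte 0x4f where the block header is expected is enough to produce the -40 in the trace, which would point at how the file was written rather than at a configurable size limit.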

Same problem. I have searched a lot but found no answers.

Can anyone offer a solution if you have run into this problem too? Or could someone who understands Avro or the Twitter stream well give me a clue?

This is a really interesting problem, and I am still thinking about it.

+3

hdfs avro flume flume-ng flume-twitter

dong 17 Mar 16 at 6:41 am

1 answer

dong · Answer 1 · 2016-03-23T21:36:28+0000

Use the Cloudera TwitterSource; otherwise you will run into this problem.

Related question: Can't load twitter avro data correctly into hive table

The article uses the Apache TwitterSource:

TwitterAgent.sources.Twitter.type = org.apache.flume.source.twitter.TwitterSource

It is described as follows:

Twitter 1% Firehose Source
This source is highly experimental. It connects to the 1% sample Twitter Firehose using the streaming API, continuously downloads tweets, converts them to Avro format, and sends Avro events to a downstream Flume sink.

It should be the Cloudera TwitterSource instead:

https://blog.cloudera.com/blog/2012/09/analyzing-twitter-data-with-hadoop/

http://blog.cloudera.com/blog/2012/10/analyzing-twitter-data-with-hadoop-part-2-gathering-data-with-flume/

http://blog.cloudera.com/blog/2012/11/analyzing-twitter-data-with-hadoop-part-3-querying-semi-structured-data-with-hive/

TwitterAgent.sources.Twitter.type = com.cloudera.flume.source.TwitterSource
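
To check which of the two classes an existing agent configuration actually refers to, something like the following could help. It is a rough sketch: the function name twitter_source_kind is made up, and the path in the usage note is only an assumption about where the agent configuration lives.

```shell
#!/usr/bin/env bash
# Report which TwitterSource class a Flume properties file refers to.
# twitter_source_kind is a hypothetical helper, not part of Flume.
twitter_source_kind() {
  local conf="$1"
  local assign='^[^#]*\.type[[:space:]]*=[[:space:]]*'
  # Lines with a '#' before the assignment (comments) are ignored.
  if grep -Eq "${assign}org\.apache\.flume\.source\.twitter\.TwitterSource[[:space:]]*\$" "$conf"; then
    echo apache
  elif grep -Eq "${assign}com\.cloudera\.flume\.source\.TwitterSource[[:space:]]*\$" "$conf"; then
    echo cloudera
  else
    echo none
  fi
}
```

For example, twitter_source_kind /etc/flume-ng/conf/flume.conf would print apache for the configuration from the article, assuming that is where the file is stored.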

Don't just download the prebuilt jar, because our Cloudera version is 5.4.2. If you do, you will get this error:

Unable to start Flume due to JAR conflict

You have to compile it yourself with Maven:

https://github.com/cloudera/cdh-twitter-example

Download the source and build flume-sources-1.0-SNAPSHOT.jar. This jar contains the Cloudera TwitterSource implementation.

Steps:

wget https://github.com/cloudera/cdh-twitter-example/archive/master.zip

sudo yum install apache-maven

mvn package

Place the resulting jar in your Flume plugins directory:

/var/lib/flume-ng/plugins.d/twitter-streaming/lib/flume-sources-1.0-SNAPSHOT.jar

Note: Update to the latest version with yum first; otherwise compilation (mvn package) will fail due to some security issue.
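
The last step, copying the built jar into the plugin directory, can be sketched as below. The plugins.d/twitter-streaming/lib layout is taken from the path above; the function name install_plugin_jar and the target/ location in the usage note are assumptions.

```shell
#!/usr/bin/env bash
# Copy a built jar into <plugins root>/twitter-streaming/lib.
# install_plugin_jar is a hypothetical helper name, not a Flume command.
install_plugin_jar() {
  local jar="$1"
  # Second argument is optional; the default matches the path in the answer.
  local plugins_root="${2:-/var/lib/flume-ng/plugins.d}"
  if [ ! -f "$jar" ]; then
    echo "jar not found: $jar" >&2
    return 1
  fi
  local lib_dir="$plugins_root/twitter-streaming/lib"
  # Create the plugin's lib directory if needed, then copy the jar into it.
  mkdir -p "$lib_dir" && cp "$jar" "$lib_dir/"
}
```

For example: install_plugin_jar target/flume-sources-1.0-SNAPSHOT.jar, run with sufficient permissions. Maven normally writes the packaged jar under target/, but check where your build actually put it.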
