Anant Corporation Blog

Our research, knowledge, thoughts, and recommendations about building and leading businesses on the Internet.

Tag Archives: data processing


DC Data Wranglers: It’s a Balloon! A Blimp! No, a Dirigible! Apache Zeppelin: Query Solr via Spark

I had the pleasure this past Wednesday of introducing Eric Pugh (@dep4b) to the Data Wranglers DC Meetup group. He spoke about using Solr and Zeppelin in data processing and, specifically, the ways big data can easily be processed and displayed as visualizations in Zeppelin. He also touched on Docker, an application Anant uses, and its role in setting up environments for data processing and analysis. Unfortunately, no actual blimps or zeppelins were seen during the talk, but the application of data analysis to events they usually fly over was covered last month during a discussion about Spark, Kafka, and the English Premier League.

 

Rather than completely rehash Eric’s presentation here, I encourage you to check out his material for yourself (available below). In short, he showed how multiple open-source tools can be used to process, import, manipulate, visualize, and share your information. More specifically, Spark is a fast data processing engine which you can use to prepare your data for presentation and analysis, whereas Zeppelin is a mature, enterprise-ready application, as shown by its recent graduation from Apache’s Incubator Program, and is a great tool to manipulate and visualize processed data.

 

 

 

Please don’t hesitate to reach out with any questions or if you are interested in participating in or speaking at a future Data Wranglers DC event. Each event is recorded, livestreamed on the Data Community DC Facebook page, and attended by 50 or more individuals interested in data wrangling, data processing, and the possible outcomes of these efforts. After the monthly event, many members continue their discussions at a local restaurant or bar.

 

I hope to see you at an event in the near future!

How to use Kafka to understand the English Premier League

Here at Anant we are very interested in data wrangling (aka data munging), which basically means we want to help people take data in one format and convert it to a form that best suits their needs. One way we keep up to date is through the excellent Data Wranglers DC group that meets monthly here in Washington.

 

At the most recent meeting, the group tackled the challenge of integrating real-time video and data streams. Mark Chapman, a Solutions Engineering Manager at Talend, explained how his company used Spark and Kafka in their product to analyze real-time data in the English Premier League (EPL). The video inputs, captured at 25 frames per second by cameras throughout the stadium, were correlated with data on players’ heart rates and other measurements. The EPL is then able to overlay this information onto replays to improve presentation and analysis, as well as send data to companies offering in-game wagers.

 

The presentation was very interesting and Mark graciously shared his slides:

 

If you are in any way interested in data wrangling – just like it sounds, getting data under control and to work for you – we would love to hear from you and let you know what might be possible with your data streams. If you are in DC and are interested in the technical side of data munging, please come out to the next event and meet us. This past presentation was hosted by ByteCubed (@bytecubed), in Crystal City, but the gatherings have been in Foggy Bottom as well.

Asynchronous Data Processing

The state of the world’s information systems is changing, and so should your data processing habits. As the cloud takes precedence in IT environments, the different systems that run the modern enterprise are no longer on the same network or system. These systems hold data that business users need to leverage on a day-to-day, and sometimes immediate, basis.

As big data evolves, we have seen a movement from batch processing to micro-batches to stream processing. All of this is great, but folks still need to connect these systems across the internet somehow to access the data.
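To make the difference between batch, micro-batch, and stream processing concrete, here is a minimal sketch in plain Python. It is only an illustration under simple assumptions: the function names and the list of numbers standing in for incoming events are hypothetical, and real engines such as Spark handle scheduling, state, and failures in far more involved ways.

```python
from itertools import islice
from typing import Iterable, Iterator, List


def batch_total(events: List[int]) -> int:
    """Batch: wait for the complete dataset, then process it in one pass."""
    return sum(events)


def micro_batch_totals(events: Iterable[int], size: int) -> Iterator[int]:
    """Micro-batch: process small fixed-size groups as each one fills up."""
    it = iter(events)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield sum(chunk)


def stream_totals(events: Iterable[int]) -> Iterator[int]:
    """Streaming: update the running result after every single event."""
    running = 0
    for event in events:
        running += event
        yield running


# Hypothetical sample data standing in for incoming events.
events = [3, 1, 4, 1, 5, 9, 2]

print(batch_total(events))                  # one answer, after all data has arrived
print(list(micro_batch_totals(events, 3)))  # one answer per group of three events
print(list(stream_totals(events)))          # one answer per event
```

The trade-off is latency against overhead: the batch version answers once at the end, while the streaming version answers continuously but does work on every event.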

 


This presentation on data processing was delivered by our CEO, Rahul Singh, to the Data Wranglers DC Meetup at The George Washington University. It outlines the challenges businesses currently face and explains why asynchronous processing is the way to manage the growing sources and volume of business information.
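As a rough illustration of what asynchronous processing means in practice, the sketch below uses Python's standard asyncio.Queue to decouple a producer from a consumer. It is not the architecture from the presentation; the record names, the run_pipeline helper, and the upper-casing "work" are all assumptions made for the example.

```python
import asyncio
from typing import List, Optional


async def producer(queue: "asyncio.Queue[Optional[str]]", records: List[str]) -> None:
    """Push records onto the queue without waiting for them to be processed."""
    for record in records:
        await queue.put(record)
    await queue.put(None)  # sentinel: tells the consumer there is no more work


async def consumer(queue: "asyncio.Queue[Optional[str]]", results: List[str]) -> None:
    """Pull records off the queue and process them at the consumer's own pace."""
    while True:
        record = await queue.get()
        if record is None:
            break
        await asyncio.sleep(0)  # stand-in for slow I/O such as a remote API call
        results.append(record.upper())


async def run_pipeline(records: List[str]) -> List[str]:
    # A bounded queue applies back-pressure if the producer gets too far ahead.
    queue: "asyncio.Queue[Optional[str]]" = asyncio.Queue(maxsize=2)
    results: List[str] = []
    await asyncio.gather(producer(queue, records), consumer(queue, results))
    return results


# Hypothetical business records used only for this example.
processed = asyncio.run(run_pipeline(["order", "invoice", "payment"]))
print(processed)
```

The point of the queue is that neither side needs to know how fast the other is working, which is the property that matters when the systems involved sit on different networks.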

The discussion outlined four main points:

• Thoughts on why “Asynchronous” is the Future

• Discussion about Batch, Micro-Batch, Streaming

• Difference between a Queue and an Enterprise Service Bus

• Proposed Architecture for Asynchronous Data Processing

 

Take a look at the slides presented below.

 

Data Wranglers DC is a professional group that meets monthly to discuss topics including open data, data gathering, data munging, and the creation, storage and maintenance of datasets. We combine presentations with hands-on workshops, always seeking to make our data munging lives easier. No experience necessary – just bring your interest.