Anant Corporation Blog

Our research, knowledge, thoughts, and recommendations about building and leading businesses on the Internet.

Tag Archives: big data


DataStax / Cassandra Data Modeling Strategies

Common Problems in Cassandra Data Models

In our opinion, Cassandra is one of the best NoSQL database technologies we’ve used for high-availability, large-scale, high-speed business platforms. More specifically, we work with the DataStax Enterprise version of Cassandra for clients above a certain size who need enterprise-grade support 24/7, 365 days a year, with expertise around the world. There are many topics I could have written about for my first “Cassandra” post on our blog, but I decided to write about what I call the three stooges of Cassandra data modeling: Larry (Tombstones), Curly (Data Skew), and Moe (Wide Partitions).
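
To make the wide-partition stooge concrete, here is a minimal sketch in Python using the DataStax cassandra-driver. The keyspace and sensor-readings schema are hypothetical, not from any client system; the point is how moving a time bucket into the composite partition key bounds partition growth (Moe) and spreads a hot sensor’s writes so it doesn’t skew one node (Curly).

```python
# Minimal sketch (hypothetical schema): bounding partition growth in Cassandra.
from cassandra.cluster import Cluster  # pip install cassandra-driver

cluster = Cluster(["127.0.0.1"])       # assumes a local Cassandra/DSE node
session = cluster.connect()

session.execute("""
    CREATE KEYSPACE IF NOT EXISTS demo WITH replication =
        {'class': 'SimpleStrategy', 'replication_factor': 1}
""")

# Problem shape: every reading for a sensor lands in one ever-growing
# partition, a wide partition waiting to happen.
session.execute("""
    CREATE TABLE IF NOT EXISTS demo.readings_wide (
        sensor_id    text,
        reading_time timestamp,
        value        double,
        PRIMARY KEY (sensor_id, reading_time))
""")

# Better shape: a day bucket in the composite partition key caps partition
# size and spreads one hot sensor's writes across the cluster.
session.execute("""
    CREATE TABLE IF NOT EXISTS demo.readings_by_day (
        sensor_id    text,
        day          date,
        reading_time timestamp,
        value        double,
        PRIMARY KEY ((sensor_id, day), reading_time))
""")
```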


What Makes a Good ETL Project?

Bad

  1. Bad ETL (extract, transform, load) projects are ones that don’t have a strategy for different types of information, or that lack knowledge management around how to add or remove data sources, processors and translators, and sinks of information.
  2. An ETL project doesn’t have to be on any particular platform; it just needs structure, as any software should have: an architecture (see the sketch after this list).
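
Here is a minimal sketch of that structure in Python. The interfaces and names are hypothetical, not tied to any platform; the point is that sources, processors/translators, and sinks are small, swappable pieces rather than one tangled script.

```python
# A minimal sketch of a pluggable ETL structure: sources, processors, sinks.
# All names here are hypothetical, not from any particular platform.
from typing import Callable, Iterable

Record = dict
Source = Callable[[], Iterable[Record]]
Processor = Callable[[Record], Record]
Sink = Callable[[Record], None]

def run_pipeline(source: Source, processors: list[Processor], sink: Sink) -> None:
    """Extract from one source, apply each processor in order, load to one sink."""
    for record in source():
        for process in processors:
            record = process(record)
        sink(record)

# Adding or removing a source, processor, or sink is a one-line change:
fake_rows: Source = lambda: [{"name": " Ada "}, {"name": "Grace"}]
trim: Processor = lambda r: {**r, "name": r["name"].strip()}
print_sink: Sink = print

run_pipeline(fake_rows, [trim], print_sink)
```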


Data Wrangling & Visualization of Public / Government Data

Rahul Singh, CEO of Anant, had the opportunity to co-organize and MC the May Meetup of Data Wranglers DC, where the speakers, John Clune and Timothy Hathaway, covered two topics on processing and visualizing open government and public data. We had a great turnout at the event and the chance to do some networking afterward.


The Swarm of Sources

Reactive Manifesto, The Next VisiCalc, and Future of Business Technology

Thanks to some of our great partnerships, our firm has recently consulted at the University of Michigan, Cisco, Intuit, Kroger, and several government agencies on business information and enterprise technology. Even though we don’t directly create consumer technology or applications, every consumer technology eventually has a backend enterprise technology that makes it work, and a consumer technology company backed by crappy enterprise technology is bad for business.

I’ve been sensing a shift in business information for a while. Business information, the frequency at which it’s created, and the number of sources it comes from are all increasing, exponentially by some measures. This means that businesses, and subsequently end users, need to rely on real-time processing and analysis of this information. The businesses that embrace the “reactive manifesto” for building software and technology are going to succeed in a new world where data comes from millions of people through their mobile devices, from processes through applications and software, from information through global data sources and APIs, and from systems in the form of servers and things all over the globe. The “swarm” of sources is mind-boggling.

The first business response to all this business information is: let’s bring it all together to analyze it and visualize it. That’s horseshit. Even with the big data technologies out there today, it is wasteful to try to process all of it at the same time. That’s like trying to understand how the universe works at every second. The better response is to understand what’s happening and react to it in the moment, in the context where it matters.
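
As a toy illustration of reacting in the moment, here is a hedged Python sketch. The event shape, sources, and threshold are all made up; the point is that the pipeline acts on the few events that matter as they arrive, rather than hauling everything into one warehouse first.

```python
# A toy sketch of "react in the moment" versus "analyze everything later".
import random
from itertools import islice
from typing import Iterator

def event_stream() -> Iterator[dict]:
    """Stand-in for the swarm: people, processes, information, systems."""
    while True:
        yield {"source": random.choice(["mobile", "app", "api", "server"]),
               "value": random.random()}

# Consume a slice of the stream and react only when an event matters,
# instead of storing all of it for a giant batch analysis afterward.
for event in islice(event_stream(), 1000):
    if event["value"] > 0.99:  # hypothetical "this matters" threshold
        print(f"reacting to {event['source']}: {event['value']:.3f}")
```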

This reactive methodology of building large infrastructure can help businesses respond to new IoT initiatives, integrate with the numerous business software packages that run the modern enterprise, and partner with other modern enterprises. Whatever you see out there in apps, devices, sites, and APIs has to be managed in the back. The case for silicon brains is strongest when you just can’t do it with carbon brains. Technology has to get better, faster, through iterative machine learning in order to keep up with the amount of data being created.

Commercial organizations are being handed sledgehammers by vendors such as Oracle, Cloudera, MapR, Databricks, etc. Although these products are great, they are more like personal computers without the real “killer app”: they aren’t solving industry-specific, vertical problems. Consulting companies waste inordinate time-and-materials costs getting it “right.” What people need is “lego block” software, so that non-technical folks can self-serve their information needs without hiring a data analyst, data engineer, data architect, data scientist, data visualizer, and of course a project manager. (If you do need a team today, Anant provides an elastic team with all of those skills for the same monthly investment as a part-time or full-time employee. Message me or my team.)

I believe the major breakthrough that will change the experience for business technology users is going to be system design tools that help them get what they want without knowing how to program. I don’t know what it will look like, but we need a VisiCalc for the new age, and no, it’s not Google Spreadsheets. It’s something else altogether: something that satisfies the yearning for a tool that helps people mash up and leverage the various gradients between structured and unstructured data in dynamic knowledge pages that always keep us up to date on what we care about. A machine that learns what we need to know and summarizes it for us, but also lets us manipulate that knowledge even when it is being created in ten different systems.

DC Data Wranglers: It’s a Balloon! A Blimp! No, a Dirigible! Apache Zeppelin: Query Solr via Spark

I had the pleasure this past Wednesday of introducing Eric Pugh (@dep4b) to the Data Wranglers DC Meetup group. He spoke about using Solr and Zeppelin in data processing, and specifically the ways big data can easily be processed and displayed as visualizations in Zeppelin. He also touched on Docker, an application Anant uses, and its role in setting up environments for data processing and analysis. Unfortunately, no actual blimps or zeppelins were seen during the talk, but the application of data analysis to the events they usually fly over was presented last month in a discussion about Spark, Kafka, and the English Premier League.

Instead of my trying to completely rehash Eric’s presentation, please check out his material for yourself (available below). In short, he showed how multiple open-source tools can be used to process, import, manipulate, visualize, and share your information. More specifically, Spark is a fast data processing engine you can use to prepare your data for presentation and analysis, while Zeppelin, a mature, enterprise-ready application (as shown by its recent graduation from Apache’s Incubator program), is a great tool for manipulating and visualizing processed data.
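
For a flavor of what a “query Solr via Spark” Zeppelin paragraph might look like, here is a hedged PySpark sketch. It assumes the open-source spark-solr connector is on the classpath, and the ZooKeeper address, collection, and field names are hypothetical placeholders, not taken from Eric’s slides.

```python
# A hedged sketch of querying Solr from Spark, roughly what a Zeppelin
# paragraph might run (assumes the spark-solr connector is available).
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("solr-via-spark").getOrCreate()

matches = (spark.read.format("solr")              # spark-solr data source
           .option("zkhost", "localhost:9983")    # SolrCloud ZooKeeper host
           .option("collection", "epl_matches")   # hypothetical collection
           .option("query", "season:2016")        # standard Solr query syntax
           .load())

# Aggregate in Spark; in Zeppelin the result renders as a table or chart.
matches.groupBy("home_team").count().orderBy("count", ascending=False).show(10)
```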

Please don’t hesitate to reach out with any questions, or if you are interested in participating in or speaking at a future Data Wranglers DC event. Each event is recorded, livestreamed on the Data Community DC Facebook page, and attended by 50 or more people interested in data wrangling, data processing, and the possible outcomes of these efforts. After the monthly event, many members continue their discussions at a local restaurant or bar.

I hope to see you at an event in the near future!