Anant Corporation Blog

Our research, knowledge, thoughts, and recommendations about building and leading businesses on the Internet.

What Makes a Good ETL Project?

Bad

  1. Bad ETL (extract, transform, load) projects are ones that don’t have a strategy for handling different types of information, or that lack knowledge management around how to add/remove data sources, add/remove processors and translators, and add/remove sinks of information.
  2. A project doesn’t necessarily have to be on any particular platform; it just needs structure, as any software should have: an architecture.


Good

  1. Simple systems that separate E / T / L into composable blocks that are scriptable or configurable.
  2. Compiled systems are good too if the volume is extreme.
  3. A good bash pipeline is just as good as anything else, as long as it’s well documented (see the sketch after this list).
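
As an illustration, here is a minimal sketch of a composable E / T / L pipeline in bash, where each stage is a separate, replaceable step (the URL, fields, and database below are hypothetical):

# Extract: pull raw CSV from a source system
curl -s https://example.com/orders.csv > raw.csv

# Transform: keep only shipped orders and project two columns
awk -F',' '$3 == "shipped" { print $1 "," $2 }' raw.csv > clean.csv

# Load: copy the cleaned file into a warehouse table
psql -d warehouse -c "\copy orders(id, amount) FROM 'clean.csv' CSV"

Because each stage reads and writes plain files, any one of them can be swapped out, rerun, or tested on its own.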


Ugly

  1. Using ESB (Enterprise Service Bus) for ETL.
  2. Using Spark for ETL.
  3. Basically, using tools that have advanced features for business logic to do simple transformations that don’t belong in those computing environments: conjoining simple message delivery (ETL) with advanced message delivery (ESB) or advanced computing (Spark).


Why should an organization undertake such a project?

To meet business goals. Sometimes it’s to gain intelligence, sometimes it’s to create data products that deliver value, sometimes it’s to make predictions.

It comes down to how the project affects the organization’s perpetuity. Some of the questions a business should be able to answer are:


What other solutions provide the same end user results?

Tools like Domo, Tableau, or more recently Periscope (in the SaaS world) can be useful for gaining basic insights without having to do ETL, if the data is ready. Open source tools such as Kibana, Metabase, and Redash can be used as well, as long as the data is available.


What are the trade-offs between the various solutions?

Ultimately, if the data isn’t ready, ETL may be required to get it clean enough for those tools to let users visualize and explore it properly.

Data Wrangling & Visualization of Public / Government Data

Rahul Singh, CEO of Anant, had the opportunity to co-organize and MC the May Meetup of Data Wranglers DC, where the speakers, John Clune and Timothy Hathaway, covered two topics related to processing and visualizing open government and public data. We had a great turnout at the event and had the chance to do some networking afterward.

Our next two meetups will focus on Search & Knowledge Management (June Meetup) and Machine Learning for Data Processing (July Meetup). Check out the Meetup page for more details as they become available.

A big thank you to John Clune and Timothy Hathaway for taking the time to present to the group. If you have any interest in speaking, please don’t hesitate to reach out to Rahul at rahul@anant.us.


Below you can find a recording of both presentations.


How to Set Up a Drupal Website Connected to a SearchStax Deployment

Basic Steps:

  1. Create a new Deployment in SearchStax
  2. Upload a Custom Configuration to Your Solr Server
  3. Install Drupal
  4. Install Search API / Search API Solr plugin
  5. Configure the Search API Solr Plugin
  6. Add sample content
  7. Optional – Manually Index Your Site

Step 1: Create a New Deployment in SearchStax

Assuming you have already created a SearchStax account and do not already have a deployment set up, click on the Deployments tab and then click on the Add Deployment button at the top.  Enter a Deployment name, and select the most appropriate Region, Plan, and Solr Version for your needs.  In this example we will be using Solr Version 6.4.2.

Once you create your Deployment, you will see it in the Deployments dashboard.

Clicking on the arrow button on the right of the deployment will give you pertinent information about your deployment’s servers.  The Solr Load Balancer URL will bring you to your Solr server dashboard.

Step 2: Upload a Custom Configuration to Your Solr Server

Download the Search API Solr plugin files: https://www.drupal.org/project/search_api_solr

Included in the Search API Solr download are several configurations in the solr-conf folder, with subfolders 4.x, 5.x, and 6.x for the respective Solr versions.  

SearchStax uses Apache ZooKeeper for maintaining configuration information.  Upload the appropriate configuration files via ZooKeeper and create a new collection.  If you already have your ZooKeeper script, the two commands you will need are as follows:


Upload Configuration:

Linux:

zkcli.sh -zkhost <zookeeper URL> -cmd upconfig -confdir <Solr configuration> -confname <configuration name>

Windows:

zkcli.bat -zkhost <zookeeper URL> -cmd upconfig -confdir <Solr configuration> -confname <configuration name>


Create New Collection:

Linux:

curl '<Load Balancer URL>admin/collections?action=CREATE&name=<collectionName>&collection.configName=<configName>&numShards=1'

Windows:

curl.exe "<Load Balancer URL>admin/collections?action=CREATE&name=<collectionName>&collection.configName=<configName>&numShards=1" -k
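
For illustration, here is what the two commands might look like with hypothetical values filled in (a made-up ZooKeeper host, a configuration named drupalconfig, and a collection named drupalsearch):

zkcli.sh -zkhost ss123456.searchstax.com:2181 -cmd upconfig -confdir search_api_solr/solr-conf/6.x -confname drupalconfig

curl 'https://ss123456.searchstax.com/solr/admin/collections?action=CREATE&name=drupalsearch&collection.configName=drupalconfig&numShards=1'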


Detailed instructions on uploading a new configuration and creating a new collection for a Solr deployment can be found here: https://www.measuredsearch.com/docs/


Once complete, go into your Solr dashboard. There will be a newly created core based on the collection name you defined when creating the collection. Make note of this core name.

Step 3: Install Drupal

If you do not already have one, there are many ways to create your own Drupal instance.  Some web hosting services offer specialized integrations and setup options that help streamline the process.  Here we used GoDaddy’s Cloud Server services, which in a few clicks can create a hosted Drupal website.

Step 4: Install Search API / Search API Solr plugin

Go to your Drupal website and log in.


Open up a new tab in your web browser, then go to: https://www.drupal.org/project/search_api.  Once there, scroll down to find the different download links.  Right-click on the appropriate download link to the compressed file, and copy the link address.

In your Drupal site, either navigate to the /admin/modules page or click the Extend tab at the top.  Then click Install New Module.

Paste the link address copied earlier into the “Install from a URL” text field, and then click the Install button.

Once complete, you should see a confirmation message saying that the installation was complete.

If you see this message, you can install the Solr plugin for Search API:

https://www.drupal.org/project/search_api_solr.  Install this in the same way as above.


Before continuing, you may need to install the Search API Solr composer dependencies.  See https://www.drupal.org/documentation/install/composer-dependencies for instructions on how to do this.
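
If your site is managed with Composer, one common approach (a sketch, run from the Drupal root and assuming Composer is installed) is to require the module, which pulls in its PHP dependencies such as the Solarium client library:

composer require drupal/search_api_solr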


Next, you will need to enable the installed modules.  Click “Enable newly added modules”, or click on the Extend tab.  Scroll down to the section called Search to see new module settings.

Enable the following items: Search API, Solr search, and Solr Search Defaults.

Then click Install at the bottom of the page. You should see a confirmation message.

Once complete, it is recommended to uninstall the Solr Search Defaults module, as it may affect performance.  Uninstalling the module will not remove the provided configurations.
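
If you have Drush available, the same modules can also be enabled, and the defaults module later uninstalled, from the command line. A sketch, assuming the usual module machine names (verify them on the Extend page):

drush en search_api search_api_solr search_api_solr_defaults -y
drush pmu search_api_solr_defaults -y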

Step 5: Configure the Search API Solr Plugin

Now that the modules have been enabled, click on the Configuration tab.  Look for section “SEARCH AND METADATA” and click on Search API to configure it.  

Once there, click Add Server.

Give your server a name, make sure the Backend is set to Solr, and configure your Solr backend.  Check that the HTTP protocol is set to https, that the Solr host and Solr port are correct, and that the Solr core is set to the core created after uploading your Solr configuration.

If your configuration settings are valid, you will see a message saying that the information was saved successfully.

Next you will need to define an Index.  In the Search API configuration screen, click on Add Index.

Give your index a name, and select the Data sources you wish to index.  For this example, select Comment and Content.  Also, at the bottom of the page make sure that you select the Server created earlier.

Step 6: Add Sample Content


Install the Devel plugin using the same method as described in Step 4: https://www.drupal.org/project/devel.  Then, enable Devel and Devel generate.

You will see new options in multiple menus.  Go to Manage > Configuration, and scroll down to the Development section.  Here you have options to Generate content via Devel.

Click on Generate Content.

Select a Content Type and enter the number of nodes.  For example, selecting Article and entering “20” will produce 20 new articles filled with dummy data.

Step 7: Optional – Manually Index Your Site

A cron job will periodically index your site automatically, but if you want to see your results immediately, go to the Search API configuration screen and click on the index created earlier.  At the bottom, click on the Index now button.
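
If you prefer the command line, the Search API module also provides Drush integration; depending on your Drush and module versions, indexing all pending items looks roughly like this:

drush search-api-index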

After you begin, you will see a progress bar.  Once it reaches 100%, you will get a Success message.

After your site has been indexed, you can view and query the data in Solr.
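
For example, you can query the indexed documents directly against Solr’s select handler, using the core name you noted in Step 2:

curl '<Load Balancer URL><core name>/select?q=*:*&rows=5&wt=json'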

Congratulations!  You have now customized your Drupal website to allow for content to be indexed in your SearchStax Solr deployment.

Software Algebra – Building Applications Without Reinventing the Wheel

As technology has matured over the last two decades, many challenges have been overcome, obstacles faced, and solutions crafted. A recurring theme among those obstacles (more specifically, the self-imposed ones) has been the propensity of software companies to either 1) develop applications from the ground up for a particular problem, or 2) take existing software that is perfectly fine for its intended use case and tailor it to a different (if sometimes slightly similar) use case.


Software Algebra is essentially a best practice in software development: make sure that we are 1) using the tools best suited to a particular problem, while 2) dodging the trap of reinventing the wheel, whether by starting from scratch or by trying to fit a tool to a problem it was never intended to address.


There are multiple cases in which this best practice is entirely ignored, most commonly by inexperienced software architects with a “my hammer can solve all problems” mindset. Oftentimes, one of the best ways to avoid this trap is to relentlessly focus on getting a Minimum Viable Product (MVP) out the door in a time-boxed period, then iterating on that MVP to steadily bring it up to support all use cases.


We recently spoke about this topic at the WebTech Conference in Washington, DC, and will be doing so again on Tuesday, April 11th at 6 PM at the University of Maryland, Baltimore County. You can find additional details and register for the event here.

WebTech Conference – Software Algebra

Rahul Singh will be presenting on Thursday, March 30th at WebTech on the topic of Software Algebra. He’ll be speaking about the ways online software (SaaS) and open source applications can work in tandem with web and mobile applications to deliver powerful business results.


From this presentation, you will learn how to plan and build web apps that support business processes using existing software. The approach is a very practical one: take inventory of business teams, processes, information, and systems, and create future-proof web systems without reinventing the wheel.


Thank you, Iron Yard DC, for hosting us! Register and view the schedule of speakers here. We look forward to seeing you.