Blog – Talend Real-Time Open Source Data Integration Software

#BigDataWithTalend: Leveraging Telematics Data in the Insurance Industry


We asked Hadoop Summit attendees to share their big data story.

In this interview, Sean Creagan provides insight on big data and its business value for his insurance company.

A few highlights from this video include:

  • Provide value to business users with big data
  • Leverage telematics data utilizing Hadoop
  • Reduce costs with Hadoop instead of working with traditional data platforms

More videos are available on the Talend Channel.

Yves

 


#BigDataWithTalend: Discover Supplements ETL with Big Data


We asked Hadoop Summit attendees to share their big data story.

In this short sequence, Alex Marshall of Discover Financial Services provides insight on reducing costs with the help of big data.

A few highlights from this video include:

  • Leverage unstructured data more efficiently
  • Replace ETL with Apache Pig, Hive and Sqoop
  • Pull data from RDBMS to HDFS, parse in parallel and put back into the RDBMS

Check out this short video to learn how Discover is reducing costs with big data!

More videos are available on the Talend Channel.

Yves

 

A Short Summary of Talend Big Data, Live from the Pacific Northwest BI Summit


Few vendors are invited by Scott Humphrey to this annual gathering of data gurus – and for the past six years, I have had the privilege to represent Talend.  As posts from previous years attest, this is an event rich in discussions and brainstorming. This is also a unique event where we get the opportunity to sit down with analysts for fireside discussions (except that there is no need for a fire in Oregon in July).

The podcast below, recorded with Claudia Imhoff from BBBT, highlights the journey of customers into big data, and how Talend is supporting this journey. Claudia and I discuss the evolving maturity of the big data market, from the sandbox into productive use and then toward operational and real-time big data. We also review real-life use cases, and discuss Talend’s latest adoption and learning tool: the Talend Big Data Sandbox!

Listen to the podcast:

Yves

 

MDM for Material Data (MDM Summer Series Part 2)


In this “summer series” of posts dedicated to Master Data Management for Product Data, we walk through what we identified as the five most frequent use cases of MDM for product data.

In this post, we focus on MDM for Material Data. It aims to centrally manage information about spare parts, raw materials and final products, and then to share this trusted, unified view across organizations, processes and information systems.

MDM for Material Data is seen in industries that engineer, procure, manufacture, store, sell and configure products. Typically, this is what defines the manufacturing industry. Material refers to potentially any inventory item that can be uniquely identified by an SKU (Stock Keeping Unit), from raw materials to finished goods, so the focus is not, or not only, the customer-facing side of the product.

One could think that most companies have already established a single point of control to manage material data across their business processes, typically in their ERP. But the reality is that few companies have consolidated all their processes in a single ERP instance. The others may have a specific system (or systems) and related processes for Research and Development; or for CRM, sales and distribution; or for procurement, maintenance, etc. And they may have different systems and processes across lines of business or geographies as well. This heterogeneity tends to increase over time, because of mergers and acquisitions, business processes that are being outsourced or insourced, or IT systems that expand at the periphery of the legacy IT rather than at its core, as illustrated by the current rise in adoption of cloud solutions.

This results in inconsistent and replicated views of product data, with cascading process inefficiencies across the product life cycle, which explains why many companies are engaging in MDM for Material Data initiatives. Having a shared, up-to-date and accurate view of products and their characteristics is a foundation for efficiency. This starts at the early steps of the product lifecycle, where new product characteristics have to be shared beyond the Research & Development team with other activities such as Procurement, Marketing, Supply Chain, or Accounting and Controlling. Reduced time to market when launching new products, or improved parts reuse to cut procurement or manufacturing costs, are examples of the business benefits targeted.

Then, once the product is ready for distribution, strong alignment is needed between numerous stakeholders across sales, logistics, finance, and manufacturing, especially in the case of build-to-order products. Inventory optimization and cost reduction, optimized spend management when negotiating procurement contracts, and improved visibility into procurement and sourcing are then expected. Although those steps may be covered by an ERP, organizations may want to expand beyond its scope. For example, many organizations are looking for more integrated, agile and synchronized planning, a process that has often been conducted outside of the ERP, sometimes in a poorly coordinated way through ad-hoc spreadsheets or disparate planning applications. Last but not least, the after-sales process is a step in the product lifecycle that has its own supply chain, which may have received little attention for optimization in the past. As a result, Maintenance, Repair and Operations may even be seen as a use case on its own, one that we will further explore in the next post of this series.

IT agility can be another strong business driver for MDM for Material Data, due to the increasing costs of data migration and data quality management, together with a limited ability to reap the benefits of innovation at the right pace. Enterprise agility is another potential benefit, especially in the case of mergers and acquisitions, to reap the benefits of a rapid consolidation of activities that rely on the efficient procurement and distribution of items across the supply chain.

Jean-Michel

 

#BigDataWithTalend: Customer Service Analysis with Hadoop


We asked Hadoop Summit attendees to share their big data story.

In this short sequence, Therian Webb of ClickFox provides insight on customer service data analysis.

A few highlights from this video include:

  • Load and connect data from disparate sources
  • Follow up on billing, identify difficulties and improve customer satisfaction through Hadoop
  • Help companies succeed in servicing their customers better through digital and social channels

Check out this short video to learn how ClickFox is improving customer satisfaction with big data!

More videos are available on the Talend Channel.

Yves

 

MDM for Lean Managed Services (MDM Summer Series Part 3)


In this “summer series” of posts dedicated to Master Data Management for Product Data, we walk through what we identified as the five most frequent use cases of MDM for product data.

This post focuses on MDM for Lean Managed Services, a use case that we are seeing in companies that operate an infrastructure composed of a large number of pieces of equipment. For example, this can be a facility manager that operates a set of devices to deliver IT or network capabilities to its customers; or a utility provider that manages a networked grid; or a provider of Maintenance, Repair and Operations related services. This also applies to enterprises that provide support and services for their product line.

Those businesses, whether they are insourced as shared services inside a company or provided as an outsourced business process to a customer of the enterprise, are under pressure to reduce costs and add value to the core business of their client organization whenever possible. They need a single view of the equipment that they manage, which often happens to be disparate legacy equipment. Master Data Management helps to get a shared and accurate view of the equipment, to standardize its characteristics and behavior, and in some cases even to make it actionable by orchestrating its remote operations. In that case, the business case for Master Data Management is pretty straightforward, because it can be directly linked to the reduction of the cost of service.

Take the example of the Maintenance, Repair and Operations (MRO) processes in the manufacturing industry: once companies have re-engineered their supply chain processes, they find that MRO represents a significant portion of their remaining product costs. Surveys show that, in many organizations, MRO inventory may account for a significant slice (typically 15 to 40 percent) of the annual procurement budget, generally managed with less rigor than the core manufacturing supply chain. Business benefits typically express themselves in terms of minimized parts procurement costs together with reduced inventory costs. This is also an area where enterprises can differentiate themselves from the competition by selling their product as a service and providing innovative after-sales services to their customers. And this all starts by improving inventory control, even when the product is in the hands of customers, an area where MDM for Product Data is indeed of great help.

Now that we are entering the era of the Internet of Things, where products can report on their behavior and be actioned remotely, my view is that this use case will grow dramatically and drive the market of MDM for Product Data toward innovation and next practices. General Electric is a renowned pioneer in this area, with its initiative around the Industrial Internet. It is about operating "ecosystems of connected machines to increase efficiency, minimize waste, and make the people operating them make smarter decisions". This is a data-driven initiative at the crossroads of MDM and Big Data.

Jean-Michel

 

MDM for Regulated Products (MDM Summer Series Part 4)


In this “summer series” of posts dedicated to Master Data Management for Product Data, we walk through what we identified as the five most frequent use cases of MDM for product data.

In this fourth part of the series, we focus on MDM for Regulated Products. This use case arises when products must comply with government or industry regulations. It mandates adherence to standard codifications and the exchange of information beyond the enterprise walls, to provide control and traceability over how products are sourced, tested, manufactured, packaged, documented, transported or promoted.

More and more industries need to report on the supply chain related to their products. Traceability mandates are driving increased velocity and precision, and this starts with standard codification of product attributes and characteristics.

Life Sciences is the obvious example of an industry whose products are heavily regulated, from the different phases of trials to manufacturing and distribution. But, many other industries are increasingly impacted too: from Financial Services to Agriculture, from Healthcare to Chemicals or Consumer Products, best practices for Information Governance have matured and sometimes have even reached the status of mandated standard.

The first and foremost benefit of this use case is to minimize the costs and risks of compliance. Regulated products need precise traceability of the product supply chain and the corresponding information supply chain. This generally involves many stakeholders that need access to shared product information and must update it when needed, according to their activity, under the control of a well-defined and auditable workflow.

As a result, even when those codification standards are not mandatory, the extended supply chain can reap benefits from adopting them for real-time exchange of information between business partners. The overall process then becomes more uniform, efficient and quality-proofed, and more open to continuous improvement because it is carefully and precisely controlled and monitored. Thanks to a more holistic and shared view of the product data and related processes, making informed decisions about regulated products is facilitated too. Reacting faster to alerts, because they are detected and shared earlier, is a clear business benefit as well.

Finally, the business benefits for their adoption may be as simple as staying in the business, as illustrated by a Gartner prediction: “By 2016, 20 percent of CIOs in regulated industries will lose their jobs for failing to implement the discipline of information governance successfully.”

Jean-Michel

 

Product Information Management (MDM Summer Series Part 5)


In this “summer series” of posts dedicated to Master Data Management for Product Data, we walk through what we identified as the five most frequent use cases of MDM for product data. In this post, we focus on the Product Information Management use case.

Product Information Management (PIM) is the most popular use case of MDM for Product Data. For companies that distribute off-the-shelf products, this has become a must, especially where those products are distributed across multiple channels. This is probably why this use case draws so much attention from solution vendors, even if it is applicable only in some industries.

As it is focused on customer-facing processes, business drivers are relatively straightforward to define: trusted and easily accessible product data boosts product attractiveness, and as a result it drives new customer demand, increases conversion rates, drives up sales, cross-sales or follow-up sales, etc. This is a business case I already discussed in a previous blog.

The business case for PIM is pretty well documented by consulting companies and industry analysts. Ventana Research shares very interesting benchmark data. They found that the benefits include the elimination of errors when referencing product data and sharing it with customers, cross-sell and up-sell potential, and improved customer experience. Although PIM is not new, the research shows that organizations believe they still have a long way to go on their maturity curve: only a quarter of organizations trust their PIM processes completely, while 94% use spreadsheets heavily or moderately in their processes, despite the fact that one-fifth find major errors in them frequently and another quarter find them occasionally.

In addition to improving the product induction process and boosting product attractiveness, PIM also drives the expansion of the product catalog, now that many companies are in the process of extending their product portfolio into a long tail, for example by creating online marketplaces where third-party vendors can participate. This mandates a very smooth and low-cost approach when referencing new products, or when changing their characteristics like price or availability in real time. The goal is also to ensure that product data is accurate as early as possible in the information supply chain, delegating responsibility for data quality to the source of input, sometimes at the vendor level.

Jean-Michel

 


MDM for Anything (MDM Summer Series Part 6)


In this “summer series” of posts dedicated to Master Data Management for Product Data, we walk through what we identified as the five most frequent use cases of MDM for product data.

After reviewing all the other use cases, there is what I call MDM for “anything”. This is in fact not a real facet, because the only common point of the product data that goes into this category is that it doesn’t belong to any of the aforementioned categories.

MDM for Product Data deals with the things that you produce, and possibly also the things that you use to produce them. The structure of those things, together with the processes to create, distribute and use them, may be very specific to your activity: Professional Services companies, and more generally any company that deals with large projects, have their work breakdown structures. Environmental Services have their waste types. Life Sciences companies have their compounds. And the list goes on. When those things are widely shared across activities and the enterprise is struggling with multiple data entry or reporting systems, resulting in process inefficiencies, then there is obviously a place for MDM for Product Data.

If you have found such "things" in your business that would benefit from MDM, you may feel a little lonely when searching for literature, well-documented business cases to inspire your ROI definition, or predefined models and templates. But the good news is that you should find food for thought in the other use cases anyway: the core discipline of Master Data Management still applies, so you will be able to leverage best practices in order to create your shared definition of data, manage data accuracy, establish data governance and stewardship, and connect your data sources and targets, applications, business processes and users to your master data.

Jean-Michel

 

Key Capabilities of MDM for Material Data (MDM Summer Series Part 7)


In this “summer series” of posts dedicated to Master Data Management for Product Data, we’ve walked through what we identified as the five most frequent use cases of MDM for product data. Now we are looking at the key capabilities that are needed in an MDM platform to address each of these use cases. In this post, we address MDM for Material Data.

As we will further discover through this post, strong data integration and data cleansing capabilities will be needed, together with the ability to model and maintain a uniform semantic view of the data across multiple models inherited from legacy applications and/or commercial off-the-shelf software. Standardization capabilities are important too, so that products can be easily browsed and searched, and product data can be electronically shared with business partners or regulatory authorities when applicable. Strong stewardship capabilities will be needed as well.

MDM for Material Data applies to companies that are facing inefficiencies in their supply chain because of poor accuracy, accessibility or actionability of product data. This issue is generally not new, so there are often a lot of disparate data sources to consider, and a lot of moving parts too, sometimes hundreds of thousands or even millions.

Evaluating the effort to create the single source of trusted data may not be straightforward, because the company may not know with precision the content of the data it will have to deal with.

As a result, the MDM solution must enable progressive implementation. It may just start with gathering the data and assessing it with a data profiling environment before even starting to model the master data. Strong data quality management, including profiling, standardization, categorization and matching, together with flexible modeling, are therefore key capabilities. The standardization and categorization capabilities are of particular importance, because they drive the accessibility of the product data: they enable hierarchy browsing, faceted search or search by synonyms. In some industries, codification standards exist to ease the exchange of product information between business partners or to ensure regulatory compliance; your MDM solution should then support those standards, which are based on relatively complex, well-defined XML schemas.
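To make the matching capability a little more concrete, the sketch below is a minimal, hypothetical Java example that flags probable duplicate material descriptions using a normalized edit distance. It only illustrates the principle: real MDM matching engines combine far richer techniques (phonetic keys, token-based comparison, survivorship rules), and the normalization steps and the 0.85 threshold are arbitrary assumptions made for the example.

```java
import java.util.Locale;

/** Minimal sketch of fuzzy matching between two material descriptions. */
public class MaterialMatcher {

    /** Normalize a raw description: trim, lower-case, collapse whitespace. */
    static String normalize(String raw) {
        return raw.trim().toLowerCase(Locale.ROOT).replaceAll("\\s+", " ");
    }

    /** Classic Levenshtein edit distance between two strings. */
    static int levenshtein(String a, String b) {
        int[][] d = new int[a.length() + 1][b.length() + 1];
        for (int i = 0; i <= a.length(); i++) d[i][0] = i;
        for (int j = 0; j <= b.length(); j++) d[0][j] = j;
        for (int i = 1; i <= a.length(); i++) {
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                d[i][j] = Math.min(Math.min(d[i - 1][j] + 1, d[i][j - 1] + 1),
                                   d[i - 1][j - 1] + cost);
            }
        }
        return d[a.length()][b.length()];
    }

    /** Similarity in [0,1]; 0.85 is an arbitrary illustrative threshold. */
    static boolean probableDuplicate(String desc1, String desc2) {
        String a = normalize(desc1), b = normalize(desc2);
        int maxLen = Math.max(a.length(), b.length());
        if (maxLen == 0) return true;
        double similarity = 1.0 - (double) levenshtein(a, b) / maxLen;
        return similarity >= 0.85;
    }

    public static void main(String[] args) {
        // Prints "true": the two descriptions differ only in case and spacing.
        System.out.println(probableDuplicate(
            "Hex bolt  M8x40 stainless", "HEX BOLT M8 x40 STAINLESS"));
    }
}
```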

Data integration is also an important capability for this use case. An MDM implementation often starts with a so-called consolidation style, where the data flows from the operational systems to the MDM in batch mode; the trusted data can then flow back to the source systems, also in batch. Some companies prefer to start with a model where the MDM is connected in real time, in a bidirectional way, to the source applications, but the prerequisite is that those existing applications have well-defined access points, such as web services, that can be reused for the MDM project. This may be the case for companies that have achieved high maturity in adopting strong architectural standards for application interoperability across their IT landscape. Otherwise, the batch mode should be an easier, less intrusive and faster way to connect applications and start your MDM for Material Data initiative.

In any case, real time or batch, integration is an important building block, especially with regard to integration with ERP and PLM solutions. Because the data integration side represents a significant portion of the development effort needed to roll out and maintain this use case, the capabilities of your MDM solution in this area will significantly impact the total cost of ownership of the platform. Note that the integration capability will be of particular value for companies that see their business applications progressively moving to the cloud. Analysts such as Ventana Research have shown that cloud tends to increase the need for efficient integration capabilities: for example, to fulfill a customer order, different cloud applications may be used, from CRM to billing, across warehouse management and distribution. MDM for Material Data would then be a key component to provide a uniform and shareable product view across those applications, and its ability to connect easily and efficiently to those external applications would be key.

The data stewardship capabilities are important to consider too. In order to be implemented incrementally, MDM for Material Data should have minimal impact on the way material data is authored in the existing ERP and/or PLM applications. The goal is generally not to re-engineer those steps, but to add data governance capabilities to ensure data accuracy, trustability and accessibility. As a result, data quality issues linked to reconciliation, accessibility, categorization and standardization of data have to be managed a posteriori in this use case, which will require data stewardship involvement on an ongoing basis.

Continued in Part 8.

Jean-Michel

 

Key Capabilities of MDM for Lean Managed Services (MDM Summer Series Part 8)


In this “summer series” of posts dedicated to Master Data Management for Product Data, we’ve walked through what we identified as the five most frequent use cases of MDM for product data. Now we are looking at the key capabilities that are needed in an MDM platform to address each of these use cases. In this post, we address MDM for Lean Managed Services, which is about creating and orchestrating a standardized and unified infrastructure from a heterogeneous landscape of legacy assets.

This use case is about creating a uniform and actionable view of a network of equipment. It generally starts with a very heterogeneous set of legacy systems. It may even happen that the source systems are nothing more than spreadsheets, and that the resulting system will replace them rather than coexist with them, requiring strong and rapid-to-implement data migration capabilities. The data quality component of your MDM solution will then be of great help, not only to assess and profile your data, but also to help you standardize it. After all, your key objective is to drive efficiency while managing your equipment from the uniform and quality-proofed view that your MDM provides. This should be an ongoing effort, requiring strong usability from your data quality component together with reporting and monitoring capabilities.

Once populated, the MDM may become the place where the master data is authored, which requires authoring capabilities, together with workflow in the case of collaborative authoring.

Even more importantly, your MDM should become a hub that can remotely trigger actions on the objects that it references. Strong ESB capabilities, including advanced security, fault tolerance and audit trail management, are then required.

Jean-Michel

 

Key Capabilities of MDM for Regulated Products (MDM Summer Series Part 9)


In this “summer series” of posts dedicated to Master Data Management for Product Data, we’ve walked through what we identified as the five most frequent use cases of MDM for product data. Now we are looking at the key capabilities that are needed in an MDM platform to address each of these use cases. In this post, we address MDM for Regulated Products, which is about using MDM to support compliance with product-related regulations or to facilitate the exchange of product data between business partners.

MDM for Regulated Products deals with standard codification. One of the most well-established standards bodies for product data is GS1, known not only for providing the Global Trade Item Number (GTIN) as the universal identifier for consumer goods and healthcare products, but more generally for standards enabling capture, identification, classification, sharing and traceability of product data between business partners. According to Wikipedia, “GS1 has over a million member companies across the world, executing more than six billion transactions daily using GS1 standards”.

Complying with such standards requires your MDM to be comfortable with, and make it easy to handle, the relatively complex semi-structured data that those standards mandate, such as EDI or XML formats (for example, GS1 lets you choose between the “traditional” EDI format, EANCOM, and GS1 XML). The MDM platform should also allow data mappings between those well-defined standards and internal data structures, and let you interactively view your products according to both your internal, “proprietary” views and the standardized ones. Modeling capabilities such as hierarchy management are important too, as is inheritance, to make sure that standards are embedded into your specific data models and that changes are automatically applied to all the data structures that aim to conform to a standard.
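As a rough illustration of mapping between an external XML message and an internal product view, here is a minimal Java sketch using the standard DOM parser. The element names and the message layout are invented for the example; a real GS1 XML message is far richer and would normally be handled with schema-aware mapping tooling rather than hand-written parsing.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

/** Illustrative mapping from a simplified, made-up product XML message to internal fields. */
public class ProductXmlMapper {

    static void printMapped(String xml) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document doc = builder.parse(new InputSource(new StringReader(xml)));
        // Element names are invented for the example; a real GS1 message is far richer.
        String gtin = doc.getElementsByTagName("gtin").item(0).getTextContent();
        String description = doc.getElementsByTagName("description").item(0).getTextContent();
        System.out.println("Internal view -> GTIN: " + gtin + ", description: " + description);
    }

    public static void main(String[] args) throws Exception {
        String message = "<tradeItem><gtin>00012345678905</gtin>"
                       + "<description>Sparkling water 1L</description></tradeItem>";
        printMapped(message);
    }
}
```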

This use case also mandates strong capabilities from your data quality components, especially in terms of parsing, standardization, entity resolution and reconciliation. This may be the starting point to get standardized classifications out of your legacy product data, as product categories may have been initially coded as long freeform text in legacy systems rather than as well-defined, structured attributes.

Interfaces may be as “basic” as import and export, but they could be much more sophisticated when the goal is to connect business partners and regulatory institutions in real time. Security and access control, and other high-end capabilities found in an MDM that fully integrates an Enterprise Service Bus, such as fault tolerance or audit trails, then become critical. Workflow capabilities for data authoring and compliance checking might be important in that respect as well.

Continued in Part 10.

Jean-Michel

Key Capabilities of MDM for Product Information Management (MDM Summer Series Part 10)


In this “summer series” of posts dedicated to Master Data Management for Product Data, we’ve walked through what we identified as the five most frequent use cases of MDM for product data. Now we are looking at the key capabilities that are needed in an MDM platform to address each of these use cases. In this post, we address MDM for Product Information Management, which is about managing the customer-facing side of build-to-stock products.

MDM for Product Information Management needs both a strong front-end and a strong back-end. The front-end supports the process of referencing products and making sure the catalog holds the most complete and accurate data for efficient distribution. Some consider that this is the role of a commercial off-the-shelf product that comes with pre-configured data models, business processes and workflows; others prefer to work with a platform that they can customize to their own needs and connect to separate front-end platforms such as e-commerce applications. Indeed, the choice depends on the gap between the business need and what the packaged software proposes out of the box.

In the platform case, the solution must provide strong search capabilities and easily integrate with third-party solutions that provide them, such as online catalogs, digital asset management solutions or search-based applications. Collaborative authoring capabilities, workflows and Business Process Management are important capabilities to consider, too, as is the ability to categorize products according to classification standards (such as the GS1 Global Product Classification or eCl@ss) or custom hierarchies.

No matter what the choice is for the front-end side of PIM, the back-end side will be very important too. Its goal is to integrate the incoming data and to make sure that it is quality-proofed before the product data is delivered to customer-facing activities. Strong rule-based data quality capabilities should be provided to assess the incoming data, measure and analyze its quality, delegate its stewardship to the right stakeholders in case of inaccuracy, and control the overall process.

Then the solution must connect to business partners. This can be done through a supplier portal, through open APIs, or by connecting to a Global Data Synchronization Network. The ability to map to standard formats and to connect to those external exchanges, where applicable, is then needed.

Jean-Michel

 

Key Capabilities of MDM for Anything, and Wrap-up (MDM Summer Series Part 11)


In this “summer series” of posts dedicated to Master Data Management for Product Data, we’ve walked through what we identified as the five most frequent use cases of MDM for product data. Then, we looked at the key capabilities that are needed in an MDM platform to address each of these use cases. In this last post of the series, we address the key capabilities needed for MDM for Anything, which is about dealing with the things that you produce and/or the things that you use to produce them, when those things don’t fit the four other facets of product master data described in this series.

MDM for Anything refers to all the master data about products and things that are very specific to an industry, a use case, or an enterprise. As this is specific, you will have to define on a case-by-case basis what is needed from your MDM solution, in terms of modeling, data quality, data accessibility, data stewardship, master data services and so on. In any case, the flexibility of the solution will be key. By flexibility, I mean that the MDM solution should allow you to design very specific data models, to connect easily to any source of data and, where needed, to applications in real time.

The ability to adapt to any MDM implementation style, from very distributed to centralized, and from “after the fact” batch models to highly transactional, is important too. According to Gartner, this corresponds to a new software category called multivector MDM software, defined as “the application of multidomain MDM across multiple industries, use case, organizational structures and implementation styles”.

This post concludes our summer series on Master Data Management for Product Data. We hope you enjoyed reading it! For your convenience, I am including here links to all posts of the series:

For each facet, the first link covers the use case and business benefits, and the second the capabilities needed from an MDM solution:

  • Introduction: The Five Facets of Product Data
  • MDM for Material Data / Key Capabilities of MDM for Material Data
  • MDM for Lean Managed Services / Key Capabilities of MDM for Lean Managed Services
  • MDM for Regulated Products / Key Capabilities of MDM for Regulated Products
  • Product Information Management / Key Capabilities of MDM for Product Information Management
  • MDM for Anything / Key Capabilities of MDM for Anything (this post)

In addition, you can look at this presentation on MDM and Talend MDM for Product Data: http://www.slideshare.net/jmfranco/talend-mdm-for-product-data.

Jean-Michel

 

Welcoming our New Head of Engineering


Today marks a major new milestone in Talend’s journey: we’re thrilled to announce that Laurent Bride is joining us as CTO.  Laurent is a terrific manager and leader as well as a strong technologist who brings a wealth of experience to lead our engineering team in the coming years.  Most recently, Laurent was CTO at Axway, responsible for R&D, Innovation and Product Management. His role was to take Axway’s products to the next level while ensuring the quality and security of the solutions. Laurent was also very involved in M&A and post-integration activities.  Laurent spent more than nine years in Silicon Valley, with Business Objects and then SAP. During his tenure, Laurent developed deep expertise in enterprise software development, working with multi-national teams across the globe. His last role at SAP was SVP of Advanced Development, leading a 350-person team of developers building next-generation mobile, cloud, real-time analytics, M2M and big data solutions.  Laurent holds an engineering degree in mathematics and computer science from EISTI.

This milestone is a good opportunity to reflect on the remarkable work the team has done in the last 8 years.  Starting from nothing in 2006 and shipping the first product in 2007, the team built Talend into the leading open source integration company in the world.  They built a world-class team with a killer work ethic and consistent delivery, supported by leading-edge engineering processes and systems that will help the company continue to scale with the highest levels of quality and reliability. Their work made Talend the best in the world at what we do. Laurent has the privilege of starting with a winning team and continuing the growth and evolution that Fabrice and Cedric started.  We will do extraordinary things together in the coming years.

Mike

 


SPARK Certified


As the move to the next generation of integration platforms gains momentum, the need to implement a proven and scalable technology is critical. Databricks and Spark, delivered on the major Hadoop distributions, are one such area where delivering massively scalable technology with low-risk implementation is really key.

At Talend we see a wide array of batch processes moving to an operational and real-time perspective, driven by the consumers of the data. In this vein, the uptake in adoption and the growing community of Apache Spark, the powerful open-source processing engine, have been hard to miss.  In a relatively short time, it has become a part of every major Hadoop vendor’s offering, is the most active open source project in the Big Data space, and has been deployed in production across a number of verticals.

Traditional ETL and enterprise integration have limitations, in the timeliness of extracts, the speed of loads and, importantly, the supportability of the integration infrastructure itself. Spark delivers a new execution topology, allowing your Big Data platform to deliver much more than just ‘traditional’ Hadoop MapReduce tasks.

The power of integrating Spark and Talend Studio is that Talend users get to immediately harness the power of Spark, all 80+ operations and sophisticated analytics, directly from the familiar and easy-to-use Talend Studio interface. With Talend and its unique approach to code generation, the code required to load data and execute a query in Spark is managed for you.  The designers simply need to identify the data; Talend then provides the tools to deliver the data at the right time, in the right format and into the desired Hadoop environment.
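To give a feel for the kind of code this replaces, here is a minimal, hand-written sketch of a Spark job using the Spark Java API: it loads a delimited file from HDFS and counts orders per country. The class name, field layout and HDFS paths are illustrative assumptions, not what Talend actually generates.

```java
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

/** Hand-written sketch of a Spark job: load a delimited file from HDFS and aggregate it. */
public class OrdersByCountry {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("OrdersByCountry");
        JavaSparkContext sc = new JavaSparkContext(conf);

        // Illustrative input: lines such as "orderId;country;amount"
        JavaRDD<String> lines = sc.textFile("hdfs:///data/orders.csv");

        // Count orders per country, entirely inside the cluster.
        JavaPairRDD<String, Integer> counts = lines
            .map(line -> line.split(";"))
            .filter(fields -> fields.length >= 2)
            .mapToPair(fields -> new Tuple2<>(fields[1], 1))
            .reduceByKey((a, b) -> a + b);

        counts.saveAsTextFile("hdfs:///data/orders_by_country");
        sc.stop();
    }
}
```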

Additionally, we’re thrilled to announce, in conjunction with Databricks, that Talend Studio is now officially “Certified on Spark”.  With the certification of Talend 5.5, the interoperability of integration jobs created with Talend, and their execution on any Certified Spark Distribution, is guaranteed.  It also means that as a technology user you can benefit from the power of the platforms without having to maintain your own detailed roadmap of component upgrades, updates and continued job refactoring; Talend manages this for you.  More broadly speaking, Talend is also supportive of the open and transparent nature of the certification process, which is designed to maintain compatibility within the Spark ecosystem while simultaneously encouraging innovation at the platform and application layers.

Beyond the technical integration, Talend Labs worked closely with the R&D team, based in Paris, to create an end-to-end scenario to showcase the key features and functions of the integrated Spark solution. This means that users have a fully-functional starting point, available from Talend and proven with Spark, to get you started on your journey.

 

Talend is always evolving its certification in line with its key partners and the Big Data Ecosystem, and Spark is no exception. With such a fast moving project, significant features and improvements are being rolled out rapidly. Talend is committed to supporting Spark and will be moving fast to certify and ensure compatibility with future Spark releases.

Gavin

Data Quality Everywhere


Data Quality follows the same principles as other well-defined quality processes: it is all about engaging an improvement cycle to define and detect, measure, analyze, improve and control quality.

This doesn’t happen at one time, or in one place. It should be an ongoing effort, and that is often neglected when dealing with data quality. Think about the big ERP, CRM or IT consolidation projects where data quality gets high attention during the rollout, and then fades away once the project is delivered.

The ubiquitous nature of quality is key, as well. We have all experienced that in the physical world: if you are a car manufacturer, you had better have many quality checks across your manufacturing and supply chain. And you had better identify the problems and root causes in the processes as early as possible in the chain. Think about the cost of recalling a vehicle at the end of the chain, once the product has been shipped to the customer, as experienced recently by Toyota, which recalled 6 million vehicles at an estimated cost of $600 million. Quality should be a moving picture too: while progressing on your quality cycle, you have the opportunity to move upstream in your process. Take the example of General Electric, known for years as best in class for turning quality methodologies such as Six Sigma into the core of its business strategy. Now they are pioneering the use of Big Data for the maintenance process in manufacturing.  Through this initiative, they moved beyond detecting quality defects as they happen: they are now able to predict them and perform preventive maintenance in order to avoid them.

What has been experienced in the physical world of manufacturing applies in the digital world of information management as well. This means that you need to be able to position data quality controls and corrections everywhere in your information supply chain. I see six usage scenarios for that.

The first one is applying data quality when data needs to be repurposed. This usage scenario is not new: it gave birth to the principles of data quality in IT systems. Most companies adopted it in the context of their business intelligence initiative: it mandates consolidating data from multiple sources, typically operational systems, and getting it ready for analysis. To support this usage scenario, data quality tools can be provided as stand-alone tools with their own data marts or, even better, tightly bundled with data integration tools.

A similar usage scenario, but “on steroids”, happens in the context of Big Data. In this context, the role of data quality is to add a fourth V, Veracity, to the well-known three V’s defining Big Data: Volume, Variety and Velocity. We will cover Velocity later in the article. Managing extreme Volumes mandates new approaches for processing data quality: quality controls have to move to where the data is, rather than the other way around. Technically speaking, this means that data quality should run natively on big data environments such as Hadoop, and leverage their native distributed processing capabilities, rather than operate on top as a separate processing engine. Variety is also an important consideration: data may come in different forms, like files, logs, databases, documents, or data interchange formats such as XML or JSON messages. Data quality then needs to turn the “oddly” structured data often seen in Big Data environments into something that is more structured and can be connected to the traditional enterprise business objects, like customers, products, employees and organizations. Data quality solutions should therefore provide strong capabilities in terms of profiling, parsing, standardization and entity resolution. Those capabilities can be applied before the data is stored, and designed by IT professionals; this is the traditional way to deal with data quality. Or data preparation can be delivered on an ad-hoc basis at run time, by data scientists or business users; this is sometimes referred to as data wrangling or data blending.

The third usage scenario lies in the ability to create data quality services. Data quality services allow applying data quality controls on demand. An example could be a web site with a web form to capture customer contact information. Instead of letting a web visitor type any data he wants into a web form, a data quality service could apply data quality checks in real time. This gives the opportunity to check information like e-mail, address, company name, phone number, etc. Even better, it can automatically identify a customer without requiring him to explicitly log on and/or provide contact information, as social networks or best-in-class websites and mobile applications like Amazon.com already do. Going back to our automotive example, this case provides a way to cut the costs of data quality: such controls can be applied at the earliest steps of the information chain, even before erroneous data enters the system. Marketing managers may be the best people to understand the value of such a usage scenario: they struggle with the poor quality of the contact data they get through the internet. Once it has entered the marketing database, poor data quality becomes very costly and badly impacts key activities such as segmenting, targeting and calculating customer value. Of course, the data can be cleansed at later steps than when a prospect or customer fills in a web form, but this requires significant resolution efforts, and the related cost is then much higher than if the data is quality-proofed at the entry point.
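As an illustration, here is a minimal JAX-RS sketch of such an on-demand data quality service that checks an e-mail and a phone number at form-submission time. The endpoint path, class name and regular expressions are assumptions made for the example; a production service (for instance one exposed from a data quality job) would add address verification, enrichment and duplicate checks.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;
import javax.ws.rs.GET;
import javax.ws.rs.Path;
import javax.ws.rs.Produces;
import javax.ws.rs.QueryParam;
import javax.ws.rs.core.MediaType;
import javax.ws.rs.core.Response;

/** Minimal sketch of an on-demand data quality service for web form input. */
@Path("/dq/contact")
public class ContactQualityService {

    // Deliberately simple, illustrative patterns; production checks are much richer.
    private static final Pattern EMAIL = Pattern.compile("^[\\w.+-]+@[\\w-]+\\.[\\w.]+$");
    private static final Pattern PHONE = Pattern.compile("^\\+?[0-9 ().-]{7,20}$");

    @GET
    @Produces(MediaType.TEXT_PLAIN)
    public Response check(@QueryParam("email") String email,
                          @QueryParam("phone") String phone) {
        List<String> issues = new ArrayList<>();
        if (email == null || !EMAIL.matcher(email).matches()) {
            issues.add("email: invalid format");
        }
        if (phone != null && !phone.isEmpty() && !PHONE.matcher(phone).matches()) {
            issues.add("phone: invalid format");
        }
        // Reject bad input before it ever reaches the marketing database.
        if (!issues.isEmpty()) {
            return Response.status(Response.Status.BAD_REQUEST)
                           .entity(String.join("; ", issues)).build();
        }
        return Response.ok("clean").build();
    }
}
```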

Then, there is quality for data in motion. It applies to data that flows from one application to another, for example an order that goes from sales to finance and then to logistics. As explained in the third usage scenario, it is a best practice that each system implements gatekeepers at the point of entry in order to reject data that doesn’t match its data quality standards. Data quality then needs to be applied in real time, under the control of an Enterprise Service Bus. This scenario can happen inside the enterprise, behind its firewall; this is the fourth usage scenario. Alternatively, data quality may also run in the cloud, and this is the fifth scenario.
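Such a gatekeeper for data in motion can be sketched as a mediation route; the example below uses the Apache Camel Java DSL, which underpins Talend ESB routes. The queue names, the XML payload layout and the single XPath-based quality rule are illustrative assumptions, not a prescription.

```java
import org.apache.camel.builder.RouteBuilder;

/** Sketch of a gatekeeper route: quality-check orders in motion before they reach finance. */
public class OrderGatekeeperRoute extends RouteBuilder {
    @Override
    public void configure() {
        // Queue names are illustrative placeholders; orders are assumed to arrive as XML.
        from("activemq:queue:orders.incoming")
            .choice()
                // Minimal quality rule: the order must carry a customer id and a positive amount.
                .when(xpath("/order/customerId != '' and /order/amount > 0"))
                    .to("activemq:queue:orders.validated")
                .otherwise()
                    // Quarantine rejected messages for data stewards to review.
                    .to("activemq:queue:orders.rejected");
    }
}
```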

The last scenario is data quality for Master Data Management (MDM). In that context, data is standardized into a golden record, while the MDM acts as a single point of control. Applications and business users share a common view of the data related to entities such as customers, employees, products, chart of accounts, etc. The data quality then needs to be fully embedded in the master data environment and to provide deep capabilities in terms of matching and entity resolution.

Designing data quality solutions so that they can run across those scenarios is a driver for our design at Talend. Because one of the core capabilities of our unified platform is to generate code that can run everywhere, our data quality processing can run in any context, which we believe is a key differentiator. Our data quality component is delivered as a core component in all our platforms: it can be embedded into a data integration process, deployed natively in Hadoop as a MapReduce job, and exposed as a data quality service to any application that needs to consume it in real time. It is delivered as well as a key capability of our application integration platform, our upcoming cloud platform, and, indeed, our MDM platform. Even more importantly, data quality controls can move up the information chain over time. Think about customer data that is initially quality-proofed in the context of a data warehouse through our data integration capabilities. Later, through MDM, this unified customer data can be shared across applications. In this context, data stewards can learn more about the data and be alerted when it is erroneous. This will then help to identify the root cause of bad data quality, for example a web form that brings junk e-mails into the customer database. Data services can then come to the rescue to prevent erroneous data input on the web form, and reconcile the entered data with the MDM through real-time matching. And finally, Big Data could provide an innovative approach to identity resolution, so that the customer can be automatically recognized by a cookie after he opts in, allowing the web form to be retired.

Indeed, such a  process doesn’t happen in one day. But remember the key principle of quality mentioned at the beginning of this post. Continuous improvement is the target!

Jean-Michel

 

Turning a Page


At the end of October, I will be leaving Talend, after more than 7 years leading its marketing charge. It has been quite a ride – thrilling, high octane, wearing at times, but how rewarding.

And indeed, how rewarding it is to have witnessed both the drastic change of open source over the years, and the rise of a true alternative response to integration challenges.

Everyone in the open source world knows this quote from the Mahatma Gandhi:

"First they ignore you, then they laugh at you, then they fight you, then you win."

And boy, do I recall our initial discussions with industry pundits and experts; not all of them were believers. I also remember the first struggles to convince IT execs of the value of our technology (even though their developers were users). And the criticism from “open source purists” about the “evil” open core model.

It would be preposterous to say that Talend has won the battle. But it is clearly fighting for (and winning) its fair share of business. And anyway, what does “winning the battle” mean in this context? We never aimed at putting the incumbents out of business (ok, maybe after a couple drinks, we might have boasted about it), but our goal has always been to offer alternatives, to make it easier and more affordable to adopt and leverage enterprise-grade integration technology.

Over these years, it has been a true honor to work with the founding team, with the world-class marketing team we have assembled, and of course with all the people who have made Talend what it is today. We can all be proud of what we have built, and the future is bright for Talend. The company is extremely well positioned, at the forefront of innovation, and with a solid team to take it forward, to the next step (world domination – not really, just kidding).

This is a small world, and I won’t be going very far, I am sure. But in the meanwhile, since I won’t be contributing to the Talend blog anymore, I will start blogging about digitalization: of the enterprise, of society, of anything I can think about, really. And I might even rant about air travel or French strikes every now and then. I hope you will find it interesting.

Digitally yours,

Yves
@ydemontcheuil
Connect on LinkedIn
 

More Action, Less Talk - Big Data Success Stories


The term ‘big data’ is at risk of premature over-exposure. I’m sure there are already many who turn off when they hear it – thinking there’s too much talk and very little action. In fact, observing that ‘many companies don’t know where to start with big data projects’ has become the default opinion within the IT industry.

I however stand by the view that integration and analysis of this big data stands to transform today’s business world as we know it. And while it’s true that many firms are still unsure how and where to begin when it comes to drawing value from their data, there is a growing pool of companies to observe. Their applications might all be different; they may tend to be larger corporations rather than mid-range businesses, but there is no reason why companies of any size can’t still look and learn.

I was thinking this when several successful examples of how large volumes of data can be integrated and analysed came my way this week. The businesses involved were all from different industry sectors, from frozen foods to France’s top travel group.

What they have in common is that consumer demand, combined with the strength of competition in their own particular industry, is driving the need to gain some kind of deeper understanding of their business. For the former, Findus, this involves improving intelligence around its cold supply chain and gaining complete transparency and traceability.

For Karavel Promovacances, one of the largest French independent travel companies, it is more a question of integrating thousands upon thousands of travel options, including flights and hotel beds, and doing it at the speed that today’s internet users have come to expect. A third company, Groupe Flo, is creating business intelligence on the preferences of the 25 million annual visitors to the firm’s over 300 restaurants.

Interestingly, the fourth and final case study involves a company which is dedicated to data. OrderDynamics analyses terabytes of data every day from its big-name retailer customers, such as Neiman Marcus, Brooks Brothers, and Speedo, to provide real-time intelligence and recommendations on everything from price adjustments and inventory re-orders to content alterations.

As I said, these are four completely different applications from four companies at the top of their own particular games. But these applications are born from the knife-edge competitive spirit they need in order to maintain their positions. A need that drives innovation and inventiveness and turns the chatter about new technologies into achievement.

This drive or need won’t remain in the upper echelons of the corporate world forever. An increasing number of mid-range and smaller companies are discovering that there are open source solutions now on the market that effectively address the challenge of large-scale volumes. And, importantly, that they can tackle these projects cost-effectively.

This is bound to turn up the heat across the mainstream business world. In a recent survey by the Economist Intelligence Unit, 47% of executives said that they don’t expect to increase investments in big data over the next three years (with 37% referencing financial constraints as their barrier).  However, I believe this caution will soon give way as more firms learn of the relatively low cost of entry and, perhaps more significantly, as they see competitors inch ahead using big data-fueled business intelligence.

In other words, I expect to hear less talk and rather read more success stories in the months to come. Follow the links below to learn more about real world data success stories in high volume:

Karavel Promovacances Group (Travel and Tourism)

OrderDynamics (Retail/e-Tail)

Findus (Food)

Groupe Flo (Restaurant)

What Is a Container? (Container Architecture Series Part 1)


This is the first in a series of posts on container-centric integration architecture.  This first post covers common approaches to applying containers for application integration in an enterprise context.  It begins with a basic definition and discussion of container design patterns.  Subsequent posts will explore the role of containers in the context of enterprise integration concerns.  This will continue with how SOA and Cloud solutions drive the need for enterprise management delivered via service containerization and the need for OSGi modularity.  Finally, we will apply these principles to explore two alternative solution architectures using OSGi service containers.

Containers are referenced everywhere in the Java literature but seldom clearly defined.  Traditional Java containers include web containers for JSP pages, Servlet containers such as Tomcat, EJB containers, and lightweight containers such as Spring.  Fundamentally, containers are just a framework pattern that provides encapsulation and separation of concerns for the components that use them.  Typically the container will provide mechanisms to address cross-cutting concerns like security or transaction management.  In contrast to a simple library, a container wraps the component and typically will also address aspects of classloading and thread control. 

Spring is the archetype container and arguably the most widely used container today.  Originally servlet and EJB containers had a programmatic API.  Most containers today follow Spring’s example in supporting Dependency Injection patterns.  Dependency Injection provides a declarative API for beans to obtain the resources needed to execute a method.  Declarative Dependency Injection is usually implemented using XML configuration or annotations and most frameworks will support both.  This provides a cleaner separation of concerns so that the bean code can be completely independent of the container API.
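As a concrete illustration of declarative Dependency Injection, here is a minimal Spring sketch using annotations and constructor injection; the bean names are invented for the example, and an equivalent XML configuration would achieve the same wiring.

```java
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.context.annotation.AnnotationConfigApplicationContext;
import org.springframework.stereotype.Component;

// The container wires the collaborators; the bean code never calls the container API.
@Component
class GreetingRepository {
    String findGreeting() { return "Hello from the container"; }
}

@Component
class GreetingService {
    private final GreetingRepository repository;

    @Autowired  // constructor injection: the dependency is declared, not looked up
    GreetingService(GreetingRepository repository) { this.repository = repository; }

    String greet() { return repository.findGreeting(); }
}

public class DependencyInjectionDemo {
    public static void main(String[] args) {
        // Register the two bean classes and let the container resolve the wiring.
        AnnotationConfigApplicationContext ctx =
            new AnnotationConfigApplicationContext(GreetingRepository.class, GreetingService.class);
        System.out.println(ctx.getBean(GreetingService.class).greet());
        ctx.close();
    }
}
```

Note that GreetingService never looks anything up from the container; it only declares what it needs, which is what keeps the bean code independent of the container API.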

Containers are sometimes characterized as lightweight containers.  Spring is an example of a lightweight container in the sense that it can run inside of other containers such as a Servlet or EJB container.  “Lightweight” in this context refers to the resources required to run the container.  Ideally a container can address specific cross-cutting concerns and be composed with other containers that address different concerns.

Of course, lightweight is relative, and how lightweight a particular container instance is depends on the modularity of the container design as well as how many modules are actually instantiated.  Even a simple Spring container running in a bare JVM can be fairly heavyweight if a full set of transaction management and security modules is installed.  But in general a modular container like Spring will allow configuration of just those elements which are needed.

Open Source Containers typically complement Modularity with Extensibility.  New modules can be added to address other cross-cutting concerns.  If this is done in a consistent manner, an elegant framework is provided for addressing the full spectrum of design concerns facing an application developer.  Because containers decouple the client bean code from the extensible container modules, the cross-cutting features become pluggable.  In this manner, open source containers provide an open architecture foundation for application development.

Patterns are a popular way of approaching design concerns and they provide an interesting perspective on containers.  The Gang of Four Design Patterns[1] book categorized patterns as addressing Creation, Structure, or Behavior.  Dependency Injection can be viewed as a mechanism for transforming procedural Creation code into Structure.  Containers such as Spring also have elements of Aspect-Oriented Programming, which essentially allows Dependency Injection of Behavior.  This allows transformation of Behavioral patterns into Structure as well.  This simplifies the enterprise ecosystem because configuration of structure is much more easily managed than procedural code.

Talend provides an open source container using Apache Karaf.  Karaf implements the OSGi standard, which provides additional modularity and dependency management features that are missing from the Java specification.  The Talend ESB also provides a virtual service container based on enterprise integration patterns (EIP) via Apache Camel.  Together these provide a framework for flexible and open solution architectures that can respond to the technical challenges of Cloud and SOA ecosystems.
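For readers new to OSGi, here is a minimal, illustrative sketch of the kind of module a Karaf container manages: a bundle activator that publishes a service in the OSGi service registry when the bundle starts and withdraws it when the bundle stops. The service interface and its one-line implementation are hypothetical.

```java
import org.osgi.framework.BundleActivator;
import org.osgi.framework.BundleContext;
import org.osgi.framework.ServiceRegistration;

/** Hypothetical service contract other bundles can look up without compile-time coupling. */
interface Greeter {
    String greet(String name);
}

/** Bundle activator: the OSGi container calls start/stop as the bundle lifecycle changes. */
public class GreeterActivator implements BundleActivator {

    private ServiceRegistration<Greeter> registration;

    @Override
    public void start(BundleContext context) {
        // Publish an implementation in the service registry when the bundle starts.
        registration = context.registerService(Greeter.class, name -> "Hello, " + name, null);
    }

    @Override
    public void stop(BundleContext context) {
        // Withdraw the service when the bundle is stopped or updated.
        registration.unregister();
    }
}
```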

[1] Gamma, Erich; Helm, Richard; Johnson, Ralph; Vlissides, John (1994). Design Patterns: Elements of Reusable Object-Oriented Software. Addison-Wesley.

 
