by Ana Jimenez Santamaria, Floor Drees, Natali Vlatko & Mirko Boehm
Track: Keynote
Room: Melody
The panel will discuss how EU legislation affects the daily work of open source operations (upstream contribution to open source projects, open source compliance, etc.), focusing on how these laws impact open source professionals working in OSPOs or similar entities. Panelists will cover some of the recent policy updates, the challenges of staying compliant when managing open source contribution and usage within organizations, and their personal experiences in adapting to the changing European regulatory environment.
At the Community Over Code Europe 2024, the annual conference of the Apache Software Foundation, join us for an insightful session on understanding the core principle of ‘Community Over Code’. This talk will delve into how this philosophy shapes the foundation’s approach to software development. We’ll explore the significance of prioritizing a collaborative, inclusive community and how this fosters innovation and sustainability in open source projects. Attendees will learn about the practical implications of this ethos in Apache’s day-to-day operations and its impact on the broader open-source ecosystem.
Monitoring and management go hand in hand in distributed storage systems, and this is true for Apache Cassandra as well. Apache Cassandra has a wide range of monitoring and management API extensions that provide insight into its internal processes and make management operations accessible.
This talk will provide an overview of several new initiatives that are closely related from a software developer’s perspective. These initiatives follow the same design principles and their overall direction can be characterized as a strategic shift in both management and monitoring from JMX to CQL.
Retrieval-Augmented Generation (RAG) is probably one of the most popular implementations of LLMs that integrates retrieval and generation models, augmenting AI’s understanding of text and improving response accuracy through an information database.
This approach tackles the limitations of traditional generation models by fusing retrieval mechanisms, and enriching outputs with contextual depth and external knowledge.
Evaluating applications using LLMs, such as RAG, is pivotal for confidence and improvement, yet faces challenges like subjectiveness with respect to domain-specific suitability.
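To make the retrieval-then-generation flow concrete, here is a minimal sketch of RAG's two stages. The term-overlap scoring and the document texts are purely illustrative; real systems rank with dense embeddings from a language model.

```python
# Toy sketch of RAG: retrieve the best-matching documents, then augment
# the prompt with them before generation. Term overlap stands in for the
# embedding-based similarity a real system would use.
def retrieve(query, documents, k=2):
    """Return the k documents sharing the most terms with the query."""
    terms = set(query.lower().split())
    return sorted(
        documents,
        key=lambda d: len(terms & set(d.lower().split())),
        reverse=True,
    )[:k]

def build_prompt(query, documents):
    """Augment the query with retrieved context before generation."""
    context = "\n".join(retrieve(query, documents))
    return f"Context:\n{context}\n\nQuestion: {query}"

docs = [
    "Cassandra is a distributed NoSQL database",
    "Solr is a search platform built on Lucene",
    "IoTDB is a time series database for IoT",
]
prompt = build_prompt("what is a distributed database", docs)
```

The evaluation challenge mentioned above starts exactly here: whether the retrieved context was actually relevant is domain-specific and hard to score automatically.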
Apache IoTDB is a time series database focused on IoT workloads. At the end of 2022, major version 1.0 was released, which contained many significant changes, e.g. a completely new cluster module, new export and import options, and new APIs to integrate with Apache IoTDB.
In 2023, three minor releases followed, namely 1.1, 1.2 and 1.3; although they are considered minor releases, they contain many new features.
The significance of responsible and ethical AI systems has gained immense prominence on the global stage, underscoring the escalating recognition of its far-reaching impact on societies worldwide. Lately, diverse groups and individuals have transitioned from relying solely on Free Software licenses for their projects to pioneering new forms of licensing solutions which impose restrictions related to fields of endeavour, behaviour, community management and commercial practices. This practice has now spilled over to creation of suo moto ethics codes for AI, leading to creation of licenses with restrictive characteristics.
The Accord Consensus Protocol provides global, leaderless, single-network-round-trip consensus using commodity clocks.
Research from the University of Michigan & Apple Inc. introduces ACID-compliant, strict serialisable transactions that can run globally at scale, at high throughput, with low latency.
This will be a run-through of
the importance of ACID transactions in Apache Cassandra,
how previous consensus protocols work,
how Accord improves on these to provide its industry-leading characteristics.
Vector-based search gained incredible popularity in the last few years: Large Language Models fine-tuned for sentence similarity proved to be quite effective in encoding text to vectors and representing some of the semantics of sentences in a numerical form.
These vectors can be used to run a K-nearest neighbour search and look for documents/paragraphs close to the query in a n-dimensional vector space, effectively mimicking a similarity search in the semantic space (Apache Solr KNN Query Parser).
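The contract of a KNN search can be shown in a few lines of exact, brute-force code; this is an illustrative sketch only, since Solr and Lucene use an approximate HNSW graph index to make the search tractable at scale.

```python
import math

# Exact (brute-force) k-nearest-neighbour search by cosine similarity
# over a toy 3-dimensional vector space. Document ids and vectors are
# illustrative stand-ins for encoded paragraphs.
def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def knn(query, index, k=2):
    """Rank document vectors by similarity to the query vector."""
    ranked = sorted(index.items(), key=lambda item: cosine(query, item[1]), reverse=True)
    return [doc_id for doc_id, _ in ranked[:k]]

index = {
    "doc_cats": [0.9, 0.1, 0.0],
    "doc_dogs": [0.8, 0.2, 0.1],
    "doc_cars": [0.0, 0.1, 0.9],
}
```

A query vector close to `doc_cars` in this space will rank it first, mimicking a similarity search in the semantic space.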
Apache Mynewt is a community-driven, permissively licensed Open Source initiative for constrained, embedded devices and applications. It provides foundational RTOS, middleware (secure bootloader, filesystem, networking stack, device management) and tooling.
This presentation will cover the history and overall state of the project, its architecture and selected components (e.g. the Bluetooth stack). It will discuss already available features as well as additions planned for the near future.
Last but not least, the state of the community will be presented.
Mentorship and outreach programs are often considered as side projects. Although they are a nice way to spend time and have some fun, one may say they rarely add new long-term contributors to your company project or a community. Is it true? Is it even the main goal? Or is it about team bonding and growing new maintainers and community leaders?
Let’s talk about organizing mentorship programs so that they help to grow your current community and contributors.
Cassandra predicted the fall of Troy and no one heeded her warning. Over the years, we’ve learned a lot at Bloomberg about running Apache Cassandra at scale. In this talk, we’ll discuss some of the mistakes we’ve made using Cassandra, how we found and remedied them, and what you can do to avoid them in the future.
Zomato is the Indian market leader for restaurant aggregation and food delivery.
This is a story of how Zomato leverages the power of Apache Solr at scale, some of the problems we faced and how we tackled them to reach where we are now.
We started with just one Solr instance serving 1,000 queries a day across 10K restaurants, and went on to build and ship a massive, first-in-class search server capable of seamlessly searching 100 million restaurant catalogues (SKUs), 800K+ restaurants and 50K+ unique dishes in 13 different languages at 64 million search queries a day.
In industrial automation, the established way of collecting data from industrial equipment has many issues. Apache provides a number of great ways to avoid these issues.
In this talk, I want to demonstrate some of the issues I have seen and how we can resolve all of them with combinations of some of the amazing projects we have at Apache:
How we can use Apache TsFile to directly collect data on the hardware
Managing an Open Source Program Office (OSPO) team is undoubtedly a unique experience. In this talk I will discuss the unique challenges and best practices for managing software developers in an OSPO environment. I will cover topics such as managing remote teams, maintaining the career progression of OSPO developers, fostering a culture of collaboration and collaborating with the open source community. I will also explore the role of performance metrics in managing software developers in an OSPO, how to align those metrics with the goals of the organisation and how to help developers balance the needs of the organisation with the needs of the open source community.
As Cassandra clusters and users onboard to multi-cloud platforms, users may access Cassandra clusters from on-premise environments as well as from various cloud platforms. Admins may need to restrict certain users/teams to access from certain IP ranges, aka CIDR groups. They may also need to restrict superuser credentials from being used from third-party clouds.
The CIDR filtering authorizer lets admins allow or disallow user access from different CIDR groups, which can help prevent misuse of copied or hacked credentials.
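The core check behind such an authorizer can be sketched with the standard `ipaddress` module. The group names and rule layout below are hypothetical illustrations, not Cassandra's actual configuration schema.

```python
import ipaddress

# Illustrative CIDR-group filtering: each user is allowed to connect only
# from the CIDR groups assigned to them. Names and ranges are made up.
CIDR_GROUPS = {
    "on_prem": ["10.0.0.0/8"],
    "cloud_a": ["203.0.113.0/24"],
}

ALLOWED_GROUPS = {
    "analyst": {"on_prem"},
    "superuser": {"on_prem"},            # superusers blocked outside on-prem
    "service": {"on_prem", "cloud_a"},
}

def is_allowed(user, client_ip):
    """True when the client IP falls in any CIDR group allowed for the user."""
    addr = ipaddress.ip_address(client_ip)
    return any(
        addr in ipaddress.ip_network(cidr)
        for group in ALLOWED_GROUPS.get(user, set())
        for cidr in CIDR_GROUPS[group]
    )
```

With rules like these, a stolen superuser credential used from a third-party cloud range is rejected even though the password is valid.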
Vector Search has been regarded as a revolution for Search and Information Retrieval since the breakthrough with BERT and GPT in 2018. There has been an ever growing interest from both academia and industry in using machine learning models and vector search to leverage all sorts of content beyond the lexical features, including images, events, contextual semantics and so on.
At Uber, we added Vector Search support to our Search platform by leveraging Apache Lucene, empowering multiple business-critical use cases such as Semantic Search and Gen AI support.
Apache Druid is a real-time analytics database built for speed and scale, capable of executing complex queries against billions of rows and getting sub-second answers. Druid thrives on highly concurrent workloads, making it ideal for applications like website clickstream analysis, network performance monitoring, or handling vast IoT metrics. By using pre-aggregated data, lightning-fast columnar storage, and parallel processing we can gain insights in real-time.
In this talk, we will share techniques for improving query concurrency and achieving sub-second responses.
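The pre-aggregation idea mentioned above can be sketched in a few lines. The minute granularity, dimension names, and metrics here are illustrative, not Druid's actual ingestion spec.

```python
from collections import defaultdict
from datetime import datetime

# Sketch of ingestion-time rollup: raw events are pre-aggregated into one
# row per (minute, page), so queries touch far fewer rows at read time.
def rollup(events):
    agg = defaultdict(lambda: {"count": 0, "latency_sum": 0.0})
    for ts, page, latency_ms in events:
        minute = ts.replace(second=0, microsecond=0)
        row = agg[(minute, page)]
        row["count"] += 1
        row["latency_sum"] += latency_ms
    return dict(agg)

events = [
    (datetime(2024, 6, 3, 10, 0, 5), "/home", 12.0),
    (datetime(2024, 6, 3, 10, 0, 42), "/home", 18.0),
    (datetime(2024, 6, 3, 10, 1, 1), "/home", 30.0),
]
table = rollup(events)
```

Three raw events collapse to two stored rows here; at clickstream scale the same trade keeps billions of events queryable in sub-second time.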
As developer communities grow both in size and number, it can be easy to lose sight of the importance of community health in the development process. Most tech communities guide their community efforts and future work based on NPS scores and feedback forms. Though well intended, these methods do not provide a full view of the current status of their community members or guarantee they are moving towards a direction that benefits their members or fulfills their needs.
Choosing a compaction strategy for a Cassandra database has historically been a very difficult problem, where making the wrong choice can have lasting effects on performance while making a change later is a time-consuming and costly process.
The Unified Compaction Strategy, introduced with Cassandra 5, is designed to provide a solution to this problem by effectively handling a diverse range of use cases, including those best suited for leveled, tiered, and time-windowed compaction.
A critical aspect of any table format is the rapid identification of files relevant for a query irrespective of the underlying data volume. The focus of this presentation is on the job planning process in Apache Iceberg, highlighting its efficiency and ability to scale to tens of millions of files. This session will explain how the project leverages a hybrid strategy for planning jobs, seamlessly transitioning between local and distributed execution for optimal performance.
An update on what’s happened inside the Apache PLC4X over the last year. What we have achieved and what we are planning on doing for the near and not-so-near future.
From new protocols, updated APIs, new languages, GUI applications right up to even more supported languages and fully generated driver implementations.
This talk delves into the inner workings of the Apache Software Foundation board, shedding light on its workings and the responsibilities of its board members. Attendees will gain a comprehensive understanding of the ASF board’s governance structure, decision-making processes, and its crucial role in overseeing one of the world’s largest groups of open-source communities. Drawing from real-life experiences, the speaker will share personal insights, challenges faced, and the rewarding aspects of contributing to ASF’s mission.
Cassandra 5.0 now incorporates vector search capabilities powered by DiskANN, an advanced technology developed by Microsoft Research. In this session, we will demonstrate the vector search performance of Cassandra 5.0, juxtaposed with other leading databases. Our benchmarking platform will be utilized to assess a variety of key metrics, including I/O performance as well as the precision and recall accuracy of the search results.
Furthermore, the session will delve into optimization strategies for Cassandra.
Apache Ratis is an open source Java library for the Raft Consensus Protocol. Raft is being used successfully as an alternative to Paxos to implement a consistently replicated log. Raft is proven to be safe and is designed to be simpler to understand. Ratis is a high performance implementation of Raft. Apache Ozone, Apache IoTDB and Alluxio use Apache Ratis for providing high availability and replicating raw data.
Ratis implements all the standard Raft features, including leader election, log replication, membership change and log compaction.
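The majority rule that underpins both leader election and log replication can be sketched briefly. This is a didactic fragment of the Raft commit decision, not code from Ratis itself.

```python
# Sketch of Raft's quorum arithmetic: a candidate wins election, and a
# log entry is committed, once a strict majority of the cluster agrees.
def quorum(cluster_size):
    """Smallest strict majority of the cluster."""
    return cluster_size // 2 + 1

def committed_index(match_index):
    """Highest log index replicated on a majority of members.

    match_index holds, per member (leader included), the highest log
    index known to be stored on that member.
    """
    ranked = sorted(match_index, reverse=True)
    return ranked[quorum(len(match_index)) - 1]
```

For a 5-node cluster with match indexes [9, 7, 5, 7, 4], index 7 is stored on three members, so entries up to 7 are safely committed even if two nodes fail.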
Why do you need another API to handle external traffic when you have the stable Kubernetes Ingress API and dozens of implementations? What problems of the Ingress API does the new Gateway API solve? Does this mean the end of the Ingress API?
In this short talk, Navendu will answer these questions by exploring how Gateway APIs evolved and solved the shortcomings of the Ingress API with hands-on examples using Apache APISIX.
In an increasingly interconnected world, the importance of building diverse and inclusive global communities cannot be overstated. This topic explores the journey of taking a local community and expanding its reach to become a vibrant and diverse global community. By examining strategies, best practices, examples of some successful Chinese opensource communities, this session will provide valuable insights into fostering inclusivity, cultural exchange, and collaboration on a global scale.
Participants will gain a deeper understanding of the challenges and opportunities involved in transitioning a local community to the global stage.
Discover the keys to success when releasing a podling within the Apache Incubator. This talk explores the crucial aspects that the incubator PMC looks for in every release, providing practical tips to pass the IPMC vote and move your project closer to graduation.
Learn about the latest incubator and ASF policies, recent updates you may have missed, and the legal requirements of open source licenses. Gain insights into assembling your NOTICE and LICENSE files effectively, while understanding the reasoning behind specific practices.
In the era of explosive data growth, scalability is paramount for any storage solution. This abstract focuses on the scalability aspects of Apache Ozone, a distributed object storage system designed to handle the ever-increasing demands of modern data-intensive applications.
The session will commence with an up-to-date overview of Apache Ozone, providing insights into its current state, recent enhancements, and its pivotal role in addressing the evolving needs of organizations. Attendees will gain a comprehensive understanding of how Apache Ozone offers scalable, high-performance, and future-ready solutions tailored to the challenges posed by today’s data-intensive applications.
All mature tech stacks nowadays offer infrastructure-related capabilities, either in a standard lib or in 3rd-party libraries, e.g., rate limiting and authorization. While it’s great to have such features, it’s impossible to audit them easily: you’d need to be familiar with the stack and dive deep into the code. This approach just doesn’t scale.
A well-designed system keeps the right feature at the right place. In this talk, I’ll go through all steps toward making your system more easily auditable.
The session will highlight the strategies employed by Apache to foster a more diverse and inclusive environment, emphasizing the importance of mentorship in nurturing new talents and perspectives. This approach not only enriches the Apache community ecosystem but also ensures that it reflects the wide array of users it serves. Attendees will gain insights into the practical steps for implementing similar programs and the profound impact of inclusivity on technology development.
In the ever-evolving landscape of open source projects, the Apache Software Foundation (ASF) stands at the forefront of innovation and community-driven development. Two of its young projects, Apache Training and Apache Wayang, are working on an exciting journey of expansion and inclusivity.
This session is dedicated to showcasing how these projects are opening their doors to a broader audience, including non-technical individuals, thereby fostering a more diverse and robust community, which helps the ASF to continue in solving some of the world’s tech problems by bringing people together.
by Zoltan Borok-Nagy, Péter Rózsa & Noémi Pap-Takács
Track: Big Data Storage
Room: Rhapsody
Apache Impala is a distributed, massively parallel query engine for big data. Initially, it focused on fast query execution on top of large datasets that were ingested via long-running batch jobs. The table schema and the ingested data typically remained unchanged, and row-level modifications were impractical to say the least.
Today’s expectations for modern data warehouse engines have risen significantly. Users now want to have RDBMS-like capabilities in their data warehouses.
Years ago the Service-oriented architecture (SOA) architectural style came along with implementations of web services based on standards like the Web Service Description Language (WSDL) and SOAP. Many of these interfaces are still in place today, as a change requires both the provider and all consumers to agree on a new definition and change their implementations (often without any business value). The underlying infrastructure, sometimes based on Enterprise Service Buses (ESBs), is however often end-of-life and hard to maintain.
The rapid growth of the global open source community has led to the expansion of numerous projects, including the establishment of chapters in diverse regions such as Africa. This talk will explore the unique experiences and insights gained from leading an African chapter of the CHAOSS project, highlighting both the challenges faced and the victories achieved along the way. It will discuss the growth of the open source movement in Africa and emphasize the importance of building a diverse and inclusive community.
In the realm of sustainability, grassroots initiatives often emerge as powerful catalysts for change, driven by the collective wisdom of practitioners.
Our organization, a coalition of hundreds of software practitioners, embodies this ethos, operating on the principles of consensus and practical action. The result? Tangible solutions that directly foster meaningful change.
Enter Impact Framework, an open-source tool designed to quantify the environmental impact of software. It takes observations you can easily gather from running systems such as CPU utilisation, page views, installs, prompts and induces them into environmental impacts like carbon, waste, water.
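The observation-to-impact pipeline described above can be sketched as a simple chain of conversions. Both coefficients below are illustrative placeholder numbers, not values shipped by Impact Framework.

```python
# Hedged sketch of inducing an observation (CPU utilisation over a time
# window) into an environmental impact (grams of CO2). The coefficients
# are illustrative assumptions, not real hardware or grid data.
CPU_TDP_WATTS = 65.0          # assumed processor power draw at full load
GRID_G_CO2_PER_KWH = 400.0    # assumed grid carbon intensity

def carbon_grams(cpu_utilisation, duration_hours):
    """Convert a utilisation observation into energy, then into carbon."""
    energy_kwh = CPU_TDP_WATTS * cpu_utilisation * duration_hours / 1000.0
    return energy_kwh * GRID_G_CO2_PER_KWH
```

Half-loaded CPU for two hours yields 0.065 kWh, or 26 g of CO2 under these assumed coefficients; the framework's value is in making each such coefficient explicit and auditable.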
Apache HBase is an open-source non-relational distributed database with multiple components such as ZooKeeper, JournalNodes, HMaster, NameNodes, DataNodes and RegionServers. Managing independent clusters for each use case is operationally heavy and leads to sub-optimal utilization of hardware. Hence, many organizations need a consolidated, managed, multi-tenant HBase cluster with stronger isolation guarantees.
In this talk, we will describe how we approached this problem, the tradeoffs we made, and how we run large-scale multi-tenant HBase clusters with strict isolation guarantees.
This session explores Fineract’s impact on banking transformation in fintech. It analyzes motivators driving core banking system changes, addressing challenges and innovative solutions.
From a client-focused view, it details how Fineract addresses banking sector needs, emphasizing adaptability and strategic advantages globally.
Real success cases and their metrics will demonstrate Fineract’s positive influence, driving innovation across financial landscapes. It also discusses regional fintech challenges and the potential solutions with Fineract as a fundamental piece.
When I started as the Instaclustr Technology Evangelist 7 years ago, I already had a background in computer science R&D and thought I knew a few things about architecting complex distributed systems. But it was still challenging to learn multiple new Apache (and other) Big Data technologies and build and scale realistic demonstration applications for domains such as IoT/logistics, fintech, anomaly detection, geospatial data, data pipelines and a drone delivery application - with streaming machine learning.
With more than 300 ASF projects being built thousands of times by developers and CI machines every day, making informed decisions about where to put the attention to accelerate build and test feedback cycles and increase the stability of the build process requires deep and holistic build data from which actionable insights can be derived. You will learn how Develocity aggregates the build data captured from dozens of Apache projects and >30k builds every week, surfacing surprising and interesting insights about how these projects are built and how the building of the software can be improved.
In the evolving landscape of data platforms, the decoupling of compute and storage has led to the emergence of open data systems free from vendor constraints. However, this shift towards “modularity” brings its own set of challenges. The intricate task of establishing effective access controls within table-format architectures proves to be complex. Despite data residing in the cloud and theoretically accessible from anywhere, the existing friction impedes seamless accessibility.
Enter Whitefox: an open-source initiative inspired by the brilliant principles of Delta-Sharing.
Uncover the pivotal role of a Data Science Product Manager as they conduct a data-driven symphony in a high-volume Fintech environment.
In the world of product management, the role of a Data Science Product Manager stands out as a conductor orchestrating a symphony of insights. Join me in this session as I share firsthand experiences from my journey as a Data Science Product Manager at PayPal, delving into the challenges, successes, and failures that have shaped my approach to leading products in a data-rich environment.
With great data comes great responsibilities! Companies of every scale face issues of managing huge amounts of data spread across various platforms, databases, and applications.
Data federation offers a solution to this problem by integrating and accessing data from various data sources without the need for complex ETL processes or data duplication.
This session will delve into the following key aspects of data federation:
Introduction to Data Federation:
The problem today?
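The core idea of data federation, answering one query from several sources without first copying everything into a warehouse, can be shown with a toy example. The source names and schemas below are illustrative assumptions.

```python
# Toy federation: a single "query" spans two in-memory sources and joins
# them at read time, with no ETL copy into a central store.
crm = [{"id": 1, "name": "Ada"}, {"id": 2, "name": "Grace"}]
billing = [{"customer_id": 1, "amount": 120.0},
           {"customer_id": 1, "amount": 30.0}]

def spend_per_customer():
    """Join CRM names with billing totals across both sources."""
    totals = {}
    for row in billing:
        key = row["customer_id"]
        totals[key] = totals.get(key, 0.0) + row["amount"]
    return {c["name"]: totals.get(c["id"], 0.0) for c in crm}
```

A federation engine does this join across live databases and APIs instead of Python lists, but the shape of the problem, resolving one logical query against many physical sources, is the same.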
One of the mainstays of the open source ecosystem are community events. Open Source Summit, All Things Open, Community Over Code… all examples of community events with vitality and influence within open source. But unlike more commercially focused events, community events are not as simple to measure in terms of benefits to organizations that participate. Without sales leads or conversions, how does a commercial organization measure the gains of participation? And for community projects, what’s the return on investment in running a booth or giving talks at such events?
In this session, we will explore the potential of migrating from VMware to Apache CloudStack with KVM. VMware vSphere is a robust cloud infrastructure and management solution that combines vSphere and vRealize Suite, providing automation and operations capabilities for traditional and modern infrastructure and apps. However, the transition to Apache CloudStack can offer enhanced profitability and competitiveness.
We will delve into the benefits of Apache CloudStack, including its cost-effectiveness and open-source nature, and discuss how a gradual migration from VMware vCloud can reduce ownership costs, increase profitability, and enhance competitiveness.
The Digital Public Infrastructure movement has been gaining momentum globally as governments move to DPI-based approaches to create exponential societal outcomes within and across sectors. DPI is composed of open, interoperable technology with transparent, accountable, and participatory governance frameworks to unlock innovation and value at scale. This session will introduce how Apache projects like Fineract recognized as Digital Public Goods are having transformative impact on achieving SDGs.
Through presentation of the work Mifos has been undertaking over the past 12 months we will show how capabilities have been enhanced in Payment Hub EE combined with the power of Fineract to cover new use cases of P2G, Voucher Management and Account Mapping.
This session will introduce a platform created to bridge the existing gaps in data management while removing some of the complexities in existing Big Data ecosystem. The platform is built around a comprehensive data model describing structured entities and their relations. The model is consistently applied across three abstract types of storages - streaming (e.g. Apache Kafka, Google Cloud PubSub), batch (e.g. Hadoop HDFS, S3, Google Cloud Storage) and random-access (e.
by Daniel Augusto Veronezi Salvador, Bryan Lima, João Jandre Paraquetti & Rafael Weingärtner
Track: CloudStack
Room: Melody
Apache CloudStack (ACS) is a solid option among known cloud orchestration systems, being on the same level as OpenStack, Azure Stack, and others. All of them address the basic needs to create and run a private cloud system; however, ACS’s users have to adopt external solutions for rating/billing the resources consumption, which is native in the other orchestration tools (e.g. OpenStack). This presentation will address the design and efforts of the ACS community to implement a native rating feature that will allow more flexibility and reduce the need for external systems.
Apache Fineract has a wide range of built-in features, but most companies that integrate Fineract into their applications and services still require some customization of existing functionality or add new features. The usual approach is to fork the upstream project on GitHub and start editing the original code right away. This approach has a couple of drawbacks; most notably, after a while of development the customization gets so complex that pulling changes from the upstream repository makes Git conflicts more likely and contributions back to the upstream project very difficult.
In this talk, I’ll walk you through the tricks and best practices to take your data pipeline game to the next level. No boring theory here - we’ll be talking real-world use cases.
We will explore the patterns for data pipelines with Airflow+Spark, Airflow+DBT and Airflow+Polars, how to avoid dependency management on Airflow, and how to reuse DAG templates across your organization.
We will also define the fundamental concepts of a data pipeline, from data lineage, data observability, metadata, data quality and data auditing, to how to integrate them into a data pipeline.
Apache Camel leads a seamless transition, taking control of 1000+ interfaces from Oracle SOA Suite.
Over the last two years, we have driven forward the integration of all retail systems from a centralised and proprietary system into a microservice-oriented architecture based on Apache Camel and Openshift.
The previously centralised gateways are now independent interfaces.
The challenge here was to lift the countless proprietary implementations to a system that is open to all.
CloudStack recently introduced a few hypervisor migration features, to help cloud operators migrate existing VM workloads into CloudStack. In this session, we are going to see how you can migrate instances from external KVM hosts to KVM hosts managed by CloudStack. Also, we are going to see how we can quickly deploy an instance from a previously prepared qcow2 image.
Since the first repayment strategy got introduced, many followed, but there was one thing common in them:
They were hard coding the allocation rules for each transaction type.
By introducing the “Advanced payment allocation” as part of the 1.9.0 release, the idea was to have a repayment strategy capable of:
Supporting dynamic configuration of the allocation rules for transaction types
Supporting configuration of more fine-grained allocation rules for future installments
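A dynamically configured allocation rule boils down to pouring a payment into charge buckets in a configured order. The bucket names and rule shape below are illustrative, not Fineract's actual configuration schema.

```python
# Sketch of a configurable repayment allocation: the payment fills each
# due bucket in the configured order until the money runs out.
def allocate(payment, due, order):
    """Split a payment across due amounts, following the configured order."""
    allocation = {}
    remaining = payment
    for bucket in order:
        paid = min(remaining, due.get(bucket, 0.0))
        allocation[bucket] = paid
        remaining -= paid
    return allocation

result = allocate(
    100.0,
    {"penalty": 10.0, "fee": 20.0, "interest": 30.0, "principal": 100.0},
    ["penalty", "fee", "interest", "principal"],
)
```

Swapping the `order` list changes the strategy without touching code, which is exactly what hard-coded allocation rules per transaction type could not do.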
Geospatial data are ubiquitous, but the difficulty of handling them accurately is often underestimated. Various projects implement their own routines for performing geospatial operations, but not always with awareness of the pitfalls of simple approaches. This talk will present some of the difficulties in mapping the “real world” to digital data. Then we will present some international standards published jointly by the Open Geospatial Consortium (OGC) and the International Organization for Standardization (ISO).
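One classic pitfall of the simple approaches mentioned above is treating latitude and longitude as plane coordinates. A great-circle (haversine) distance, sketched here on a spherical Earth model, shows why: a degree of longitude shrinks as latitude grows.

```python
import math

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres on a spherical Earth model.

    Naive Euclidean distance on raw degrees is badly wrong away from
    the equator, where meridians converge.
    """
    r = 6371.0  # mean Earth radius in km
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))
```

One degree of longitude spans about 111 km at the equator but only about 56 km at 60° latitude, a factor a flat-plane formula silently ignores (and the spherical model itself is still only an approximation of the standards-defined ellipsoids).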
Apache Camel has been the proven Swiss Army knife of integration for years. In today’s world of workloads moving to the cloud, the need for disparate systems to communicate remains greater than ever. This context makes a Kubernetes Java stack like Quarkus a good fit for implementing Camel routes.
In this session, attendees can first expect a quick reminder of Camel Quarkus basics. Beyond that, some useful day-to-day features will be presented via concrete examples.
by João Jandre Paraquetti, Daniel Augusto Veronezi Salvador, Bryan Lima & Rafael Weingärtner
Track: CloudStack
Room: Melody
Apache CloudStack (ACS) and KVM are a combination that many organizations decided to adopt. KVM is a widely used hypervisor with a vibrant community and support in different operating system distributions. While developing the KVM plugin functionalities, one normally tries to make use of the full potential of the hypervisor; however, while Libvirt, the toolkit used by ACS to manage KVM VMs, already supports native incremental snapshots, every volume snapshot/backup taken with ACS is a full snapshot/backup.
In this presentation we delve into infrastructure optimization options for supporting the scalability of Fineract.
Key highlights of the session include:
Performance testing: Exploring the newly-introduced capabilities of Fineract that enable drilling down to performance bottlenecks during development and in production.
Performance improvements: Showing infrastructure and configuration changes that can improve Fineract’s response times and throughput under high-load scenarios.
Scalability improvements: Presenting improvements on Fineract’s scalability capabilities, focusing on infrastructure-based scaling velocity improvements.
The session will start by covering the latest developments made in Hive-Iceberg, followed by an overview of the work done to seamlessly integrate Hive and Iceberg, along with a deep dive into the various cool features supported by Hive-Iceberg, ranging from statistics, branching and tagging, compactions and concurrency to much more.
Apache Camel is the leading open-source integration framework that simplifies the integration of various systems and applications. There exists a comprehensive set of Tooling specifically designed to empower Camel developers in their work with Apache Camel within VS Code. These tools facilitate a seamless and efficient development experience, offering robust support and functionalities tailored to the needs of Camel developers.
In my session I would like to rely on the Extension Pack for Apache Camel which contains a set of specific extensions for Camel but also leverages the VS Code ecosystem.
In this session Wei will present how CloudStack 4.19 adds the capability to easily and quickly perform a light-touch integration of networking appliances with Apache CloudStack, allowing for operators and end users to offer a broader range of networking services while empowering end-users to effortlessly deploy their own virtualized network functions (VNFs).
Q&A is one of the most effective ways to obtain knowledge, build connections, and create interaction. In open-source communities, Q&A is particularly crucial. It not only provides a platform for users and developers to collaboratively tackle technical issues and clarify uncertainties but also enhances the sharing and circulation of knowledge. By helping each other in resolving issues, community members forge stronger bonds and jointly advance their projects. Additionally, a robust Q&A system attracts new members, injecting fresh perspectives and energy into the community.
This session explores the integrated use of Apache Toree, YuniKorn, Spark, and Airflow to create efficient, scalable data pipelines. We will start by discussing how Apache Toree provides an interactive analysis environment with Spark via Jupyter Notebook. Then, we’ll discuss using Apache YuniKorn to manage and schedule these computational resources, ensuring system efficiency. Central to our talk, we’ll delve into the role of Apache Spark in large-scale data processing, highlighting its integration with Toree and YuniKorn.
Collaborative governance in software is challenging. This presentation focuses on stakeholder participation which seems limited to those with the technical acumen, tooling expertise, and positions of influence. Yet, evidence shows that great collaboration is dependent on quality divergent thinking balanced with quality convergent thinking. This presentation lays out a strategic framework that curates broader participation by leveraging a landscape of networks and communication channels.
Governance in software development tends to exclude valuable insights from individuals outside the technical sphere.
Apache CloudStack integrates with two major SDN solutions: Tungsten Fabric (OpenSDN) for KVM environments and NSX for VMware ESX environments. In this talk we’ll explore how these integrations were implemented, how to set up ACS Zones with these SDNs, and what capabilities they bring to ACS.
to submit patches to a podling?
to release code to the public?
to maintain trademarks for a podling?
to become a committer on a podling?
This talk explains the common barriers people and projects face in accomplishing their objectives. It explains why The ASF has:
licensing requirements for code submissions and releases,
signing and checksums, download protocols,
voting requirements for releases and project membership,
trademark requirements for web sites and documentation.
Reading file formats efficiently is a crucial part of big data systems - in selective scans data is often only big before hitting the first filter and becomes manageable during the rest of the processing. The talk describes this early stage of query execution in Apache Impala, from reading the bytes of Parquet files on the filesystem to applying predicates and runtime filters on individual rows.
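The skipping idea behind that early stage can be sketched in a few lines. This is a toy Python illustration of min/max statistics pruning, the technique that lets a scanner discard whole row groups before touching individual rows; the names are made up and this is not Impala’s actual code:

```python
# Hypothetical sketch of min/max row-group pruning: skip whole row
# groups whose statistics cannot satisfy the predicate, before any
# per-row work happens. Names are illustrative, not Impala internals.
from dataclasses import dataclass, field

@dataclass
class RowGroup:
    min_val: int
    max_val: int
    rows: list = field(default_factory=list)

def prune(row_groups, lo, hi):
    """Keep only row groups whose [min, max] range can overlap
    the predicate range lo <= value <= hi."""
    return [rg for rg in row_groups
            if rg.max_val >= lo and rg.min_val <= hi]

groups = [
    RowGroup(0, 9, list(range(10))),
    RowGroup(10, 19, list(range(10, 20))),
    RowGroup(20, 29, list(range(20, 30))),
]

# Predicate: value between 12 and 15 -- only the middle group survives,
# so only 10 of 30 rows ever reach the per-row filter stage.
survivors = prune(groups, 12, 15)
print(len(survivors))  # 1
```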
Apache Impala is a distributed massively parallel analytic query engine written in C++ and Java.
In this session, I share best practices for creating bar-raising documentation that guides users through Figma and GitHub templates. To scale best practices in UX research, designers of open source software create design artifacts that help software builders use and improve on the open source code and curated experience offerings. In this talk, I offer examples of how OpenSearch creates a research process that scales, how it documents that process, and how it creates templates that designers and developers in the open source community can use when developing experiences for their users.
Software has matured and is now an integral, key part of society, its infrastructure and economy. Yet, by and large, the industry’s stance on security, reliability and preventing data leaks has fallen way behind. We’re regularly front-page news. So, like all important engineering industries before it, that means politicians all over the world have started to care. And are introducing software regulation.
Europe leads that pack with the, now final, Cyber Resilience Act and the Product Liability Directive.
The path to successful progression through the ranks of an open-source community remains unclear. Historically, the quality and quantity of one’s technical skills have been essential components in progressing through the ranks in OSS communities. Because participants conduct much of this work in coding repositories, the demonstration of technical skills drives outcomes. However, given that individuals do not typically meet face to face, as they would in a conventional organisational setting, various on-line impression management techniques such as self-promotion (i.
The Asynchronous Decision Making techniques commonly used in open source projects enable efficient remote collaboration in teams which have no boss, no schedule, and often no cultural consistency, yet produce world-changing software.
These very efficient collaboration techniques can even work without computers and apply to most types of projects, not just software development.
This talk describes the key elements and tools of the Asynchronous Decision Making process, based on more than twenty-five years of experience in Open Source projects, as well as examples from federated governments, which, interestingly, work in a similar way.
One of the primary challenges of data ingestion is the tradeoff between the latency of data availability for the downstream systems and the extent to which data is optimised for efficient reading. When ingesting continuous incoming data streams with low latency, Apache Flink is a data processing engine that shines. Apache Iceberg is one of the most popular table formats for large tables. To get the best of both worlds, and continuously ingest data and see near real-time changes to tables queried by various engines, tight integration is needed between these two Apache projects.
In this insightful presentation, Aliaksandr will unveil four ingenious tricks to maximize your Apache Airflow experience in the realm of data engineering. Starting with the power of leveraging CSV files to effortlessly create versatile DAGs, Aliaksandr will demonstrate how this flexibility can streamline your pipeline development process. Moving forward, the audience will learn how Google Sheets can be harnessed as a dynamic tool for DAG creation, opening up opportunities for collaboration among team members of varying Airflow proficiency levels.
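The CSV-driven approach boils down to turning rows into task specifications that a DAG factory can consume. A minimal sketch using only the standard library (Airflow itself is deliberately left out, and the column names here are assumptions, not a documented Airflow feature):

```python
import csv
import io

# Hypothetical CSV describing a pipeline: one row per task, with the
# operator type and an optional upstream dependency.
csv_text = """task_id,operator,upstream
extract,BashOperator,
transform,PythonOperator,extract
load,PythonOperator,transform
"""

def build_dag_spec(text):
    """Parse CSV rows into a {task_id: {...}} spec that a DAG factory
    function could turn into real Airflow operators and dependencies."""
    spec = {}
    for row in csv.DictReader(io.StringIO(text)):
        spec[row["task_id"]] = {
            "operator": row["operator"],
            "upstream": [row["upstream"]] if row["upstream"] else [],
        }
    return spec

spec = build_dag_spec(csv_text)
print(spec["transform"]["upstream"])  # ['extract']
```

The same parsing step works unchanged if the rows come from a Google Sheets export instead of a local file, which is what makes the pattern accessible to team members who never touch DAG code.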
As HTTP/3 looks ready, we will review where we are with it in our servers.
The “old” HTTP/2 protocol and the corresponding TLS/SSL support are common to Traffic Server, HTTP Server and Tomcat.
The presentation will briefly explain the new protocol and look at the different implementations of it.
Then the state of HTTP/3 in our three servers, and how to enable it in each of them, will be presented.
A small demo supporting HTTP/3 will be run.
For those of us who already know how important open source is, it can be challenging to persuasively make the case to management, because we assume that everyone already knows the basics. This can work against us, confusing our audience and making us come across as condescending or concerned about irrelevant lofty philosophical points.
In this talk, we take it back to the basics. What does management actually need to know about open source, why it matters, and how to make decisions about consuming open source, contributing to open source, and open sourcing company code?
For over a decade, Apache ZooKeeper has played a crucial role in maintaining configuration information and providing synchronization within distributed systems. Its unique ability to provide these features made it the de facto standard for distributed systems within the Apache community.
Despite its prolific adoption, there is an emerging trend toward eliminating the dependency on ZooKeeper altogether and replacing it with an alternative technology. The most notable example is the KRaft subproject within the Apache Kafka community.
Data enrichment is a critical step in stream processing. Real-time enrichment of streaming data with contextual information adds missing information, improves accuracy, increases trustworthiness, and facilitates better decision-making. Contextual data can be static or dynamic and obtained in various ways - APIs, databases, files and even as a stream. While there are multiple design patterns to perform data enrichment, it is not always obvious when one pattern is preferred over the other.
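The simplest of those patterns, a lookup against static reference data, can be sketched in plain Python (the field names and the in-memory context table are illustrative; in practice the context would come from an API, database, file, or stream):

```python
# Minimal sketch of the reference-data lookup enrichment pattern:
# each streaming event is merged with its matching contextual record.
context = {  # static contextual data, e.g. preloaded from a database
    "sensor-1": {"location": "Berlin"},
    "sensor-2": {"location": "Madrid"},
}

def enrich(event, lookup):
    """Return the event merged with its contextual record, if any.
    Events with no match pass through unchanged."""
    extra = lookup.get(event["sensor_id"], {})
    return {**event, **extra}

events = [{"sensor_id": "sensor-1", "temp": 21.5}]
enriched = [enrich(e, context) for e in events]
print(enriched[0]["location"])  # Berlin
```

When the contextual data is itself dynamic, this static dict gives way to a cache with expiry or a stream-stream join, which is exactly the kind of trade-off the talk compares.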
Apache Tomcat implements the Jakarta Servlet, Jakarta Pages, Jakarta Expression Language, Jakarta WebSocket and Jakarta Authentication specifications. Jakarta EE 11 is due for release in the first half of 2024 with the first stable Tomcat 11 release expected shortly afterwards.
This session will look at the changes in Jakarta EE 11 for the specifications that Tomcat implements and what these changes mean for developers looking to deploy their applications on Tomcat 11.
Open source has grown to the point where people from many different tech career paths can enhance projects with their skills, and it provides jobs for those interested in working with open source. Open source contribution programs give interested individuals a path to becoming professionals.
Outreachy is a paid, remote open source internship program that empowers talent, grows their skills, and prepares them for career growth. Outreachy provides internships to people subject to systemic bias and impacted by underrepresentation in the technical industry where they live.
Apache Impala is a distributed massively parallel query engine designed for high-performance querying of large-scale data. There has been a long list of new features recently around supporting Apache Iceberg tables such as reading, writing, time traveling, and so on. However, in a big data environment it is also a must to be performant. Since Impala has been designed to be fast, it has its own way of reading Iceberg tables.
Managing complex applications such as data processing systems on Kubernetes is a formidable challenge even for the most seasoned engineers. Whether you want to build applications that operate themselves or provision infrastructure from Java code, Kubernetes Operators are the way to go.
The Java Operator SDK is a production-ready framework that makes implementing Kubernetes Operators in Java easy. We will give you a run-down on the basics of operators and implementing one from scratch in Java and why this library may be the right choice for your project.
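At the core of every operator is a reconcile loop: observe the actual state of a resource, compare it with the desired state, and act on the difference. A language-neutral sketch of that idea in Python (the Java Operator SDK provides the real machinery around this; the names and the replica-count example here are purely illustrative):

```python
# Illustrative sketch of the reconcile step at the heart of a
# Kubernetes operator: desired state vs. observed state -> action.
def reconcile(desired_replicas, observed_replicas):
    """Return the corrective action a controller would take to
    converge the observed state toward the desired state."""
    if observed_replicas < desired_replicas:
        return ("scale_up", desired_replicas - observed_replicas)
    if observed_replicas > desired_replicas:
        return ("scale_down", observed_replicas - desired_replicas)
    return ("noop", 0)

# The framework invokes this repeatedly on resource events, so the
# function must be idempotent: running it again after convergence
# must do nothing.
print(reconcile(3, 1))  # ('scale_up', 2)
print(reconcile(3, 3))  # ('noop', 0)
```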
The WebAssembly (Wasm) plugin for Apache Traffic Server (ATS) allows WebAssembly modules following the “proxy-wasm” specification to be run on ATS.
The talk will begin by introducing the background and history of plugins and programmability in ATS. I will go over the shortcomings of the current offerings and then introduce the Wasm plugin as an alternative solution. I will then talk about the “proxy-wasm” specification, which describes the support of WebAssembly modules for proxy server software.
There are millions of open source projects people can use and contribute to. Why yours?
Developing an open source project that is valuable to many and widely accepted in an industry requires a lot of care and feeding – and more than just code. Whether your project is brand new or has been around for decades, you need to explain why other people should take the time to learn, use, and potentially contribute to it.
Instaclustr (now part of NetApp) manages hundreds of Apache Kafka clusters of many different sizes, for a variety of use cases and customers. For the last 7 years I’ve been focused outwardly on exploring Kafka application development challenges, but recently I decided to look inward and see what I could discover about the performance, scalability and resource characteristics of the Kafka clusters themselves. Using a suite of performance engineering techniques, I will reveal some surprising discoveries about cosmic Kafka mysteries in our data centres, related to: cluster sizes and distribution (using Zipf’s Law), horizontal vs.
Apache Arrow has become a de-facto standard for representing large datasets and a very useful tool in any modern data engineering stack. While it allows different technologies to better communicate and share data, the ecosystem around it enables much more!
In this talk we’ll cover how Arrow interacts with Apache Spark - from allowing PySpark to interoperate with other Python data libraries, to building data-driven applications with the recent addition of Spark Connect.
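The core idea Arrow standardizes, a columnar rather than row-oriented layout, can be illustrated in plain Python. This is only the concept, not Arrow’s actual memory representation, and the pivot function below is a made-up helper:

```python
# Toy illustration of row-oriented vs column-oriented layout -- the
# idea behind Arrow's columnar format. (Arrow's real representation
# uses typed contiguous buffers, not Python lists.)
rows = [
    {"id": 1, "score": 5},
    {"id": 2, "score": 7},
    {"id": 3, "score": 9},
]

def to_columns(records):
    """Pivot a list of row dicts into one contiguous list per column."""
    return {key: [r[key] for r in records] for key in records[0]}

columns = to_columns(rows)
# An analytic operation now touches a single contiguous column
# instead of walking every row object:
print(sum(columns["score"]))  # 21
```

Because every engine that speaks Arrow agrees on the column layout, data can move between PySpark, pandas-style libraries, and Spark Connect clients without per-row conversion, which is the interoperability point the talk develops.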
This session explores the use of the FFM API from Java 22 to leverage native library capabilities, in the context of Apache Tomcat. OpenSSL is here being used to provide support for TLS through the JSSE API, without the need to use the tomcat-native wrapper library. Exploratory design of QUIC and HTTP/3 support from OpenSSL 3.3+ is also discussed.
How do you explain your Apache project to people who don’t even know how to download apps onto their phones – and still manage to get them excited about what you’re working on? It’s simple: pretend you’re talking about a movie.
The problem isn’t the project, but how we’ve been talking about it. And now we’re going to fix that.
In this talk, discover how to completely change the narrative about discussing Apache and open source by not actually talking about open source or Apache…but instead using the same principles that marketers use to create excitement around a movie.
The importance of security is increasing sharply nowadays, and this must be reflected in open source projects. Apache Spark and Apache Flink are two of the most widely used Big Data frameworks for data processing. Both offer dozens of external service connectors in which authentication plays an essential role. Each external system handles authentication in a different way, but a common framework can be provided to ease the life of developers.
Data quality plays a crucial role in data engineering to enable efficient and insightful data pipelines at scale. In this session, we will leverage Apache Iceberg as the scalable table format with ACID guarantees, Apache Toree’s interactive computation capabilities, and orchestrate the automated data workflow on Apache Airflow. We will start by talking about how Iceberg can use the column-level statistics stored in its metadata for efficient and reliable data quality validation.
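The stats-based validation idea can be sketched without any Iceberg dependency. This is a hypothetical illustration (the function names are made up): precomputed column statistics, like those Iceberg keeps in table metadata, let a pipeline reject a bad batch without scanning every row:

```python
# Hypothetical sketch of data quality validation against column-level
# statistics rather than raw rows.
def column_stats(values):
    """Compute the min/max/null-count stats a table format might
    already store in its metadata for one column."""
    non_null = [v for v in values if v is not None]
    return {
        "min": min(non_null),
        "max": max(non_null),
        "nulls": len(values) - len(non_null),
    }

def validate(stats, *, max_allowed, allow_nulls=False):
    """Cheap validation using only precomputed statistics."""
    if not allow_nulls and stats["nulls"] > 0:
        return False
    return stats["max"] <= max_allowed

stats = column_stats([10, 42, 7])
print(validate(stats, max_allowed=100))  # True
```

In a real pipeline an Airflow task would run such a check between ingestion and publication, failing fast on metadata instead of re-reading the data.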
This talk looks at using Groovy for a well-known data-science problem: classifying Iris flowers. It involves solving this problem using the latest deep-learning neural network technologies and has the option of using GraalVM for blazing speed. Groovy provides a data-science environment with the simplicity of Python but using Java-like syntax and your favourite JVM technologies.
In this presentation, we will delve into the important role that Apache Airflow plays in the Outreachy program and its broader influence in closing inclusion gaps within the open source community. We will explore the success stories and transformative experiences of Outreachy contributors, emphasizing how this open source project has created opportunities for people from diverse backgrounds. Our discussion will focus on the power of open source initiatives like Apache Airflow to foster a more inclusive and accessible technology ecosystem.
As machine learning (ML) models increasingly become integral components of modern applications, there is a growing need to deploy them in real-time environments. Apache Spark is a popular open-source framework for large-scale data processing that supports ML tasks, while Kubernetes provides a powerful platform for container orchestration and deployment. However, combining Spark and Kubernetes poses significant challenges, especially when it comes to achieving low latency and high scalability. In this session, we explore optimal approaches for real-time ML with Apache Spark on Kubernetes, including best practices and strategies for efficient model training, deployment, and serving.
Traditionally, dealing with real-time data pipelines has involved significant overhead, even for straightforward tasks like data transformation or masking. However, in this talk, we’ll venture into the dynamic realm of WebAssembly (WASM) and discover how it can revolutionize the creation of stateless streaming pipelines within a Kafka (Redpanda) broker. These pipelines are adept at managing low-latency, high-data-volume scenarios.
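A stateless transform of the kind such pipelines run, here a simple masking step, can be sketched in plain Python (the Wasm compilation and broker embedding are out of scope here, and the record shape is an assumption):

```python
# Minimal sketch of a stateless, record-at-a-time transform of the
# kind one could compile to Wasm and run inside the broker: mask an
# email field. No state is kept between records, which is what makes
# the function trivially embeddable and parallelizable.
def mask_email(record):
    user, _, domain = record["email"].partition("@")
    return {**record, "email": user[0] + "***@" + domain}

out = mask_email({"id": 7, "email": "alice@example.com"})
print(out["email"])  # a***@example.com
```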
In this session, Sergio del Amo introduces the Micronaut® framework and demonstrates how the Framework’s unique compile-time approach enables the development of ultra-lightweight Java applications.
Compelling aspects of the Micronaut framework include:
Develop applications with Java, Kotlin, or Apache Groovy
Sub-second startup time
Small processes that can run in as little as 10 MB of JVM heap
No runtime reflection
Dependency injection and AOP
Reflection-free serialization
A database access toolkit that uses ahead-of-time (AoT) compilation to pre-compute queries for repository interfaces.
It takes a village to run an open source project successfully. A village is usually run by its citizens and governed by some elected officials. In open source we call the citizens “users” and the people in charge of a project “maintainers”. To understand the health and sustainability of a project we should take a closer look at the community and not necessarily the code in the first place.
To understand their demographics a village can run a census.
Apache Flink is a powerful open-source stream processing framework that supports unified batch and streaming processing. With its SQL support, Flink has become even more accessible to data analysts and developers who are familiar with SQL.
In this talk, we will provide a short introduction to Apache Flink and explain how you can leverage SQL under the hood. We will cover some of the SQL-specific features of Flink, such as dynamic tables, streaming SQL and support for stateful operations like windowing and pattern recognition.
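What a streaming engine does for a tumbling-window aggregation can be sketched in a few lines. This is a toy model of the bucketing only, not Flink’s implementation (it ignores watermarks, state backends, and out-of-order handling):

```python
# Toy sketch of a tumbling-window count (the idea behind a windowed
# GROUP BY in streaming SQL): assign each event to the window that
# contains its timestamp, then aggregate per (window, key) bucket.
from collections import defaultdict

def tumbling_count(events, size):
    """events: (timestamp, key) pairs -> {(window_start, key): count}"""
    counts = defaultdict(int)
    for ts, key in events:
        window_start = (ts // size) * size  # align to window boundary
        counts[(window_start, key)] += 1
    return dict(counts)

# Timestamps in seconds, 10-second tumbling windows.
events = [(1, "a"), (4, "a"), (12, "a"), (13, "b")]
print(tumbling_count(events, 10))
```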
Kafka Streams, ksqlDB or Flink SQL are popular processing engines that enable us to run SQL queries on top of streaming data. Isn’t it fascinating that we can run SQL queries on top of streaming data as if they were relational tables, or convert a table into a stream of changelog events? This is known as the stream-table duality.
In this talk we will try to understand how it works under the hood using Flink SQL, Kafka connector with Debezium JSON/Avro format.
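One half of the duality, folding a changelog stream back into the table it represents, can be sketched like this (the event tuples are a simplified stand-in for Debezium-style change records):

```python
# Sketch of the stream-table duality: replaying a changelog stream
# of insert/update/delete events yields the current table state.
def materialize(changelog):
    """Fold a sequence of (op, key, value) change events into a table."""
    table = {}
    for op, key, value in changelog:
        if op in ("insert", "update"):
            table[key] = value       # upsert
        elif op == "delete":
            table.pop(key, None)     # tombstone
    return table

changelog = [
    ("insert", "user1", {"name": "Ada"}),
    ("update", "user1", {"name": "Ada L."}),
    ("insert", "user2", {"name": "Grace"}),
    ("delete", "user2", None),
]
print(materialize(changelog))  # {'user1': {'name': 'Ada L.'}}
```

The other half of the duality runs in reverse: every mutation of the table can be emitted as one more event on the changelog, which is how engines like Flink SQL keep the two views consistent.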
Calling all developers with a penchant for fine whiskey! Join Dr. Paul King, VP at Apache Groovy, on a quest to analyze whiskeys produced by the world’s top 86 distilleries to identify the perfect single-malt Scotch.
How will he perform this analysis? By using the traditional and distributed K-means clustering algorithm from various Apache projects. Bottoms up!
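The clustering idea itself fits in a few lines. Here is a minimal one-dimensional K-means sketch in Python, purely illustrative of the algorithm, not the Groovy code or the distributed Apache implementations the talk uses:

```python
# Minimal 1-D K-means: repeatedly (1) assign each point to its nearest
# centroid, (2) move each centroid to the mean of its assigned points.
def kmeans_1d(points, centroids, iterations=10):
    for _ in range(iterations):
        clusters = {c: [] for c in centroids}
        for p in points:
            nearest = min(centroids, key=lambda c: abs(p - c))
            clusters[nearest].append(p)
        # Recompute each centroid; keep it in place if its cluster is empty.
        centroids = [sum(v) / len(v) if v else c
                     for c, v in clusters.items()]
    return sorted(centroids)

# Two obvious groups of "flavour scores"; the centroids find them.
points = [1, 2, 3, 8, 9, 10]
print(kmeans_1d(points, [0, 10]))  # [2.0, 9.0]
```

The real analysis works the same way, just with multi-dimensional flavour vectors per distillery and a distance function over all dimensions.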